A main objective of analyzing peptide array-based binding experiments is to uncover the relationship between peptide sequence and binding outcome. Because of the limitations of the available peptide array technologies, few attempts have been made to construct qualitative or quantitative models of the relationship between peptide sequence and binding strength in peptide microarray-based binding studies. There is, however, a long history of similar modeling efforts based on low-throughput binding data in the areas of T-cell epitope screening and kinase substrate mapping. The pressing need in peptide array applications and the success of modeling efforts in related fields prompted us to develop SVM-PEPARRAY, a Web-based program that constructs qualitative and quantitative models from peptide microarray binding datasets using support vector machine (SVM) methods. We expect that such modeling will allow researchers to quickly extract sequence-based biological information from improved peptide array binding results and will provide more precise and accurate information about the biological systems investigated.
SVM-PEPARRAY is accessible at http://pepcyber.umn.edu/SVM-PEPARRAY
SVM-PEPARRAY Overview
SVM-PEPARRAY is a Web-based program that constructs qualitative (SVM classification) and quantitative (SVM regression, or SVR) models from user-provided peptide microarray data. The dataset should be a list of pairs, each consisting of a peptide sequence and its binding result. For an SVM classification model, each binding result should be a binary value (1 representing a binder and 0 representing a non-binder). For an SVR model, each binding result should be a real-valued number representing the binding intensity obtained from an on-chip binding experiment. Three kernel options are provided: the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel (Table 1).
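For orientation, the three kernels are commonly written as below. This is the widely used textbook parameterization and is given only as a sketch: the exact forms used by SVM-PEPARRAY, in particular whether the polynomial kernel includes the scaling factor γ and the offset r, are assumptions here. The symbols x and y denote two encoded peptide vectors, n is the polynomial degree, and γ is the kernel width parameter.

```latex
\begin{align}
K_{\text{linear}}(\mathbf{x}, \mathbf{y}) &= \mathbf{x}^{\top}\mathbf{y} \\
K_{\text{poly}}(\mathbf{x}, \mathbf{y})   &= \left(\gamma\,\mathbf{x}^{\top}\mathbf{y} + r\right)^{n} \\
K_{\text{RBF}}(\mathbf{x}, \mathbf{y})    &= \exp\!\left(-\gamma\,\lVert \mathbf{x} - \mathbf{y} \rVert^{2}\right)
\end{align}
```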
Table 1 – The three kernel options implemented in SVM-PEPARRAY
Selecting or Uploading a Microarray Dataset
After logging in, the user should select the option Construct a model. A list of all peptide microarray datasets previously uploaded by the user is displayed. The user can select a dataset from the list or provide a new dataset. In the latter case, the user is prompted to provide a name and (optionally) a textual description of the dataset. The dataset can either be uploaded as a comma-separated values (CSV) file or pasted into the text area provided. The data file (or pasted text) should contain no header row and should include three columns: (a) peptide ID (numerical or textual); (b) peptide sequence; and (c) binding result. The peptide sequences should be aligned across rows, and the character ‘-’ should be used as a placeholder for leading empty positions. The binding results can be either binary values (where 1 represents a binder and 0 represents a non-binder) or real-valued numbers (binding intensities measured in the on-chip experiments). After the user clicks the Submit button, the dataset is checked for errors. If no errors are found, the dataset is accepted and a statistical summary of the dataset is displayed.
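The input format described above can be checked locally before upload. The following is a minimal sketch, not the validation that SVM-PEPARRAY itself performs: the function names `parse_dataset` and `is_binary`, the specific checks, and the example peptides are all hypothetical.

```python
import csv
import io


def parse_dataset(text):
    """Parse header-less CSV text with three columns:
    peptide ID, peptide sequence, binding result."""
    rows = []
    for line_no, record in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not record:
            continue  # skip blank lines
        if len(record) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(record)}")
        pep_id, seq, result = (field.strip() for field in record)
        rows.append((pep_id, seq, float(result)))  # float() rejects non-numeric results
    if not rows:
        raise ValueError("dataset is empty")
    # Aligned sequences all have the same length once '-' placeholders are included.
    lengths = {len(seq) for _, seq, _ in rows}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not aligned; lengths found: {sorted(lengths)}")
    return rows


def is_binary(rows):
    """True if every binding result is 0 or 1 (a qualitative dataset)."""
    return all(value in (0.0, 1.0) for _, _, value in rows)


# Hypothetical example data: '-' pads the leading empty positions.
example = "p1,--AYDEL,1520.5\np2,-KAYPEV,87.0\np3,GSAYEEI,640.2\n"
rows = parse_dataset(example)
print(len(rows), "peptides; binary results:", is_binary(rows))
```

A dataset for which `is_binary` returns False is a quantitative dataset in the sense used in the following sections.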
Constructing a Qualitative SVM Model
To establish a qualitative SVM model (or classification model), the user should click the link Construct a SVM classification model. If the dataset selected by the user is a quantitative dataset (i.e., the binding result column contains real-valued binding intensities), the user is prompted to choose a cutoff value, which is used to discretize the data: binding result values greater than or equal to the cutoff value are converted to 1 (binders), and values less than the cutoff value are converted to 0 (non-binders). After data discretization, the user is prompted to save the dataset before proceeding to the next step.
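The discretization rule can be written in a few lines. This is an illustrative sketch only; the function name `discretize` and the intensity values are hypothetical.

```python
def discretize(intensities, cutoff):
    """Convert binding intensities to binary labels:
    1 (binder) if value >= cutoff, otherwise 0 (non-binder)."""
    return [1 if value >= cutoff else 0 for value in intensities]


# Hypothetical intensities; a value exactly equal to the cutoff counts as a binder.
labels = discretize([1520.5, 87.0, 640.2, 500.0], cutoff=500.0)
print(labels)  # [1, 0, 1, 1]
```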
In the Configure SVM classification model construction page that follows, the user is prompted to choose one of the three kernel options (the linear kernel, the polynomial kernel, or the RBF kernel) and to specify how cross-validation should be done: either N-fold cross-validation (where N is specified by the user) or leave-one-out (LOO) cross-validation. In addition, the user should choose one of the two peptide encoding schemes, "sparse encoding" or "10-factor encoding". After these selections are made, the user clicks the Proceed button to go to the next step.
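As an illustration of the first scheme, sparse encoding is commonly implemented as a one-hot vector of length 20 per residue position. The sketch below follows that common convention and is not taken from the SVM-PEPARRAY source: the residue ordering, the treatment of the ‘-’ placeholder as an all-zero block, and the name `sparse_encode` are assumptions. The 10-factor encoding is not sketched because the factors are not specified here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def sparse_encode(sequence):
    """One-hot encode a peptide: 20 binary features per position.
    The '-' placeholder is assumed to map to an all-zero block."""
    vector = []
    for residue in sequence:
        block = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            block[index] = 1
        elif residue != "-":
            raise ValueError(f"unexpected character {residue!r} in {sequence!r}")
        vector.extend(block)
    return vector


features = sparse_encode("-AY")
print(len(features), sum(features))  # 60 features, 2 of them set
```

Because the sequences in a dataset are aligned to a common length, every peptide yields a feature vector of the same dimension.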
In the Choose SVM classification model parameters page, the user chooses the non-kernel and kernel parameters for model construction. Only one non-kernel parameter, the regularization parameter C, needs to be specified for an SVM classification model. If the user has chosen the polynomial kernel, the degree n of the polynomial function must be specified. If the user has chosen the RBF kernel, the kernel parameter γ must be specified. SVM-PEPARRAY examines the dataset and provides a list of recommended values for each non-kernel or kernel parameter. The user can alter one or more of these values or, if deemed proper, specify a completely different list of values. SVM-PEPARRAY then performs a grid search to find the parameter combination that gives the model with the best performance (evaluated under cross-validation). For example, if the user chooses the RBF kernel and specifies four different values of C and five different values of γ, then 20 (= 4 × 5) SVM models, each constructed with a different parameter combination, are tested. The model that achieves the highest cross-validated accuracy is selected and presented to the user.
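The grid search over four values of C and five values of the RBF kernel parameter can be sketched as follows. This is a schematic illustration, not the SVM-PEPARRAY implementation: `grid_search` is a hypothetical helper, the parameter values are arbitrary, and `toy_score` is a placeholder standing in for the cross-validated accuracy of a trained model.

```python
from itertools import product


def grid_search(c_values, gamma_values, score_fn):
    """Evaluate every (C, gamma) combination and keep the best-scoring one."""
    best = None
    n_tested = 0
    for c, gamma in product(c_values, gamma_values):
        score = score_fn(c, gamma)
        n_tested += 1
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best, n_tested


def toy_score(c, gamma):
    """Placeholder for cross-validated accuracy; peaks at C=10, gamma=0.01."""
    return 1.0 / (1.0 + abs(c - 10) + abs(gamma - 0.01))


c_values = [0.1, 1, 10, 100]                 # four hypothetical values of C
gamma_values = [0.0001, 0.001, 0.01, 0.1, 1]  # five hypothetical values of gamma
best, n_tested = grid_search(c_values, gamma_values, toy_score)
print(n_tested, "models tested; best (score, C, gamma):", best)
```

In practice, `score_fn` would train a model with the given parameters and return its accuracy under the chosen cross-validation scheme, so the running time grows with the product of the list lengths.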
After specifying the model parameter values, the user clicks the Start constructing model button to initiate the model construction process. The user will receive an automatically generated notification email after the model construction is completed.
Constructing a Quantitative SVM Model
To construct a quantitative SVM model (or SVR model), a quantitative dataset (one in which the binding result column contains real-valued binding intensities) must be selected. After the dataset is selected, the Configure SVR model construction page is displayed, where the user specifies the kernel, the cross-validation option, and the peptide encoding scheme.
At the next step, the Choose SVR model parameters page, the user specifies non-kernel and kernel parameter values. SVR models involve two non-kernel parameters: the regularization parameter C and the parameter ε in the ε-insensitive loss function. The kernel parameters involved in SVR model construction are the same as those involved in the construction of SVM classification models. As in SVM classification model construction, SVM-PEPARRAY examines the dataset and provides a list of recommended values for each parameter. These values can be altered by the user as deemed fit. After the model construction job is initiated, a grid search is conducted to find the parameter combination that gives the model with the best performance (evaluated according to the cross-validated R² value). After the model construction is completed and the best model is determined, a notification e-mail is automatically generated and sent to the user.
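The two quantities mentioned above can be illustrated with a short sketch. The ε-insensitive loss is shown in its standard form. For R², the coefficient of determination is assumed; how SVM-PEPARRAY computes its cross-validated R² (for example, as a squared correlation instead) is not stated here. The function names and numbers are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: errors within +/- epsilon cost nothing,
    larger errors cost the amount by which they exceed epsilon."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured intensities and cross-validated predictions.
measured = [1.0, 2.0, 3.0]
predicted = [1.05, 2.5, 2.0]
print(epsilon_insensitive_loss(measured, predicted, epsilon=0.1))
print(r_squared(measured, predicted))
```

A larger ε tolerates more deviation before a sample contributes to the loss, which generally gives a model with fewer support vectors.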
Examining a Newly Constructed Qualitative or Quantitative Model
After receiving an e-mail notification that a model construction job is completed, the user can log into the SVM-PEPARRAY system, choose the option View recently constructed models, and select the model from the list. A comprehensive summary of the model is then displayed, which includes the model construction configurations, the non-kernel and kernel parameters chosen, the performance assessment of the model, the original dataset used to train the model, and the predictions made for each peptide in the original dataset under cross-validation. If satisfied with the model, the user can store it in the SVM-PEPARRAY system. If the model is not satisfactory, the user can discard it and construct a new model with modified configurations.
Making Predictions Using an Established Qualitative or Quantitative Model
When the user chooses the option Make predictions using a stored model, a list of models previously stored by the user is displayed. After selecting the model to use, the user is prompted to provide a list of peptides, either by uploading a CSV file or by pasting the peptide list into the text area displayed. The CSV file or the pasted text should contain no header and should include two columns: (a) peptide ID (numerical or textual) and (b) peptide sequence. After the user clicks the Make predictions button, the predicted binding results are displayed after a brief delay. If the selected model is an SVM classification model, qualitative predictions are made. If the selected model is an SVR model, quantitative predictions are made.
- The performance of an SVM model constructed using a peptide array dataset depends on the interplay of a variety of factors, including the quality of the binding data (which in turn is determined by the peptide synthesis and arraying techniques, the quality of the binding experiment, and the intrinsic affinity or specificity of the peptide–protein binding system), the size of the training dataset, and the similarity between the peptide sequences in the training samples. Although it is possible to construct an SVM model with a training dataset of about 30 samples, a larger dataset (with >100 samples) is often required to achieve a model with acceptable performance. The level of sequence similarity between the peptides in the training dataset influences the constructed model in a less straightforward manner: a high level of sequence similarity in the training dataset results in a model that performs well under cross-validation, but that performance may not extend to untested peptides whose sequences are very different from those used to train the model.
- In our experience, for most datasets the models constructed using the RBF kernel demonstrate the most satisfactory performance. However, there are cases where models constructed with the polynomial kernel or the linear kernel achieve better performance. It is therefore advisable to try different kernel options when constructing models. Similarly, although the "10-factor encoding" scheme often yields models with better performance than the "sparse encoding" scheme, it is sensible to try both encoding schemes. In theory, the way cross-validation is conducted does not directly influence the performance of the model, but it influences how accurately the model performance is assessed. A larger N (in N-fold cross-validation) leads to a more accurate assessment of the model performance. The downside of choosing a larger N is a longer running time for model construction. Generally, 5- or 10-fold cross-validation suffices for most model construction tasks.
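The trade-off between N and running time follows from how N-fold cross-validation partitions the data: each of the N folds is held out once while a model is trained on the rest, so N models are trained per parameter combination. The sketch below shows one common way to generate such splits; the function `k_fold_indices` and its shuffling strategy are assumptions, not the procedure used by SVM-PEPARRAY. Setting the number of folds equal to the number of samples gives LOO cross-validation.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds folds.
    n_folds == n_samples corresponds to leave-one-out cross-validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible splits
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out


# Hypothetical dataset of 10 peptides with 5-fold cross-validation.
for train, test in k_fold_indices(10, 5):
    print("train on", len(train), "peptides; test on", sorted(test))
```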