Verification of a Support Vector Machine Model for Predicting Proteotypic Peptides
The current method of matching spectra from tandem mass spectrometry (MS) to peptide sequences requires searching a large database of all possible peptides encoded by an organism. However, only a subset of these possible peptides, the proteotypic peptides, is consistently and repeatedly identified by MS. Matching spectra against this smaller proteotypic search space increases computational efficiency and improves the accuracy of peptide identification, which in turn increases confidence that a protein has been correctly identified. Building a proteotypic peptide database from experimentally observed peptides is currently labor-intensive; deriving such a database computationally is therefore desirable. Webb-Robertson et al. trained a statistical learning algorithm called a support vector machine (SVM) on Yersinia pestis data to classify a peptide as proteotypic or not proteotypic. Preliminary tests by these authors showed that this SVM accurately predicted proteotypic peptides for two closely related bacterial species, Salmonella typhimurium and Shewanella oneidensis. To test the versatility of the classifier, experimentally generated proteotypic peptide databases were gathered for three bacteria more distantly related to Y. pestis (Pelagibacter ubique, Caulobacter crescentus, and Cyanothece) and for one vertebrate species, Mus musculus (mouse). For each of these species, the proteins with at least four experimentally determined proteotypic peptides were extracted, and all possible peptides for those proteins were classified with the SVM. The results were analyzed in MATLAB to produce a receiver operating characteristic (ROC) curve and the associated area under the curve (AUC), which together describe the sensitivity and specificity of the SVM model; an AUC of 1.0 describes a perfect classifier, whereas a random binary classifier would yield an AUC of 0.5. The average AUC values for Y. pestis, P. ubique, C. crescentus, Cyanothece, and mouse were 0.8351, 0.7442, 0.7622, 0.7455, and 0.7457, respectively. Because all of these values are well above the 0.5 expected of a random classifier, the current SVM classifier accurately predicts proteotypic peptides for diverse bacterial species as well as the mouse. Future research may include retraining SVMs to target a specific protein sample preparation method or species.
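
To illustrate how an ROC curve and its AUC summarize a classifier of this kind, the following is a minimal Python sketch. It is not the analysis code used in this work, which was done in MATLAB. The function names `roc_curve` and `auc`, the label encoding (1 for proteotypic, 0 otherwise), and the example scores are all hypothetical; the scores stand in for SVM outputs in which a higher value means "more likely proteotypic".

```python
def roc_curve(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    labels: 1 for an experimentally observed (proteotypic) peptide, 0 otherwise.
    scores: classifier outputs; higher is assumed to mean "more proteotypic".
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative peptide")

    # Sweep the decision threshold from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Peptides with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical peptides from one protein: two observed, two not observed.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(auc(roc_curve(labels, scores)))
```

In this toy example one non-proteotypic peptide outranks one proteotypic peptide, so the AUC falls between the 0.5 of a random classifier and the 1.0 of a perfect one. If per-protein AUC values are wanted, as the averages reported above suggest, the same two functions could be applied to each protein's peptides separately and the results averaged; that grouping is an assumption about the procedure, not something stated in the text.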