An Advanced Look at De Novo Sequencing
PEAKS performs de novo sequencing directly from the MS/MS data and therefore does not rely on a protein database. It computes the best possible sequence among all possible amino acid combinations.
Analogous approaches have been described, but they proved computationally inefficient and were abandoned.1,2 Instead, PEAKS relies on a sophisticated dynamic programming algorithm to perform the computation efficiently. The mathematical model that PEAKS uses also differs from the graph theory approach. In our approach, PEAKS computes peptides whose ions correspond to as many high-abundance peaks in the spectrum as possible.
Candidate computation is the critical step of de novo sequencing. It generates the 10000 best sequences from all possible combinations of amino acids for a given precursor ion mass; a, b, c, x, y and b/y-17/18 ions are all considered. The basic assumption of our model is that the more high-abundance peaks are matched by the ions of a sequence, the more likely that sequence is the correct one. For each mass value m, the algorithm first computes the reward or penalty for a y (or b) ion having mass m. If there is a peak close to m, the reward is equal to the logarithmic abundance of the peak, multiplied by a factor reflecting the mass error between m and the mass value of the peak, and multiplied by a factor reflecting the co-existence of the x, y-H2O, y-NH3 (or a, c, b-H2O, b-NH3) ions. If there is no peak close to m, the reward is a negative constant value. The problem is then reduced to finding a sequence whose y and b ions maximize the total reward at their mass values.
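The reward rule described above can be sketched roughly as follows. This is only an illustration, not the PEAKS implementation: the function name `ion_reward`, the tolerance, the penalty value, the linear mass-error factor, the additive co-existence bonus, and the use of `log(1 + abundance)` are all assumptions made here for the sake of a runnable example.

```python
import math

def ion_reward(m, peaks, tol=0.5, penalty=-1.0,
               support_offsets=(), support_bonus=0.25):
    """Reward (or penalty) for a y or b ion at mass m.

    peaks: list of (mass, abundance) tuples.
    support_offsets: mass offsets of supporting ions relative to m, for
        example (-18.0106, -17.0265) for water and ammonia losses.
    All numeric defaults are illustrative assumptions.
    """
    # Collect peaks within the tolerance window around m.
    near = [(abs(mz - m), ab) for mz, ab in peaks if abs(mz - m) <= tol]
    if not near:
        # No peak close to m: a negative constant.
        return penalty

    err, abundance = min(near)  # closest peak to m

    # Factor reflecting the mass error (assumed linear: 1 at zero error).
    error_factor = 1.0 - err / tol

    # Factor reflecting co-existing supporting ions (assumed additive bonus).
    supported = sum(
        1 for off in support_offsets
        if any(abs(mz - (m + off)) <= tol for mz, _ in peaks)
    )
    coexist_factor = 1.0 + support_bonus * supported

    # Logarithmic abundance; log1p keeps the value non-negative (assumption).
    return math.log1p(abundance) * error_factor * coexist_factor
```

Calling `ion_reward(m, peaks)` for a mass with no nearby peak returns the constant penalty, so the score is defined for every mass value and not only where peaks were observed.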
Our approach to tabulating the total reward is very different from the spectrum graph model used by previous de novo sequencing software and algorithms. Because the spectrum graph model attempts to find a path connecting the N and C termini, the absence of ions may break such a path and make finding the sequence very difficult. In our approach, however, a reward/penalty score is computed for every possible mass value, regardless of whether a peak exists around that mass value. Therefore, the absence of peaks does not cause notable problems. In addition, the reward/penalty score accounts for many factors, such as the abundance of the peak, the mass errors and the co-existence of other peaks, all of which significantly improve the accuracy of the de novo sequencing results.
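To illustrate why a score defined at every mass value tolerates missing peaks, here is a deliberately simplified dynamic programming sketch. It is not the PEAKS algorithm: it assumes integer (nominal) residue masses, scores only prefix (b-type) fragment masses, returns a single best sequence instead of 10000 candidates, and takes a hypothetical `reward` callable that maps a mass to a reward or penalty.

```python
def best_sequence(total_mass, residue_masses, reward):
    """Find the residue sequence of a given total mass with maximal reward.

    total_mass: integer sum of residue masses (simplifying assumption).
    residue_masses: dict mapping residue letter -> integer mass.
    reward: callable returning a reward or penalty for any prefix mass.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (total_mass + 1)   # best[m]: top score of a prefix of mass m
    back = [None] * (total_mass + 1)      # back[m]: last residue of that prefix
    best[0] = 0.0

    for m in range(1, total_mass + 1):
        for aa, w in residue_masses.items():
            if w <= m and best[m - w] > neg_inf:
                # Every intermediate prefix mass is scored, peak or no peak.
                # The full-length mass is the precursor, so it earns nothing.
                gain = reward(m) if m < total_mass else 0.0
                score = best[m - w] + gain
                if score > best[m]:
                    best[m] = score
                    back[m] = aa

    if best[total_mass] == neg_inf:
        return None, neg_inf              # no residue combination fits the mass

    # Walk the back-pointers from the full mass down to zero.
    seq = []
    m = total_mass
    while m > 0:
        seq.append(back[m])
        m -= residue_masses[back[m]]
    return "".join(reversed(seq)), best[total_mass]
```

With nominal masses G = 57, A = 71 and S = 87 and rewards at prefix masses 57 and 128, the sketch recovers the sequence GAS. If the peak at 128 is missing, the search still completes and returns a sequence beginning with G; the missing fragment only costs a penalty and does not break the search.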
In the next step, each of the 10000 candidates is re-evaluated by a more stringent scoring scheme, and the best candidates under the new scoring scheme (the number can be specified by users) are output. In this refined rescoring step, the ion mass error tolerance is stricter. The rewards of immonium ions as well as internal cleavage ions are now considered. Their reward/penalty computation is the same as for y and b ions. The immonium and internal cleavage ions are not counted in the previous step because including them would make deriving the best 10000 candidates too computationally expensive. Finally, a recalibration of the data is performed to account for minor deviations in the MS data. This recalibration method is similar to the one performed by Taylor and Johnson.3
In the last step, PEAKS computes a confidence score for each of the top-scoring peptide sequences. The refined scores can be seen as non-normalized measures of the likelihood of correctness for each peptide, and the distribution of scores gives a measure of the overall probability of successful sequencing. PEAKS first converts the refined score x of each peptide sequence to a raw confidence X by the formula X = exp(cx), where c is a parameter that PEAKS estimates from the spectrum. The raw confidence scores of all the top-scoring peptide sequences are then normalized so that the final confidence scores sum to 1. Finally, the positional confidences for each residue are derived from consensus among the globally top-scoring sequences.4
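The conversion from refined scores to normalized confidence scores can be sketched as below. The function name `confidence_scores` is hypothetical, and the parameter c is simply passed in; how PEAKS estimates c from the spectrum is not described here. Shifting the scores by their maximum before exponentiating is an added numerical safeguard: it multiplies every raw confidence by the same constant, which cancels in the normalization.

```python
import math

def confidence_scores(refined_scores, c):
    """Map refined scores x to confidences proportional to exp(c * x).

    refined_scores: refined scores of the top-scoring sequences.
    c: scaling parameter (assumed to be supplied by the caller).
    """
    if not refined_scores:
        return []
    top = max(refined_scores)
    # Raw confidence X = exp(c * x), shifted by the top score for stability.
    raw = [math.exp(c * (x - top)) for x in refined_scores]
    total = sum(raw)
    # Normalize so that the confidences sum to 1.
    return [r / total for r in raw]
```

For example, with refined scores 2, 1 and 0 and c = ln 2, the raw confidences are in the ratio 4:2:1, so the normalized confidences are 4/7, 2/7 and 1/7.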
Footnotes:
1. Hamm CW, Wilson WE, Harvan DJ. Peptide Sequencing Program. Comput Appl Biosci. 1986 Jun;2(2):115-8.
2. Hines WM, Falick AM, Burlingame AL, Gibson BW. Pattern-based algorithm for peptide sequencing from tandem high energy collision-induced dissociation mass spectra. J Am Soc Mass Spectrom. 1992;3:326-336.
3. Taylor JA, Johnson RS. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal Chem. 2001 Jun 1;73(11):2594-604.
4. Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A, Lajoie G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom. 2003;17(20):2337-42.