De Novo Result Validation

Figures 3(a) and 3(b) provide a guideline for determining a proper ALC score threshold for filtering “de novo only” sequences. The two figures show the local confidence score distribution of residues in the de novo sequences that remain after filtering by the current ALC score threshold.

After database search, de novo sequences can be categorized as:

1. Verifiable de novo sequences

A de novo sequence is verifiable if the associated MS/MS spectrum is confidently matched to a database peptide. Residues in a verifiable de novo sequence can be validated using the database peptide as a reference.

db-denovo-valid

2. “De novo only” sequences

A de novo sequence is “de novo only” if the associated MS/MS spectrum is not confidently matched to any database peptide. “De novo only” sequences may suggest novel peptides, peptides with unknown modifications, or other interesting research subjects.
“De novo only” sequences are crucial for a complete proteomic analysis. However, it is often necessary to remove low-quality sequences. “De novo only” peptides can be filtered by the de novo ALC score, which is the average of the local confidence scores of the residues in the de novo sequence.
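The ALC definition above can be illustrated with a minimal sketch. The function names, the record layout, and the example peptides and scores are hypothetical and made up for illustration; the inclusive comparison (`>=`) is also an assumption, not a statement of how the software applies the threshold.

```python
def alc_score(local_confidences):
    """Average local confidence (ALC): the mean of the per-residue scores."""
    if not local_confidences:
        raise ValueError("a de novo sequence needs at least one residue score")
    return sum(local_confidences) / len(local_confidences)


def filter_by_alc(sequences, threshold):
    """Keep the sequences whose ALC reaches the threshold.

    Assumption: the threshold is inclusive (ALC >= threshold).
    """
    return [s for s in sequences if alc_score(s["local_confidence"]) >= threshold]


# Made-up example data: one confident sequence and one low-quality sequence.
candidates = [
    {"peptide": "LVNELTEFAK",
     "local_confidence": [95, 92, 88, 90, 97, 85, 91, 89, 93, 90]},
    {"peptide": "AGHTK",
     "local_confidence": [40, 55, 62, 38, 50]},
]

kept = filter_by_alc(candidates, threshold=80)
```

With a threshold of 80, only the first sequence (ALC 91) is kept; the second (ALC 49) is removed.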

Figure 3(a) shows the score distribution of residues in verifiable de novo sequences. These residues are validated by aligning the de novo sequence with the database peptide. A residue is considered correct if it is consistent with the corresponding residue in the database peptide; otherwise, it is considered incorrect. The figure shows the score distributions of correct and incorrect residues in two different colors.

db-denovo-figure-3a
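The residue validation behind Figure 3(a) can be sketched as follows. This is a deliberately simplified illustration, not the alignment the software performs: it assumes the de novo sequence and the database peptide have the same length and compares them position by position, treating I and L as equivalent because they have the same mass. The function name and the example data are hypothetical.

```python
def label_residues(de_novo, db_peptide, scores):
    """Pair each residue's local confidence score with a correct/incorrect flag.

    Simplifying assumption: both sequences have equal length, so residue i of
    the de novo sequence is compared with residue i of the database peptide.
    """
    if not (len(de_novo) == len(db_peptide) == len(scores)):
        raise ValueError("sequences and scores must have the same length")

    def normalize(residue):
        # Leucine and isoleucine are isobaric, so treat them as the same residue.
        return "L" if residue == "I" else residue

    return [
        (score, normalize(a) == normalize(b))
        for a, b, score in zip(de_novo, db_peptide, scores)
    ]


# Made-up example: the de novo sequence differs from the database peptide
# at position 5 (I vs. L, counted as correct) and position 7 (G vs. E).
labeled = label_residues(
    "LVNEITGFAK",
    "LVNELTEFAK",
    [95, 92, 88, 90, 70, 85, 35, 89, 93, 90],
)
incorrect_scores = [score for score, is_correct in labeled if not is_correct]
```

Collecting such (score, flag) pairs over all verifiable sequences yields the two distributions plotted in the figure.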

Figure 3(b) shows the score distribution of residues in “de novo only” sequences. Because these residues cannot be validated directly against a database peptide, the proportion of correct residues is estimated statistically from the distributions in Figure 3(a). The figure shows the estimated score distributions of correct and incorrect residues in two different colors.

db-denovo-figure-3b
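One way to picture the estimate behind Figure 3(b) is a per-bin transfer of the correct fraction observed in the verifiable sequences. The sketch below assumes that residues with similar local confidence scores are correct at the same rate in both groups; this is an illustrative assumption, and the exact statistical method used by the software is not described here. The function names, the bin width of 10, and the data are hypothetical.

```python
from collections import defaultdict


def correct_fraction_by_bin(labeled, bin_width=10):
    """Fraction of correct residues per score bin, from verifiable sequences.

    `labeled` is an iterable of (score, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for score, is_correct in labeled:
        bin_index = int(score // bin_width)
        totals[bin_index] += 1
        correct[bin_index] += 1 if is_correct else 0
    return {b: correct[b] / totals[b] for b in totals}


def estimate_counts(de_novo_only_scores, fractions, bin_width=10):
    """Estimated (correct, incorrect) residue counts per score bin.

    Bins with no reference data in `fractions` are left out of the result.
    """
    estimates = defaultdict(lambda: [0.0, 0.0])
    for score in de_novo_only_scores:
        bin_index = int(score // bin_width)
        if bin_index not in fractions:
            continue  # no verifiable residues in this bin to learn from
        p_correct = fractions[bin_index]
        estimates[bin_index][0] += p_correct
        estimates[bin_index][1] += 1.0 - p_correct
    return {b: tuple(counts) for b, counts in estimates.items()}


# Made-up reference data from verifiable sequences: (score, is_correct).
reference = [(95, True), (92, True), (91, False),
             (45, False), (42, True), (48, False), (41, False)]
fractions = correct_fraction_by_bin(reference)

# Made-up local confidence scores of residues in "de novo only" sequences.
estimates = estimate_counts([93, 97, 44, 46, 49, 43], fractions)
```

Summing the estimated incorrect counts over all bins and dividing by the total number of residues gives an estimated proportion of incorrect residues for the filtered set.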

As a guideline for setting the ALC score threshold, you can gradually increase the threshold until the score distributions of correct and incorrect residues in Figures 3(a) and 3(b) are similar. In the following example, the ALC threshold is gradually increased to 80. This ensures that the filtered “de novo only” sequences are generated from MS/MS spectra of the same spectral quality as the MS/MS spectra that were confidently matched in the database search. Figure 3(b) also allows you to estimate the proportion of incorrect residues in the filtered “de novo only” sequences.

db-denovo