Identification of novel protein-coding sequences in Eucalyptus grandis plants by high-resolution mass spectrometry

Jorge, Gabriel Lemes, and Tiago Santana Balbuena. “Identification of Novel Protein-Coding Sequences in Eucalyptus Grandis Plants by High-Resolution Mass Spectrometry.” Biochimica et Biophysica Acta (BBA) – Proteins and Proteomics, no. 3, Elsevier BV, Mar. 2021, p. 140594. Crossref, doi:10.1016/j.bbapap.2020.140594.


Eucalyptus species are widely used in the forestry industry, and the number of sequences available for these species in database repositories has increased significantly. In proteomics, a protein is identified by correlating the theoretical fragmentation spectrum derived from genomic/transcriptomic data with the experimental fragmentation mass spectrum acquired from large-scale analysis of protein mixtures. Proteogenomics is an alternative approach that can identify novel proteins encoded by regions previously considered non-coding. This study aimed to confidently identify and confirm the existence of previously unknown protein-coding sequences in the Eucalyptus grandis genome. To this end, we used a modified spectral correlation strategy and a dedicated de novo peptide sequencing pipeline. Using this strategy, we confidently identified 41 novel peptide forms and six peptides containing at least one single amino acid substitution. The most represented genomic class among the novel peptides was alternative reading frames. In contrast, no clear pattern of single amino acid substitutions was identified. The identifications were validated using a parallel reaction monitoring approach, which provided further mass spectrometry support for the existence of the novel peptide sequences. Data are available via ProteomeXchange with identifier PXD022110.
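
The idea of a peptide originating from an alternative reading frame can be illustrated with a minimal six-frame translation sketch. This is not the pipeline used in the study; the function names (`six_frame_translation`, `frames_containing`) and the toy DNA sequence are hypothetical, and the standard genetic code is assumed. The sketch translates a nucleotide sequence in all six frames and reports which frames contain a given peptide.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map frame labels ('+1'..'+3', '-1'..'-3') to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_containing(peptide, dna):
    """List the frames whose translation contains the peptide."""
    return [
        name
        for name, protein in six_frame_translation(dna).items()
        if peptide in protein
    ]


if __name__ == "__main__":
    toy_dna = "ATGGCCAAGTAA"  # hypothetical sequence, annotated frame is +1
    print(six_frame_translation(toy_dna))
    print(frames_containing("WPS", toy_dna))
```

In this toy case, a peptide that matches frame `+2` rather than the annotated `+1` frame would be a candidate alternative-reading-frame product. In practice such a candidate would still need spectral evidence, as the abstract describes.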