In previous instalments, we looked at what happens to the poor peptides we send to their doom down the rabbit hole and into the strange wonderland that is the inside of a mass spectrometer. Now, it's all well and good to fragment peptides into tiny little pieces; I, for one, am all for it.
I mean, who doesn't like some good fireworks? But that's not how you cure cancer. So how do we turn all of these large files containing thousands of spectra into actual, biologically relevant data?
There are three processes that are essential for this:
Today we will deal with peptide identification. The two other processes will be discussed in future entries.
The aim of peptide identification is to generate PSMs, that is, “Peptide-Spectrum Matches” (where a peptide is actually a theoretical peptide sequence).
 NB: We haven’t actually described the ion optics that handle the peptides inside the mass spectrometer, but I hope to do this in a future blog entry.
 Well, actually, my wife hates them.
In the previous blog entry, we explained how MS2 spectra are generated. If you haven't read it, I encourage you to at least take a look at it, because it makes a few important points about why MS2 spectra are usually insufficient to derive the full sequence of a peptide.
After a sample has gone through a mass spectrometer, the information that can help us determine what a given fragmented MS1 peak actually is consists of:
Precursor monoisotopic mass (usually known with great precision if measured in an Orbitrap) and charge.
The MS2 spectrum usually contains enough sequential information to de novo sequence part of the peptide (an "MS2 sequence tag"), but not the whole of it. Immonium ion peaks can also be used to identify some of the amino acids present in the sequence.
The point at which a peptide elutes in the online gradient can tell us about its expected chemical properties. I am not aware of this type of information currently being leveraged to identify peptides, except in the library search approach.
In most cases (though not for environmental proteomics) we know the parent species of the sample, so we can predict the expected peptides. Note that the more precise the database, the better: if you have transcriptomics data for your sample, use it to update the species' reference proteome with sample-specific variants.
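As a side note on the first of these points: the instrument reports an m/z value and a charge, and the neutral monoisotopic mass is derived from them. A minimal sketch in Python (the precursor values in the comment are made up for illustration):

```python
PROTON_MASS = 1.007276  # monoisotopic mass of a proton, in Da

def neutral_mass(mz: float, charge: int) -> float:
    """Neutral monoisotopic mass of a precursor from its observed m/z and charge:
    M = z * (m/z) - z * m(proton)."""
    return mz * charge - charge * PROTON_MASS

# A doubly charged precursor observed at m/z 785.8421 (made-up value)
# corresponds to a neutral peptide mass of roughly 1569.67 Da.
```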
There are three main strategies for identifying peptides based on the information above: Database Searching, De Novo Sequencing, and Library Searching. Currently, the most widely used method is by far Database Searching. I will present these three methods below:
Well, actually, I lied a little earlier. I have already explained at length how this most widely used method for peptide identification works in a previous blog entry, so I will be skipping it here. But I do encourage you to read said blog, as understanding how database searching works is crucial to knowing the value of your data. Here, we will just be stating a few important facts:
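For readers who have not seen that entry, the heart of database searching can be caricatured in a few lines: generate the theoretical fragment masses of a candidate peptide, and count how many match observed peaks. The sketch below is a deliberately naive toy, not any real search engine, and the residue table is truncated to a handful of amino acids:

```python
# Monoisotopic residue masses in Da (toy subset; a real engine covers all
# residues plus modifications)
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
            "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111}
WATER, PROTON = 18.010565, 1.007276  # Da

def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values for a peptide string."""
    masses = [RESIDUES[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions + y_ions

def count_matched_peaks(theoretical, observed, tol=0.02):
    """Naive score: how many theoretical ions have an observed peak within tol."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)
```

A real engine scores every candidate whose mass matches the precursor, using far more sophisticated scoring than a simple peak count, but the match-theory-to-observation principle is the same.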
De novo sequencing is the dream. It is what LC-MS/MS should ideally be doing all the time, no exceptions. Alas, as with all dreams, it is curtailed by reality; in this case, by the inherent limitations of available fragmentation methods, which cannot produce full sequence information for most MS2 spectra. Thus, whereas a direct, full sequencing approach is available for nucleic acids or for individual proteins (Edman degradation), this is sadly not the case for proteomics. There does, however, exist a variety of algorithms and software solutions for peptide de novo sequencing that allow partial sequencing of most peptides and use additional information or machine learning to estimate the most likely full sequence.
 Should, one day, a machine become available that can 1/ wholly segregate the individual proteins in a complex mixture, 2/ Edman sequence each isolated protein in parallel, and 3/ do so with high-throughput kinetics, then LC-MS/MS-based proteomics will effectively become redundant.
There are two main uses for de novo sequencing in peptide identification:
However, because of the limitations described above, these solutions cannot fully de novo sequence most MS2 spectra. Instead, they extract as much sequence information as they can from a given spectrum, then attempt to predict the rest of the sequence, typically using machine learning approaches. The final result should be a series of possible sequences per spectrum, each with a confidence score. In addition, each position in a given sequence has its own associated confidence.
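The extraction step can be illustrated with a toy example: within a single ion series, the mass gap between two consecutive peaks is the mass of one residue, so walking along the series spells out a partial tag. A sketch, with a truncated residue table and made-up peak values:

```python
# Monoisotopic residue masses in Da (toy subset; real tools use all 20+ residues)
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
            "L": 113.08406, "K": 128.09496}

def read_tag(peaks, tol=0.02):
    """Turn mass gaps between consecutive peaks of one ion series into residues.
    Unassignable gaps become '?'; real tools score alternatives instead."""
    tag = []
    for lo, hi in zip(peaks, peaks[1:]):
        gap = hi - lo
        aa = next((r for r, m in RESIDUES.items() if abs(m - gap) <= tol), "?")
        tag.append(aa)
    return "".join(tag)

# read_tag([58.0287, 129.0658, 216.0979]) reads the gaps 71.037 and 87.032 Da,
# i.e. the partial tag "AS".
```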
Quite a few de novo sequencing algorithms have been published. Some of the handiest solutions available include PEAKS Studio and DeNovoGUI. The latter is free and implements several different sequencing algorithms (namely Novor, DirecTag, PepNovo+ and pNovo+), thus allowing for a consensus approach.
This is essentially a database search made simpler by incorporating a partial de novo sequencing step. While it is rarely possible to sequence whole peptides from MS2 spectra, it is usually possible to extract enough information, e.g. short “sequence tags”, to learn something about the parent sequence. This information can then be used to reduce the search space for a given MS2 spectrum. This approach can significantly reduce search times while increasing data quality.
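In code, the search-space reduction amounts to a simple filter: only database peptides that both contain the de novo tag and match the precursor mass need full fragment-level scoring. A hedged sketch, with a truncated residue table and a hypothetical candidate list:

```python
# Monoisotopic residue masses in Da (toy subset for illustration)
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
            "L": 113.08406, "K": 128.09496, "R": 156.10111}
WATER = 18.010565  # Da

def peptide_mass(p):
    """Neutral monoisotopic mass of a peptide string."""
    return sum(RESIDUES[aa] for aa in p) + WATER

def tag_filter(candidates, tag, precursor_mass, tol=0.01):
    """Keep only candidates containing the de novo sequence tag AND matching
    the precursor mass; only these go on to full scoring."""
    return [p for p in candidates
            if tag in p and abs(peptide_mass(p) - precursor_mass) <= tol]

# tag_filter(["GASK", "AVLK", "SGAR"], "AS", 361.1961) keeps only "GASK"
```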
A piece of software with an absolutely beautiful hybrid approach to database searching is SCIEX's ProteinPilot. It uses limited (i.e., high-confidence) de novo sequencing to generate sequence tags, then matches them to the search database to create a protein "temperature": the more de novo tags cover a region of the protein, the hotter it is. Local temperature is then factored into the search parameters, so that more PTMs (including common PTMs that may not have been included by the user in the search parameters) or sequence substitutions are allowed in hotter regions, while for cold regions only the most basic search is performed. By reducing the effective database size (database complexity is only maximal for hot regions), this algorithm allows for faster database searches. It also significantly increases the percentage of identified MS2 spectra, and thus the quality of the data. I encourage you to look at its principle in more detail here: https://sciex.com/Documents/tech%20notes/ProteinPilot-Software-Overview-RUO-MKT-02-1777-A.pdf
This identification method relies exclusively on experience gathered using the other two approaches (usually database searching, but any type of PSM can be used). Peptide identifications are used to generate a spectral library in which each PTM-specific peptide sequence is associated with signature peaks, a charge state, and an "arbitrary retention time". The latter corresponds to the peptide's retention time relative to a series of control peptides (iRT) spiked into the samples, and has the property of being portable between different systems and gradients. The combination of signature peaks, charge state, and iRT information is enough to identify MS2 spectra, and allows for much faster and more complete interrogation of the data.
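The iRT idea is just a linear recalibration: the observed retention times of the spiked standards are fitted against their fixed reference iRT values, and that fit converts any observed retention time onto the portable scale. A sketch with made-up numbers:

```python
def fit_irt(observed_rt, reference_irt):
    """Least-squares line mapping observed retention times of the spiked
    standards to their reference iRT values; returns (slope, intercept)."""
    n = len(observed_rt)
    mean_x = sum(observed_rt) / n
    mean_y = sum(reference_irt) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(observed_rt, reference_irt))
             / sum((x - mean_x) ** 2 for x in observed_rt))
    return slope, mean_y - slope * mean_x

def to_irt(rt, slope, intercept):
    """Convert an observed retention time to the portable iRT scale."""
    return slope * rt + intercept

# Standards eluting at 10, 20 and 30 min with reference iRTs 0, 50 and 100
# give slope 5 and intercept -50; a peptide eluting at 25 min has iRT 75.
```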
Importantly, the Library Search strategy is crucial to the Data-Independent Acquisition approach to Proteomics. It has also been used successfully to hasten the process of PTM-modified peptide identification.
If you should remember one thing and one thing only from this blog entry, it is that there is a fundamental difference between how identifications work in Proteomics and in Nucleic-Acid-omics. Basically, in Proteomics we do not sequence peptides; we match theoretical peptide spectra to observed ones. Because these matching methods are much more error-prone, the results must be controlled using robust statistical analysis. I have detailed the control method used for Database Searching in a previous blog entry. The consequence of this is that researchers have to be much more careful when dealing with proteomics data. This is why, in most studies, proteins are filtered, and those identified with only one peptide are often discarded.
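The statistical control in question is typically the target-decoy approach: the search is run against both real ("target") and reversed or shuffled ("decoy") sequences, and the rate of decoy hits above a score threshold estimates the false discovery rate among the accepted PSMs. A minimal sketch with made-up scores:

```python
def fdr_estimate(psms, threshold):
    """Target-decoy FDR estimate at a given score threshold.
    psms: list of (score, is_decoy) tuples for all PSMs."""
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# With made-up scores, raising the threshold lowers the estimated FDR:
psms = [(3.2, False), (3.0, False), (2.5, True), (2.2, False), (1.8, True)]
# fdr_estimate(psms, 2.0) -> 1 decoy / 3 targets, i.e. about 0.33
# fdr_estimate(psms, 2.6) -> 0 decoys / 2 targets, i.e. 0.0
```

In practice the threshold is chosen so that the estimated FDR stays below a chosen level, commonly 1%.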
In the next blog entry (or entries?), we will discuss the challenges associated with peptide and protein quantitation in LC-MS/MS-based proteomics.