Background: Tandem mass spectrometry (MS/MS) has become the primary way for protein identification in proteomics. A good score function for measuring the match quality between a peptide and an MS/MS spectrum is instrumental for the protein identification. Traditionally the tobe-measured peptides are fragmented with the collision induced dissociation (CID) method. More recently, the electron transfer dissociation (ETD) method was introduced and has proven to produce better fragment ion ladders for larger and more basic peptides. However, the existing software programs that analyze ETD MS/MS data are not as advanced as they are for CID. Results: To take full advantage of ETD data, in this paper we develop a new score function to evaluate the match between a peptide and an ETD MS/MS spectrum. Experiments on real data demonstrated that this newly developed score function significantly improved the de novo sequencing accuracy of the PEAKS software on ETD data. Conclusion: A new and better score function for ETD MS/MS peptide identification was developed. The method used to develop our ETD score function can be easily reused to train new score functions for other types of MS/MS data.

from The Eighth Asia Pacific Bioinformatics Conference (APBC 2010) Bangalore, India 18-21 January 2010

In recent years, tandem mass spectrometry (MS/MS) has
become a popular technique in proteomics for protein
identification. In a typical ?bottom-up? approach,
proteins are enzymatically digested into short peptides, each
of them is measured with MS/MS and produces an MS/MS
spectrum. Analytical software is used to identify a peptide
for each qualified spectrum. Then the peptides are used
for protein identification and characterization. Many
software programs (e.g. Mascot, Sequest and PEAKS)
have been developed for MS/MS data analysis [

Both approaches identify peptides with two steps. First, a
set of candidate peptides are selected either from a
protein database or with a sequence construction
algorithm for a given spectrum. Second, each sequence
candidate is evaluated with a score function that
measures the quality of the match between the sequence
and the spectrum, and the peptide with the best
matching score is then output. Inevitably, the quality
of the score function greatly affects the quality of the
peptide identification and its following analysis. In a
tandem mass spectrometer, a peptide ion is fragmented
into fragment ions. Normally, the fragmentation can
occur at any position of the peptide backbone, resulting
in different types of fragment ions (Figure 1(a)). Each
ion will result in a peak at the corresponding mass to
charge ratio (m/z), and all the ions together form a
theoretical mass spectrum (Figure 1(b)). The real MS/MS
spectrum is more complicated due to the existence of
noise peaks, the missing of some fragment ion peaks,
and the different intensities of each peak (Figure 2).
The collision induced dissociation (CID) is among the
earliest ways developed for peptide fragmentation in a
tandem mass spectrometer, and is also the most widely
studied in bioinformatics. It is known to produce more y
and b-ions than the other ion types. Recently, more
fragmentation mechanisms such as infrared
multiphoton dissociation (IRMPD) [

Due to the importance of the matching score function,
many scoring methods have been proposed. One
straightforward approach is based on the match between
the real spectrum and the theoretical spectrum generated
from a candidate peptide [

Most of the existing score functions are designed and
trained specifically for CID data. ETD spectra differ from
CID spectra in the magnitude and the complexity of
fragment ion signals. Although a few software tools such
as PEAKS and Mascot have been adjusted to work on
ETD data, they are better suited for CID data and give
poorer results with ETD data. To better utilize the
advantage of ETD spectra on longer and more basic
peptides, in this paper, we propose a score function for
ETD data based on a probabilistic model. The
experiments on real data showed that the score function
significantly improved the de novo sequencing
performance when used in the PEAKS software [

We used an ETD dataset of 9885 spectra from a complex C. elegans protein mixture digested with trypsin followed by alkylation with iodoacetamide. The spectra were acquired on an LTQ Orbitrap XL ETD. The peptide mixtures were separated with Surveyor LC equipped with MicroAS autosampler using a reversed phase peptide trap and a reversed phase analytical column at a flow rate of 250 nl/min. A gradient of 5 ~ 30% acetonitrile in 90 minutes was employed.

The database search modules in PEAKS 5.0 [

An ETD spectrum often contains several peaks generated
from its precursor ions. These peaks often have high
intensities, and may affect the performance of the
matching score function. Thus, we removed these
precursor ion peaks from the spectra in data preparation.
Most de novo sequencing software developed so far is for
CID data and does not work for ETD data. The only
software available to us for performance comparison
using ETD data is the PEAKS software [

We mainly compared the accuracies of the PEAKS de novo sequencing results with the two score functions, respectively. There are two ways to define the accuracy: Accuracy I = number of correctly predicted residues

number of residues in real peptides and Accuracy II = number of correctly predicted residues number of predicted residues .

Accuracy II is the same as the accuracy defined in [

In addition, we search the maximum correct consecutive subsequences in the predicted peptides. For each predicted peptide P, the length of the maximum correct consecutive subsequence is denoted by Lmax(P). We count the number of predicted peptides with Lmax(P) ? l, for l = 3, 4, ..., 10. The error tolerance for the position of a predicted amino acid subsequence is 0.6 Da, and the amino acids leucine(L) and isoleucine(I) are considered as the same since their mass values are the same. The testing data contain 129 peptides and 1859 amino acids. Using the function in PEAKS 5.1 and our score function, we acquired two sets of predicted peptides, each contains 129 de novo peptides. The accuracy and the maximum correct consecutive subsequences of the resulting peptides are reported in Table 1. The results show that the new score function significantly improves the accuracy of de novo sequencing from 32.5% to 55.3% for Accuracy I, and from 34.2% to 55.9% for Accuracy II. In PEAKS 5.1, about 56% of predicted peptides contain a corrected subsequence with length at least 3. For the new score function, more than half of the predicated peptides contain a correct subsequence with length at least 6. From the experiments, we conclude that the new score function has better performance in de novo sequencing with ETD data than the score function in PEAKS 5.1.

We note that the independence assumption in our score
function development in Section Methods is not very
realistic because different types of ions generated by the
fragmentation between the same pair of adjacent
residues of the peptide are often correlated to each
other. To more accurately model this, a probabilistic
network (or Bayesian network) can be used similarly to
[

In this paper, we proposed a significance level measurement to distinguish signal peaks from noise. Based on this, a score function for de novo sequencing with ETD data was defined. Experiments on real data showed that our score function greatly improved the score function used in the latest PEAKS 5.1 software. The method used to develop our ETD score function can be easily reused to train new score functions for other types of MS/MS data.

We first introduce a novel significance level measurement to evaluate the likelihood that a peak is signal. The significance level has a better ability to distinguish signal and noise peaks than the original relative intensities given by the spectrum. Then, we study the distributions of the significance levels of different types of fragment ions, and define a likelihood ratio score for each fragment ion. Finally, we combine all the fragment ion scores of a peptide together to provide a peptide score function.

The peptide fragmentation mechanism shown in Figure 1(a) and 1(b) is overly simplified. In a real mass spectrum (Figure 2), there are also many noise peaks that cannot be explained by the six types of fragment ions. Meanwhile, not all of these fragment ions form strong peaks in the spectrum. Some are either absent or too low to be distinguished from the noise. Different spectra, and even different regions of the same spectrum, may have different noise and signal levels. Consequently, the intensity or the relative intensity given by the original spectrum do not accurately reflect the likelihood for a peak being signal. Thus, it is necessary to develop a better measure to reflect this likelihood. In this section we focus on this task and develop a measure called significance level.

According to our experience, four features of a peak may be associated with the significance of the peak: (1) global rank, (2) local rank, (3) global intensity ratio, and (4) local intensity ratio. These four features are explained in the following, respectively.

The global rank of a peak is the number of peaks (in the same spectrum) that are higher than or equal to the current peak. A smaller global rank means a more significant peak. However, a spectrum of a longer peptide normally has more signal peaks than a spectrum of a shorter peptide. With this observation, we further adjust the global rank by dividing the original ranking number by the size of the peptide, which is estimated by dividing the precursor ion mass by the average mass of an amino acid residue (approximately 112 Da). The global intensity ratio is the ratio between a global reference intensity and the intensity of the considered peak. If this ratio is lower than 1 then it is set to 1. We consider the following three choices for calculating the global reference intensity: (a) the intensity of the highest peak, (b) the intensity of the second highest peak, and (c) the average intensity of the 3rd to the 10th highest peaks. The reason we consider choices (b) and (c) is that some peptides are hard to fragment or can only fragment on one or two sites, resulting in a spectrum with only a couple of very high intensity peaks and many very low signal peaks. The very few extremely high intensity peaks are really outliers and cannot serve as an accurate reference. Therefore, it is meaningful to use choices (b) or (c), which ?boost? the low-intensity signal peaks. The three choices are compared using the training data of 130 spectrum-peptide pairs described in Section Results as follows: We consider z?-ions, which are z-ions with an additional hydrogen. There are in total 1206 z?-ions for the peptides in the training data, of which 1093 match peaks. These 1093 peaks are regarded as signal peaks. Then 12060 (10 times of the real z?-ions) random m/z values are generated to match peaks in the same set of spectra, giving 3370 randomly matched peaks. These 3370 random peaks are regarded as background noise. For each of the three choices of the global reference intensity, we draw the Receiver Operating Characteristic (ROC) curve for distinguishing the signal peaks from the background noise. Figure 3 shows the three curves. Choice (c) has the best ROC curve and is therefore used in this paper to calculate the global reference intensity and the global intensity ratio.

The local rank and the local intensity ratio are defined similarly as their global versions, except that only the peaks within ±57 Da (the mass of the smallest amino acid residue) difference to the current peak are examined. More specifically, the local rank is the number of peaks which are within ±57 Da difference, and are higher than or equal to the current peak. Suppose the intensity of the highest peak within ±57 Da is h1, the global reference intensity is h2, the intensity of the current peak is h, then the local intensity ratio of the current peak is defined as max (1, min(h1, h2)/h).

All of the above four features contribute some information towards the significance of a peak. The significance level of a peak is calculated as a linear combination of the natural logarithms of the four features. Let rg, rl, tg and tl be the global rank, local rank, global intensity ratio and local intensity ratio of a peak, then its significance level is

c rg log(rg ) + c rl log(rl) + ctg log(t g ) + ctl log(t l), where crg, crl, ctg and ctl are the coefficients for the features. It should be noted that under this definition, a smaller value for significance level means a stronger peak. This is somewhat counterintuitive but chosen deliberately in order to assign the same value 0 to the strongest peak in every spectrum.

From our experience, the local rank is the most important feature. An exhaustive search is used to find the best coefficient combination. We let crl = 1, and enumerate all possible values of crg, ctg, ctl ? {0.01, 0.02, ..., 1}. For each combination, the significance level is evaluated by the area below the ROC curve for distinguishing the signal peaks and the background noise in the training data. The search found the best normalized coefficients to be crg = 0.22, crl = 0.40, ctg = 0.05 and ctl = 0.33. The ROC curves with each of the four features and their best combination for the testing data are shown in Figure 4. This figure illustrates that the linear combination of the four features noticeably improves the ability in distinguishing between real matched peaks and randomly matched peaks. We point out that for de novo sequencing, the differences between the top few sequence candidates are minor, therefore a small improvement in Figure 4 may cause a large improvement in the final de novo sequencing results. Distribution of the peak significance levels for different ion types It is well known that different types of mass spectrometers (especially if the peptide fragmentation methods are different) tend to produce different ion types. For example, CID produces more b and y-ions than other ion types, while ETD produces more c and z?-ions. Thus, when using a peptide to explain a spectrum, matching a high-intensity peak with a z?-ion or a y-ion will have different contributions towards the likelihood that the peptide is correct. For ETD data, because we expect high z?-ions, the matching with a z?-ion will contribute more than the matching with a y-ion. Thus it is necessary to study the distribution of the significance level for different ion types. In this section we study this with ETD data. The distribution will help us to develop our final score function.

By using the same training data of 130 spectrum-peptide pairs, the distribution of the significance levels for several common ion types in ETD data are plotted in Figure 5. Keep in mind that the significance level 0 means the peak is the most significant in a spectrum. Because the training data is limited in its size (only approximately 1000 data points for each ion type), the distributions shown in Figure 5 are not smooth and not suitable to be used directly in our scoring function. Thus the following treatment is applied to the data. For each ion type t, suppose there are n ions of this type from the given peptides. m of the n ions match peaks in the spectra. We divide the range of significance level into four intervals I1, I2, I3, I4, such that Ij is left to Ij+1 and each interval contains m/4 matched peaks. Thus for each j = 1, 2, 3, 4, Pr(t ion significance level ? Ij) = 4mn . Meanwhile we can calculate Pr(random ion significance level ? Ij) by simple counting. So, the likelihood ratio of the two events is calculated as

.

Next we define a continuous function f by interpolation to measure the significance of the event that a type-t peak has significance level x. Denote the centroid of each interval Ij by cj. Define

? Pr(t ion significance level ?I j) ? f (c j) = log ? ? .

?? Pr(random ion significance level ?I j) ?? In addition, let c5 be the centroid of the largest 10% significance levels in the training data. At significance level greater than or equal to c5, most peaks are noise. Therefore, matching one of these peaks with an ion becomes a very insignificant event, and can be treated as if the ion does not match any peak. Thus, we define ? Pr(t ion does not match peak ? f (c5) = log ?? Pr(random ion does not match peak) ? ? .

Let f(x) = f(c1) for x <c1 and f (x) = f (c5) for x >c5. Let f (x) for other x be defined by linear interpolation. Thus, f(x) becomes an approximation to the log likelihood-ratio between the two hypotheses that the peak is a t-ion match and a random match. More specifically, ? Pr(t ion significance level= x) ? f (x) ? log ?? Pr(random ion significance level= x) ? ? .

The curves of f(x) for several different ion types are shown in Figure 6. This function will be used to develop our final score function.

Almost every score function for measuring the peptidespectrum matching relies on the peaks matched by the theoretical fragment ions of the peptide. In general the more significant peaks are matched, the more likely this peptide is the correct peptide. However, because there are different types of ions involved, it is not trivial to balance the contributions of all ions and combine them together as a single score value. Here we use the standard likelihood ratio method to define the score. Consider a peptide P of length n. Suppose there are k ion types t1, ..., tk considered in the score function. Denote all ions of type ti by ti, j (j = 1, ..., n - 1). Let xi, j be the significance level of the peak that matches ti, j. xi, j = ? if the ion matches no peak. Then the likelihood of observing the matches xi, j (i = 1, ..., k, j = 1, ..., n - 1) under the hypothesis that P is the correct peptide is

? t i, j has significance level xi, j , P is correct ?? .

Pr ??? for i = 1,? , k, j = 1,? , n ? 1 ?? The likelihood of the matches under the hypothesis that P is a random peptide is

? t i, j has significance level xi, j , P is random ?? .

Pr ??? for i = 1,? , k, j = 1,? , n ? 1 ?? The significance of this match, denoted as l (x1,1, x1,2, ..., xk, n-1), is therefore evaluated by the ratio between these two likelihoods. In this section we make the assumption that the events that ti, j has intensity xi, j are independent to each other. The independence assumption makes our score function easier to calculate while not sacrificing much performance on analysis accuracy. A discussion on the model where the events are dependent can be found in Section Discussion.

When the events ti, j has intensity xi, j are independent to each other, the log-likelihood-ratio becomes log ?(x1,1, x1,2,? , x k,n?1)

? Pr(t i, j has intensity xi, j|P is correct) ? = ? log ? ? i, j ?? Pr(t i, j has intensity xi, j|P is random) ?? (1) = ? fi, j(xi, j).

i, j Here fi, j is one of the f functions trained for the j-th ion of the ion type ti. In practice, for each ion type of interest, we learn five f functions, for the first and second ions near the N-terminus, the last and second last ions at the C-terminus, and the middle of the peptide, respectively. The log l defined in (1) is our score function. We still need to decide what types of ions can be included in the score function for ETD data. Clearly we want to include the more frequently observed ion types because they are more distinguishable from noise. Meanwhile, we do not want to include ion types that are too dependent to each other because that conflicts our independence assumption. The frequencies of observing peaks for a particular ion type in our training data are listed in Table 2. From the table, we select eight ions with the highest frequencies for our score function: z?, c, y, b, a, y-H2O, c-H2O, and z?2+-ion, where z?2+-ion is the z?-ion with charge 2. z-ion is not included because it is highly correlated with z?-ion.

The authors declare that they have no competing interests.

XL and BM designed and tested the score function, and drafted the manuscript. BS and LX contributed to the dataset and the test on the dataset.

The work was supported in part by China High-tech R&D Program 2008AA02Z313, NSERC (Canada), and a start up grant at University of Waterloo.

This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/11?issue=S1.