In practice, it is hard to tell which PSMs are false; otherwise, the algorithm could simply remove them and achieve zero false discoveries. Therefore, the target-decoy method [2] is widely used to estimate the FDR. In this method, the software searches the concatenation of a target database and a decoy database of the same size. If the decoy database is constructed properly, the software’s false identifications will be distributed evenly between the target and decoy databases. Since all decoy identifications are false, the number of decoy hits approximates the number of false target hits, and the FDR can be estimated as FDR = (# decoy hits) / (# target hits).

Figure 3: With a properly constructed decoy, the false identifications are distributed evenly between the target and decoy databases. Thus, the number of decoy hits can be used to estimate the FDR.
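
The estimate above can be illustrated with a minimal sketch. It assumes each PSM is represented as a hypothetical `(score, is_decoy)` pair and that PSMs are accepted when their score reaches a chosen threshold; the function name, the data layout, and the example scores are illustrative assumptions, not part of any particular search engine.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as (# decoy hits) / (# target hits) at a score threshold.

    psms: iterable of (score, is_decoy) pairs (hypothetical layout).
    """
    target_hits = 0
    decoy_hits = 0
    for score, is_decoy in psms:
        if score < threshold:
            continue  # PSM not accepted at this threshold
        if is_decoy:
            decoy_hits += 1
        else:
            target_hits += 1
    if target_hits == 0:
        # Convention assumed here: no accepted target hits means no estimate above 0.
        return 0.0
    return decoy_hits / target_hits


# Made-up example data: four target PSMs and two decoy PSMs.
example_psms = [
    (9.1, False), (8.7, False), (8.2, True),
    (7.9, False), (7.5, False), (6.0, True),
]
print(estimate_fdr(example_psms, 7.0))
```

Raising the threshold typically removes decoy hits faster than target hits, which is how a score cutoff is usually chosen to reach a desired FDR.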

There is a simple fix that avoids the first two common mistakes: the decoy fusion method proposed in the PEAKS DB paper [1]. Instead of concatenating the target and decoy databases, the decoy fusion method concatenates the decoy and target sequences of the same protein into a single “fused” sequence (Figure 5). This simple change makes meaningful differences. For the two-round search problem, the target and decoy lengths are still the same in the second round. For the protein score problem, the same bonus is applied equally to the target and decoy parts of the same fused sequence. Thus, the “same size” and “even distribution” prerequisites are restored, and the FDR is again estimated accurately. The built-in result validation of the PEAKS software uses this decoy fusion method.

Figure 5: The decoy fusion method “fuses” the target and decoy sequences of each protein together. Thus, the target and decoy sequences are guaranteed to have the same total length even if a two-round search algorithm is used.
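
The fusion idea can be sketched as follows. This is only an illustration under stated assumptions: the decoy is generated by reversing the target sequence (the text does not specify how the decoy is built), the target part is placed before the decoy part, and the names `make_decoy`, `fuse_database`, and `target_decoy_lengths` are hypothetical. It is not the PEAKS implementation.

```python
def make_decoy(sequence):
    # Assumption: a reversed sequence serves as the decoy.
    return sequence[::-1]


def fuse_database(proteins):
    """Map each protein name to (fused_sequence, boundary).

    The boundary is the index where the decoy part starts, so a match
    starting at or after it would be counted as a decoy hit.
    """
    fused = {}
    for name, target in proteins.items():
        fused[name] = (target + make_decoy(target), len(target))
    return fused


def target_decoy_lengths(fused, selected):
    """Total target and decoy lengths for a subset of fused proteins,
    such as the proteins kept for the second round of a two-round search."""
    target_len = sum(fused[name][1] for name in selected)
    decoy_len = sum(len(fused[name][0]) - fused[name][1] for name in selected)
    return target_len, decoy_len


# Made-up example sequences.
proteins = {"P1": "MKTAYIAK", "P2": "GAVLK"}
fused = fuse_database(proteins)
print(fused["P2"])
print(target_decoy_lengths(fused, ["P1"]))
```

Because each protein carries its own decoy, any subset of proteins selected for a second round still has equal target and decoy lengths, which is the property the paragraph above relies on.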

Common Misuses of the Target-Decoy Method

Mistake 1

Mistake 2

Mistake 3