All Research
Below is a collection of research published during the development of the PatternHunter software, along with third-party research that cites its use.
|
|
Here is a collection of the main research conducted during the development of the PatternHunter software.
|
Ma B, Tromp J, Li M. PatternHunter: Faster and More Sensitive Homology Search. Bioinformatics. 2002 Mar;18(3):440-5. |
|
MOTIVATION: Genomics and proteomics studies routinely depend on homology searches based on the strategy of finding short seed matches which are then extended. The exploding genomic data growth presents a dilemma for DNA homology search techniques: increasing seed size decreases sensitivity whereas decreasing seed size slows down computation. RESULTS: We present a new homology search algorithm 'PatternHunter' that uses a novel seed model for increased sensitivity and new hit-processing techniques for significantly increased speed. At Blast levels of sensitivity, PatternHunter is able to find homologies between sequences as large as human chromosomes, in mere hours on a desktop. AVAILABILITY: PatternHunter is available at www.bioinfor.com, as a commercial package. It runs on all platforms that support Java. |
|
Chen X, Li M, Ma B, Tromp J. DNACompress: Fast and Effective DNA Sequence Compression. Bioinformatics. 2002 Dec;18(12):1696-8. |
|
While achieving the best compression ratios for DNA sequences, our new DNACompress program significantly improves the running time of all previous DNA compression programs. |
|
Li M, Brown D. Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520-522. December 2002. |
|
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism. |
|
Li M, Ma B, Kisman D, Tromp J. PatternHunter II: Highly Sensitive and Fast Homology Search. Genome Inform. 2003;14:164-75. |
|
Extending the single optimized spaced seed of PatternHunter to multiple ones, PatternHunter II simultaneously remedies the lack of sensitivity of Blastn and the lack of speed of Smith-Waterman, for homology search. At Blastn speed, PatternHunter II approaches Smith-Waterman sensitivity, bringing homology search technology back to a full circle. |
|
|
|
Here is a collection of research written by users and other third parties that cites the use of the PatternHunter software or the methods it introduced.
|
Chen W, Sung WK. On Half Gapped Seed. Genome Inform. 2003;14:176-85. |
|
In this paper, we proposed a new type of seed for Blast-like homology search tools called "half seed". This new seed is better than the "consecutive seed" used by the original Blast tools in both sensitivity and efficiency. When compared with the "gapped seed", which was proposed together with a new Blast-like searching tool, PatternHunter, this new seed offers a much wider range of choices for trading off sensitivity against efficiency. This property is especially useful when a search application needs more precise results under limited hardware resources, or vice versa. |
|
Brejová B, Brown DG, Vinar T. Optimal Spaced Seeds for Homologous Coding Regions. J Bioinform Comput Biol. 2004 Jan;1(4):595-610. |
|
Optimal spaced seeds were developed as a method to increase sensitivity of local alignment programs similar to BLASTN. Such seeds have been used before in the program PatternHunter, and have given improved sensitivity and running time relative to BLASTN in genome-genome comparison. We study the problem of computing optimal spaced seeds for detecting homologous coding regions in unannotated genomic sequences. By using well-chosen seeds, we are able to improve the sensitivity of coding sequence alignment over that of TBLASTX, while keeping runtime comparable to BLASTN. We identify good seeds by first giving effective hidden Markov models of conservation in alignments of homologous coding regions. We give an efficient algorithm to compute the optimal spaced seed when conservation patterns are generated by these models. Our results offer the hope of improved gene finding due to fewer missed exons in DNA/DNA comparison, and more effective homology search in general, and may have applications outside of bioinformatics. |
|
Yang I, Wang S, Chen Y, Huang P, Ye L, Huang X, Chao K. Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds. IEEE Symposium on Bioinformatics and Bioengineering
|
Biologists rely heavily on good algorithms for finding homologous regions in biomolecular sequences. An advanced homology search program named PatternHunter has recently been developed. Unlike the well-known program BLAST, which uses a consecutive model, it utilizes a spaced seed model to attain higher sensitivity. We have developed a new program, which extends PatternHunter from a single spaced model to a multiple spaced model. In this paper, we describe methods for finding optimal single and multiple spaced models. |
|
Choi KP, Zeng F, Zhang L. Good Spaced Seeds for Homology Search. Bioinformatics. 2004 May 1;20(7):1053-9. Epub 2004 Feb 5. |
|
MOTIVATION: Filtration is an important technique used to speed up local alignment, as exemplified in the BLAST programs. Recently, Ma et al. discovered that better filtering can be achieved in their PatternHunter program by spacing out the matching positions that trigger a local alignment according to a certain pattern, instead of requiring contiguous positions. Such a match pattern is called a spaced seed. RESULTS: Our numerical computation shows that the ranks of spaced seeds (based on sensitivity) change with sequence similarity. Since homologous sequences may have diverse similarity, we assess the sensitivity of spaced seeds over a range of similarity levels and present a list of good spaced seeds for facilitating homology search in DNA genomic sequences. We validate that the listed spaced seeds are indeed more sensitive using three arbitrarily chosen pairs of DNA genomic sequences. |
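The spaced seed idea described in this abstract can be illustrated with a small Python sketch. It is a toy under stated assumptions, not PatternHunter's implementation: the helper `seed_hits` and the two example sequences are hypothetical, the comparison is restricted to equal offsets in both sequences (a single ungapped diagonal), and the pattern `111010010100110111` is the weight-11 seed commonly attributed to the original PatternHunter paper. A `1` marks a position that must match and a `0` marks a position that is ignored.

```python
# Illustrative sketch only; names and data below are assumptions for the example.
SPACED_SEED = "111010010100110111"   # 11 required positions spread over a span of 18
CONTIGUOUS_SEED = "1" * 11           # 11 required positions in a row (BLAST-style)

def seed_hits(a, b, seed):
    """Return the offsets s at which a and b agree on every '1' position of seed.

    Both sequences are compared at the same offset, i.e. along one ungapped diagonal.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    n = min(len(a), len(b))
    return [s for s in range(n - span + 1)
            if all(a[s + i] == b[s + i] for i in required)]

# Two 18-base sequences that differ at positions 3, 5, 6, 8, 10, 11 and 14.
a = "ACGTACGTACGTACGTAC"
b = "ACGAAGCTTCTGACATAC"

spaced_hits = seed_hits(a, b, SPACED_SEED)
contiguous_hits = seed_hits(a, b, CONTIGUOUS_SEED)
print("spaced seed hits:", spaced_hits)
print("contiguous seed hits:", contiguous_hits)
```

In this contrived pair every mismatch falls on an ignored position of the spaced pattern, so the spaced seed reports a hit at offset 0. The contiguous seed of the same weight finds nothing, because no run of 11 matching bases exists. Real sensitivity comparisons average over many random alignments rather than a single hand-built example.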
|
Choi JH, Cho HG, Kim S. GAME: a Simple and Efficient Whole Genome Alignment Method Using Maximal Exact Match Filtering. Comput Biol Chem. 2005 Jun;29(3):244-53. |
|
In this paper, we present a simple and efficient whole genome alignment method using maximal exact match (MEM). The major problem with the use of MEM anchor is that the number of hits in non-homologous regions increases exponentially when shorter MEM anchors are used to detect more homologous regions. To deal with this problem, we have developed a fast and accurate anchor filtering scheme based on simple match extension with minimum percent identity and extension length criteria. Due to its simplicity and accuracy, all MEM anchors in a pair of genomes can be exhaustively tested and filtered. In addition, by incorporating the translation technique, the alignment quality and speed of our genome alignment algorithm have been further improved. As a result, our genome alignment algorithm, GAME (Genome Alignment by Match Extension), performs competitively over existing algorithms and can align large whole genomes, e.g., A. thaliana, without the requirement of typical large memory and parallel processors. This is shown using an experiment which compares the performance of BLAST, BLASTZ, PatternHunter, MUMmer and our algorithm in aligning all 45 pairs of 10 microbial genomes. The scalability of our algorithm is shown in another experiment where all pairs of five chromosomes in A. thaliana were compared. |
|
Engels R, Yu T, Burge C, Mesirov J, DeCaprio D, Galagan J. Combo: A Whole Genome Comparative Browser. Bioinformatics 2006 22(14):1782-1783. |
|
Combo is a comparative genome browser that provides a dynamic view of whole genome alignments along with their associated annotations. Combo provides two different visualization perspectives. The perpendicular (dot plot) view provides a dot plot of genome alignments synchronized with a display of genome annotations along each axis. The parallel view displays two genome annotations horizontally, synchronized through a panel displaying local alignments as trapezoids. Users can zoom to any resolution, from whole chromosomes to individual bases. They can select, highlight and view detailed information from specific alignments and annotations. Combo is organism agnostic and can import data from a variety of file formats. |
|
Cui X, Vinar T, Brejová B, Shasha D, Li M. Homology Search for Genes. Bioinformatics. 2007 Jul 1;23(13):i97-103. |
|
MOTIVATION: Life science researchers often require an exhaustive list of protein coding genes similar to a given query gene. To find such genes, homology search tools, such as BLAST or PatternHunter, return a set of high-scoring pairs (HSPs). These HSPs then need to be correlated with existing sequence annotations, or assembled manually into putative gene structures. This process is error-prone and labor-intensive, especially in genomes without reliable gene annotation. RESULTS: We have developed a homology search solution that automates this process, and instead of HSPs returns complete gene structures. We achieve better sensitivity and specificity by adapting a hidden Markov model for gene finding to reflect features of the query gene. Compared to traditional homology search, our novel approach identifies splice sites much more reliably and can even locate exons that were lost in the query gene. On a testing set of 400 mouse query genes, we report 79% exon sensitivity and 80% exon specificity in the human genome based on orthologous genes annotated in NCBI HomoloGene. In the same set, we also found 50 (12%) gene structures with better protein alignment scores than the ones identified in HomoloGene. |
|
Ilie L, Ilie S. Multiple Spaced Seeds for Homology Search. Bioinformatics. 2007 Nov 15;23(22):2969-77. Epub 2007 Sep 5. |
|
MOTIVATION: Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. The introduction of optimal spaced seeds in PatternHunter has increased both the sensitivity and the speed of homology search, and it has been adopted by many alignment programs such as BLAST. With the further improvement provided by multiple spaced seeds in PatternHunterII, Smith-Waterman sensitivity is approached at BLASTn speed. However, computing optimal multiple spaced seeds was proved to be NP-hard and current heuristic algorithms are all very slow (exponential). RESULTS: We give a simple algorithm which computes good multiple seeds in polynomial time. Due to a completely different approach, the difference with respect to the previous methods is dramatic. The multiple spaced seed of PatternHunterII, with 16 weight 11 seeds, was computed in 12 days. It takes us 17 s to find a better one. Our approach changes the way of looking at multiple spaced seeds. |
|
Chung WH, Park SB. Hit Integration for Identifying Optimal Spaced Seeds. BMC Bioinformatics. 2010 Jan 18;11 Suppl 1. |
|
BACKGROUND: The introduction of spaced seeds opened a way to improve sensitivity in homology search without loss of search speed. Since then, efforts to find the optimal seed that maximizes sensitivity have continued to this day. The sensitivity of a seed is generally computed by its hit probability. However, the limitation of hit probability is that it computes the sensitivity only at a specific similarity level, while homologous regions are usually distributed over various similarity levels. As a result, the optimal seed found by hit probability is not actually optimal for various similarity levels. Therefore, a new measure of seed sensitivity is required to recommend seeds that are robust to various similarity levels. RESULTS: We propose a new probability model of sensitivity, hit integration, which covers a range of similarity levels of homologous regions. A novel algorithm for computing hit integration is proposed, based on integrating hit probabilities over a range of similarity levels. We also prove that hit integration is computable by expressing the integral part of hit integration as a recursive formula which can be easily solved by dynamic programming. The experimental results for biological data show that hit integration reveals seeds that are more optimal than those of PatternHunter. CONCLUSION: The presented model is a more general model for estimating sensitivity than hit probability because it relaxes the similarity level. We propose a novel algorithm which directly computes the sensitivity over a range of similarity levels. |
|