BLAST, short for Basic Local Alignment Search Tool, searches for regions of local similarity between a query sequence and a large database of DNA or amino-acid sequences. It is a fundamental tool for many discovery processes in bioinformatics and computational biology, including inferring functional and evolutionary relationships between sequences, identifying members of gene families, and phylogenetic profiling. Consequently, researchers have spent decades making local alignment search tools such as BLAST more efficient in both speed and accuracy. In this paper, we present our approach to more efficient sequence search, which we dub CentroidBLAST. CentroidBLAST first searches a representative fraction of the original database, in which each representative serves as the "centroid" of a cluster of similar sequences. A centroid's cluster is then searched only if its representative sequence is sufficiently similar to the query sequence. This approach delivers up to a 6.85-fold speed-up over NCBI BLAST. In addition, we analyze several aspects of CentroidBLAST, including execution time, the biological significance of the resulting alignments, the selection of the e-value cut-off, and the effect of database compression.
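The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the CentroidBLAST implementation: the k-mer Jaccard `similarity` function stands in for a real BLAST alignment score, and the `threshold` parameter, the `two_stage_search` name, and the toy `clusters` data are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (assumed toy stand-in for a BLAST score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def two_stage_search(
    query: str,
    clusters: Dict[str, List[str]],
    threshold: float = 0.3,  # hypothetical cut-off, not an e-value
) -> List[Tuple[str, float]]:
    """Search cluster members only when the cluster's centroid matches the query."""
    hits: List[Tuple[str, float]] = []
    for centroid, members in clusters.items():
        # Stage 1: compare the query against the representative only.
        if similarity(query, centroid) < threshold:
            continue  # the whole cluster is skipped
        # Stage 2: search every sequence in the matching cluster.
        for seq in members:
            score = similarity(query, seq)
            if score >= threshold:
                hits.append((seq, score))
    # Best-scoring hits first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy database: each key is a centroid, each value is its cluster
# (the centroid itself is listed as a member).
clusters = {
    "ACGTACGTAC": ["ACGTACGTAC", "ACGTACGTTC"],
    "TTTTGGGGCC": ["TTTTGGGGCC", "TTTTGGGGCA"],
}
hits = two_stage_search("ACGTACGTAC", clusters)
```

In this sketch the second cluster is never searched, because its centroid shares no k-mers with the query. That is where the saving comes from, and it also shows the trade-off implied by the approach: a cluster whose representative does not match the query is skipped entirely.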