Fast randomized approximate string matching with succinct hash data structures

Alberto Policriti; Nicola Prezza

首页> 外文期刊>BMC Bioinformatics >Fast randomized approximate string matching with succinct hash data structures

【24h】

Fast randomized approximate string matching with succinct hash data structures

机译：快速随机近似字符串匹配，具有简洁的哈希数据结构

获取原文

掌桥外文数据库（机构版） >>

开具论文收录证明 >>

文献代查 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

Background The high throughput of modern NGS sequencers coupled with the huge sizes of genomes currently analysed, poses always higher algorithmic challenges to align short reads quickly and accurately against a reference sequence. A crucial, additional, requirement is that the data structures used should be light . The available modern solutions usually are a compromise between the mentioned constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. Methods In this work we describe a technique that permits to attain the advantages granted by both classes of indexes. This is achieved using Hamming-aware hash functions--hash functions designed to search the entire Hamming sphere in reduced time--which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space introducing only a logarithmic slowdown (in the query length) for the lookup operation. We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to be un-compressible. Results The new data structure introduced in this work is called dB-hash and we show how its implementation--BW-ERNE--maintains the high sensitivity and speed of its (hash-based) predecessor ERNE, while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data, show, finally, that BW-ERNE is able to attain both the positive features of succinct data structures (that is, small space) and hash indexes (that is, sensitivity). Conclusions In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughputs and memory footprints. In this work we show that, combining hashing and succinct indexing techniques, we can attain good performances and accuracy with a memory footprint comparable to that of the most popular compressed indexes.

机译：背景技术现代NGS测序仪的高通量，加上目前正在分析的庞大基因组，始终带来更高的算法挑战，需要快速准确地将短片段与参考序列对齐。一项至关重要的附加要求是所使用的数据结构应轻巧。可用的现代解决方案通常是上述约束之间的折衷：特别是，基于Burrows-Wheeler变换的索引以较低的敏感度为代价提供了减少的内存需求，而基于哈希的文本索引以显着的内存为代价保证了高敏感度消费。方法在本文中，我们描述了一种技术，该技术可以实现两种索引所赋予的优势。这是通过使用可识别汉明的哈希函数（旨在减少时间搜索整个汉明球体的哈希函数）实现的，这些哈希函数也是de Bruijn图上的同态。我们表明，使用这种特殊的哈希函数类，可以在线性空间中表示相应的哈希索引，从而仅引入对数减慢（以查询长度为单位）进行查找操作。我们指出，我们的数据结构无需压缩其输入即可达到其目标：另一个积极的特点，因为在生物应用中，数据通常非常接近不可压缩。结果这项工作中引入的新数据结构称为dB-hash，我们将展示其实现方式-BW-ERNE-保持其（基于哈希的）前身ERNE的高灵敏度和速度，同时显着减少空间消耗。最后，使用几种流行的对齐工具对模拟和实际NGS数据进行的广泛比较实验表明，BW-ERNE能够获得简洁数据结构（即小空间）和哈希索引（即，灵敏度）。结论在同时考虑空间和速度的应用中，标准方法通常会牺牲精度来获得有竞争力的吞吐量和内存占用量。在这项工作中，我们表明，结合使用哈希和简洁的索引技术，我们可以以与最流行的压缩索引相当的内存占用量来获得良好的性能和准确性。

著录项

来源
《BMC Bioinformatics》 |2015年第9期|共页
作者
Alberto Policriti; Nicola Prezza;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种
中图分类生物科学;
关键词

相似文献

外文文献
中文文献
专利

1. A hash trie filter method for approximate string matching in genomic databases [J] . Ye-In Chang, Jiun-Rung Chen, Min-Tze Hsu Applied Intelligence . 2010,第1期

机译：用于基因组数据库中近似字符串匹配的哈希特里过滤器方法
2. A hash trie filter method for approximate string matching in genomic databases [J] . Chang Y.-I., Chen J.-R., Hsu M.-T. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies . 2010,第1期

机译：用于基因组数据库中近似字符串匹配的哈希特里过滤器方法
3. Faster string matching based on hashing and bit-parallelism [J] . Al-Ssulami Abdulrakeeb M., Mathkour Hassan Information Processing Letters . 2017,第Jula期

机译：基于散列和位并行的更快的字符串匹配
4. A Hash Trie Filter Approach to Approximate String Matching for Genomic Databases [C] . Ye-In Chang, Jiun-Rung Chen, Min-Tze Hsu Next-generation applied intelligence . 2009

机译：用于基因组数据库近似字符串匹配的哈希树过滤器方法
5. Fast Locality Sensitive Hashing Algorithm for Approximate Nearest Neighbor Search: A Practical Data Mining Approach. [D] . Buaba, Ruben. 2012

机译：近似最近邻居搜索的快速局部敏感哈希算法：一种实用的数据挖掘方法。
6. Fast randomized approximate string matching with succinct hash data structures [O] . Alberto Policriti, Nicola Prezza 2015

机译：快速随机近似字符串匹配具有简洁的哈希数据结构
7. Fast randomized approximate string matching with succinct hash data structures [O] . Alberto Policriti, Nicola Prezza 2015

机译：快速随机近似字符串匹配，具有简洁的哈希数据结构

Fast randomized approximate string matching with succinct hash data structures

摘要

著录项

相似文献

相关主题

期刊订阅