首页> 外文期刊>BMC Bioinformatics >Fast randomized approximate string matching with succinct hash data structures
【24h】

Fast randomized approximate string matching with succinct hash data structures

机译:快速随机近似字符串匹配,具有简洁的哈希数据结构

获取原文
           

摘要

Background The high throughput of modern NGS sequencers coupled with the huge sizes of genomes currently analysed, poses always higher algorithmic challenges to align short reads quickly and accurately against a reference sequence. A crucial, additional, requirement is that the data structures used should be light . The available modern solutions usually are a compromise between the mentioned constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. Methods In this work we describe a technique that permits to attain the advantages granted by both classes of indexes. This is achieved using Hamming-aware hash functions--hash functions designed to search the entire Hamming sphere in reduced time--which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space introducing only a logarithmic slowdown (in the query length) for the lookup operation. We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to be un-compressible. Results The new data structure introduced in this work is called dB-hash and we show how its implementation--BW-ERNE--maintains the high sensitivity and speed of its (hash-based) predecessor ERNE, while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data, show, finally, that BW-ERNE is able to attain both the positive features of succinct data structures (that is, small space) and hash indexes (that is, sensitivity). Conclusions In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughputs and memory footprints. In this work we show that, combining hashing and succinct indexing techniques, we can attain good performances and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
机译:背景技术现代NGS测序仪的高通量,加上目前正在分析的庞大基因组,始终带来更高的算法挑战,需要快速准确地将短片段与参考序列对齐。一项至关重要的附加要求是所使用的数据结构应轻巧。可用的现代解决方案通常是上述约束之间的折衷:特别是,基于Burrows-Wheeler变换的索引以较低的敏感度为代价提供了减少的内存需求,而基于哈希的文本索引以显着的内存为代价保证了高敏感度消费。方法在本文中,我们描述了一种技术,该技术可以实现两种索引所赋予的优势。这是通过使用可识别汉明的哈希函数(旨在减少时间搜索整个汉明球体的哈希函数)实现的,这些哈希函数也是de Bruijn图上的同态。我们表明,使用这种特殊的哈希函数类,可以在线性空间中表示相应的哈希索引,从而仅引入对数减慢(以查询长度为单位)进行查找操作。我们指出,我们的数据结构无需压缩其输入即可达到其目标:另一个积极的特点,因为在生物应用中,数据通常非常接近不可压缩。结果这项工作中引入的新数据结构称为dB-hash,我们将展示其实现方式-BW-ERNE-保持其(基于哈希的)前身ERNE的高灵敏度和速度,同时显着减少空间消耗。最后,使用几种流行的对齐工具对模拟和实际NGS数据进行的广泛比较实验表明,BW-ERNE能够获得简洁数据结构(即小空间)和哈希索引(即, 灵敏度)。结论在同时考虑空间和速度的应用中,标准方法通常会牺牲精度来获得有竞争力的吞吐量和内存占用量。在这项工作中,我们表明,结合使用哈希和简洁的索引技术,我们可以以与最流行的压缩索引相当的内存占用量来获得良好的性能和准确性。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号