Journal: BMC Genomics

Subset selection of high-depth next generation sequencing reads for de novo genome assembly using MapReduce framework

Abstract

Background: Recent progress in next-generation sequencing technology has brought several improvements, including ultra-high throughput at low cost, very high read quality, and substantially increased sequencing depth. State-of-the-art high-throughput sequencers such as the Illumina MiSeq system can generate ~15 Gbp of sequencing data per run, with >80% of bases above Q30 and a sequencing depth of up to several thousand-fold for small genomes. The Illumina HiSeq 2500 can generate up to 1 Tbp per run, with >80% of bases above Q30 and often >100x sequencing depth for large genomes. To speed up otherwise time-consuming genome assembly, and/or to quickly obtain a skeleton of the assembly for scaffolding or progressive assembly, methods that remove noise and reduce redundancy in the original data while yielding nearly equal or better assembly results are worth studying.

Results: We developed two subset selection methods for single-end reads and one method for paired-end reads, based on base quality scores and other read analytic tools, using the MapReduce framework. We proposed two strategies for selecting reads: MinimalQ and ProductQ. MinimalQ selects reads whose minimal base quality is above a threshold. ProductQ selects reads whose probability of containing no incorrect base is above a threshold. In the single-end experiments, we used MiSeq datasets of Escherichia coli and Bacillus cereus, the Velvet assembler for genome assembly, and the GAGE benchmark tools for result evaluation. In the paired-end experiments, we used a HiSeq dataset of the giant grouper (Epinephelus lanceolatus), the ALLPATHS-LG genome assembler, and the QUAST quality assessment tool to compare genome assemblies of the original set and the subset. The results show that subset selection can not only speed up genome assembly but also produce substantially longer scaffolds.

Availability: The software is freely available at https://github.com/moneycat/QReadSelector .
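The two selection rules named in the abstract can be illustrated with a minimal single-machine sketch. This is not the QReadSelector implementation and does not use MapReduce. The function names are hypothetical, the quality strings are assumed to use the Phred+33 encoding, "above a threshold" is read as "greater than or equal to", and the per-base error probability is taken from the standard Phred definition, p = 10^(-Q/10).

```python
import math

# Assumption: quality strings use Phred+33 (Sanger / Illumina 1.8+) encoding.
PHRED_OFFSET = 33


def phred_scores(quality_string):
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def passes_minimal_q(quality_string, min_quality):
    """MinimalQ-style rule (hypothetical name): keep the read if its
    lowest base quality reaches the threshold (>= is an assumption)."""
    scores = phred_scores(quality_string)
    if not scores:
        return False  # an empty read carries no usable bases
    return min(scores) >= min_quality


def prob_no_error(quality_string):
    """Probability that every base in the read is correct, treating
    base errors as independent with p_error = 10 ** (-Q / 10)."""
    return math.prod(1.0 - 10.0 ** (-q / 10.0)
                     for q in phred_scores(quality_string))


def passes_product_q(quality_string, min_probability):
    """ProductQ-style rule (hypothetical name): keep the read if the
    probability of no incorrect base reaches the threshold."""
    if not quality_string:
        return False
    return prob_no_error(quality_string) >= min_probability


if __name__ == "__main__":
    # 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
    for qual in ("IIII", "II5I"):
        print(qual, passes_minimal_q(qual, 30), passes_product_q(qual, 0.999))
```

The two rules differ in how they treat read length: the MinimalQ-style check depends only on the single worst base, while the ProductQ-style check multiplies per-base probabilities, so a long read made of moderately good bases can fail it even when no individual base is poor.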
