Journal: BMC Bioinformatics

Recommendations for performance optimizations when using GATK3.8 and GATK4

Abstract

BACKGROUND: Use of the Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. Recently the toolkit has been evolving rapidly. Significant computational performance improvements were introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as a stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements to help the community stay abreast of changes in performance.

RESULTS: We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best practices variant calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in a run time of only a few hours on a whole human genome sequenced to a depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU.

CONCLUSIONS: In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8, splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be nnn.4 hours at the cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud. For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be ~34.1 hours on 40 samples, with 1.18 samples processed per hour at the cost of $2.60 per sample on a c5.18xlarge instance of Amazon Cloud.
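The throughput and cost figures in the conclusions can be cross-checked with a little arithmetic. The sketch below is only an illustration and uses nothing beyond the numbers quoted in the abstract. The function and variable names are hypothetical, and the hourly instance price is back-computed from those numbers rather than taken from any price list.

```python
# Cross-check of the figures quoted in the abstract (illustrative sketch only).

def throughput(samples: int, walltime_hours: float) -> float:
    """Samples completed per hour of wall-clock time."""
    return samples / walltime_hours


def implied_hourly_rate(cost_per_sample: float, samples: int,
                        walltime_hours: float, nodes: int = 1) -> float:
    """Hourly price per node implied by a per-sample cost (a derived value,
    not an official price)."""
    return cost_per_sample * samples / (walltime_hours * nodes)


# Values stated in the abstract.
gatk4_samples = 40
gatk4_walltime_h = 34.1
gatk4_cost_per_sample = 2.60    # USD, one c5.18xlarge node
gatk38_cost_per_sample = 41.60  # USD, four c5.18xlarge nodes, one sample

gatk4_rate = throughput(gatk4_samples, gatk4_walltime_h)
hourly_price = implied_hourly_rate(gatk4_cost_per_sample, gatk4_samples,
                                   gatk4_walltime_h)
cost_ratio = gatk38_cost_per_sample / gatk4_cost_per_sample

print(f"GATK4 throughput: {gatk4_rate:.2f} samples/hour")
print(f"Implied node price: ${hourly_price:.2f}/hour")
print(f"GATK3.8 vs GATK4 cost per sample: {cost_ratio:.0f}x")
```

Dividing 40 samples by 34.1 hours gives about 1.17 samples per hour, slightly below the stated 1.18. The gap is presumably due to rounding of the approximate walltime. On these figures, the time-optimized GATK3.8 configuration costs about 16 times as much per sample as the throughput-optimized GATK4 configuration.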
