Estimating the information value of polymorphic sites using pooled sequences

Ketil Malde

Abstract

Background: High-throughput sequencing is a cost-effective method for identifying genetic variation, and it is currently in use on a large scale across the field of biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part due to artifacts and biases inherent in the sequencing process. Diagnostic variants are commonly selected using diversity statistics like F_ST, but these measures are not ideal for the task.

Results: Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account the uncertainty introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to commonly used existing methods for identifying diagnostic polymorphisms.

Conclusion: The expected information content gives an easy-to-interpret measure of the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.

