Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics

Data Mining on DNA Sequences of Hepatitis B Virus

Abstract

Extracting meaningful information from large experimental data sets is a key element of bioinformatics research. One challenge is to identify genomic markers in Hepatitis B Virus (HBV) that are associated with the development of hepatocellular carcinoma (HCC, a form of liver cancer) by comparing complete HBV genomic sequences from patients with HCC against those from patients without HCC. This study introduces a data mining framework that comprises molecular evolution analysis, clustering, feature selection, classifier learning, and classification. Our research group has collected HBV DNA sequences of genotype B or C from over 200 patients specifically for this project. In the molecular evolution analysis and clustering step, three subgroups were identified within genotype C, and a clustering method was developed to separate them. In the feature selection step, potential markers are selected by Information Gain for subsequent classifier learning. Meaningful rules are then learned by our algorithm, called Rule Learning, which is based on an Evolutionary Algorithm. In addition, a new classification method based on the Nonlinear Integral has been developed. Its good performance comes from the use of a fuzzy measure and the associated nonlinear integral: the nonadditivity of the fuzzy measure reflects both the importance of the individual feature attributes and the interactions among them. Both classifiers therefore give explicit information on the importance of individual mutated sites, and of their interactions, to the classification (in our case, potential causes of liver cancer). A thorough comparison of these two methods with existing methods is presented. Important mutation markers (sites) have been found for genotype B and for each of the genotype C subgroups C1, C2, and C3. Both classification methods have been applied to previously unseen examples for validation. The results show that the classification methods achieve more than 70 percent accuracy and 80 percent sensitivity on most data sets, which is considered high for an initial screening method for liver cancer diagnosis.
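
The abstract names Information Gain as the criterion for selecting candidate marker sites but gives no formula or code. The following is a minimal sketch of how such a ranking could look, assuming aligned sequences of equal length and one class label per patient. The function names (`entropy`, `information_gain`, `rank_sites`) and the toy sequences are illustrative assumptions, not the authors' implementation or data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(site_values, labels):
    """Label entropy minus the entropy that remains after splitting on one site."""
    total = len(labels)
    groups = {}
    for value, label in zip(site_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_sites(sequences, labels):
    """Return (position, gain) pairs for aligned sequences, highest gain first."""
    length = len(sequences[0])
    gains = [
        (pos, information_gain([seq[pos] for seq in sequences], labels))
        for pos in range(length)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Toy, made-up aligned fragments (hypothetical, not real HBV data).
toy_sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
toy_labels = ["HCC", "HCC", "control", "control"]
ranking = rank_sites(toy_sequences, toy_labels)
```

In this toy input only position 0 separates the two classes, so it ranks first. In a setting like the one described, the top-ranked positions would be the candidates passed on to classifier learning.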
