Pacific Symposium on Biocomputing 2006

LARGE-SCALE TESTING OF BIBLIOME INFORMATICS USING PFAM PROTEIN FAMILIES

Abstract

Literature mining is expected to help not only with automatically sifting through the huge biomedical literature and annotation databases, but also with linking biochemical entities to appropriate functional hypotheses. However, there has been very limited success in testing literature mining methods, owing to the lack of large, objectively validated test sets or "gold standards". To improve this situation, we created a large-scale test of literature mining methods and resources. We report on a specific implementation of this test: how well can the Pfam protein family classification be replicated by independently mining different literature/annotation resources? We test and compare different keyterm sets as well as different algorithms for issuing protein family predictions. We find that protein families can indeed be automatically predicted from the literature. Using words from PubMed abstracts, over 75% of the 3663 proteins tested were correctly assigned to one of 618 Pfam families. For 90% of proteins, the correct Pfam family was among the top 5 ranked families. We found that protein family prediction is far better with keywords extracted from PubMed abstracts than with GO annotations or MeSH keyterms, suggesting that the text itself (in combination with the vector space model) is superior to GO and MeSH as a literature mining resource, at least for detecting protein family membership. Finally, we show that Shannon's entropy can be exploited to improve prediction by facilitating the integration of the different literature sources tested.
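
The abstract names a vector space model and Shannon's entropy but does not spell out the scoring or the integration rule. The following is only a minimal sketch under stated assumptions: raw term counts as vectors, cosine similarity for ranking families, and "trust the source whose score distribution has the lowest entropy" as the integration rule. All function names and the toy word lists are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_families(protein_words, family_profiles):
    """Rank families by similarity to a protein's keyterms, best first.

    family_profiles maps a family name to a list of words (assumed format).
    """
    query = Counter(protein_words)
    scores = {
        family: cosine(query, Counter(words))
        for family, words in family_profiles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def shannon_entropy(scores):
    """Entropy (in bits) of scores normalized to a probability distribution.

    Low entropy means one family dominates; all-zero scores are treated
    as a uniform (maximally uncertain) distribution.
    """
    if not scores:
        return 0.0
    total = sum(scores)
    if total == 0:
        return math.log2(len(scores))
    probs = [s / total for s in scores if s > 0]
    return abs(-sum(p * math.log2(p) for p in probs))


def most_confident_source(rankings_by_source):
    """Assumed integration rule: pick the source with the lowest entropy."""
    return min(
        rankings_by_source,
        key=lambda src: shannon_entropy([s for _, s in rankings_by_source[src]]),
    )


# Toy data, invented for illustration only.
profiles = {
    "Pkinase": ["kinase", "phosphorylation", "atp", "serine"],
    "Globin": ["heme", "oxygen", "binding", "globin"],
}
rankings = {
    "abstracts": rank_families(
        ["atp", "dependent", "phosphorylation", "kinase"], profiles
    ),
    "mesh": rank_families(["binding", "atp"], profiles),
}
best = most_confident_source(rankings)
print(best, rankings[best][0][0])
```

In this toy case the abstract-derived words point clearly at one family, while the second keyterm set matches both families equally, so the lowest-entropy rule prefers the abstract-based ranking. A "top 5" evaluation like the one reported above would simply check whether the true family appears in the first five entries of the returned ranking.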