
A Part-Of-Speech Term Weighting Scheme for Biomedical Information Retrieval


Abstract

In the era of digitalization, information retrieval (IR), which retrieves and ranks documents from large collections according to users’ search queries, has been widely applied in the biomedical domain. Typical IR use cases include building patient cohorts from electronic health records (EHRs) and searching the literature for topics of interest. Meanwhile, natural language processing (NLP) techniques, such as tokenization and part-of-speech (POS) tagging, have been developed for processing clinical documents and biomedical literature. We hypothesize that NLP can be incorporated into IR to strengthen conventional IR models. In this study, we propose two NLP-empowered IR models, POS-BoW and POS-MRF, which incorporate automatic POS-based term weighting schemes into bag-of-words (BoW) and Markov Random Field (MRF) IR models, respectively. In the proposed models, the POS-based term weights are calculated iteratively using a cyclic coordinate method, in which a golden section line search is applied along each coordinate to optimize an objective function defined by mean average precision (MAP). In the empirical experiments, we used the data sets from the Medical Records track of the Text REtrieval Conference (TREC) 2011 and 2012 and the Genomics track of TREC 2004. The evaluation on the TREC 2011 and 2012 Medical Records tracks shows that, for the POS-BoW models, the mean improvement rates for the IR evaluation metrics MAP, bpref, and P@10 are 10.88%, 4.54%, and 3.82%, respectively, compared to the BoW models; for the POS-MRF models, these rates are 13.59%, 8.20%, and 8.78%, compared to the MRF models. Additionally, we experimentally verify that the proposed weighting approach is superior to simple heuristic and frequency-based weighting approaches, and we validate our POS category selection. Using the optimal weights calculated in this experiment, we tested the proposed models on the TREC 2004 Genomics track and obtained average improvement rates of 8.63% and 10.04% for POS-BoW and POS-MRF, respectively. These significant improvements verify the effectiveness of leveraging POS tagging for biomedical IR tasks.
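
The weight optimization described above (a cyclic coordinate method with a golden section line search along each coordinate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weight range [0, 1], the tolerances, and the stopping rule are all assumptions made for illustration, and a simple concave toy function stands in for MAP, which in practice would require running the retrieval model over the training queries for each candidate weight vector.

```python
import math
from typing import Callable, List, Sequence

INV_PHI = (math.sqrt(5) - 1) / 2  # inverse golden ratio, about 0.618


def golden_section_max(f: Callable[[float], float],
                       lo: float, hi: float, tol: float = 1e-4) -> float:
    """Approximate the maximizer of f on [lo, hi], assuming f is unimodal there."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; reuse c as the new right probe.
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:
            # The maximum lies in [c, b]; reuse d as the new left probe.
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


def cyclic_coordinate_max(objective: Callable[[Sequence[float]], float],
                          weights: Sequence[float],
                          lo: float = 0.0, hi: float = 1.0,
                          max_sweeps: int = 50, eps: float = 1e-6) -> List[float]:
    """Optimize one weight at a time, cycling until a full sweep gains less than eps."""
    w = list(weights)
    best = objective(w)
    for _ in range(max_sweeps):
        for i in range(len(w)):
            def along_axis(x: float, i: int = i) -> float:
                trial = w.copy()
                trial[i] = x
                return objective(trial)

            candidate = golden_section_max(along_axis, lo, hi)
            # Accept the line-search result only if it improves the objective.
            if along_axis(candidate) > objective(w):
                w[i] = candidate
        current = objective(w)
        if current - best < eps:
            break
        best = current
    return w


# Hypothetical stand-in for MAP: a concave function peaking at toy_target.
# Each coordinate could represent the weight of one POS category.
toy_target = [0.8, 0.3, 0.5]


def toy_objective(w: Sequence[float]) -> float:
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, toy_target))


learned = cyclic_coordinate_max(toy_objective, [0.5, 0.5, 0.5])
print([round(x, 3) for x in learned])
```

Golden section search is only guaranteed to find the optimum of a unimodal function. MAP as a function of a single term weight is generally piecewise constant and need not be unimodal, so on real retrieval data a search of this kind should be read as a heuristic that may stop at a local optimum.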