Database: The Journal of Biological Databases and Curation

Rule-based deduplication of article records from bibliographic databases


Abstract

We recently designed and deployed a metasearch engine, Metta, that sends queries to, and retrieves search results from, five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem, because data fields contain many missing and erroneous entries and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote, without making any erroneous assignments.
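The sequential comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' open-source module. The record layout (dictionaries with `year`, `pmid`, `doi`, `journal`, `title` and `authors` keys), the function names, the similarity thresholds and the choice of `difflib.SequenceMatcher` as the text approximation technique are all assumptions made for illustration. In keeping with the stated preference for recall, the sketch treats missing fields as non-matching and merges two records only when every available rule agrees.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", (text or "").lower()).strip()


def similar(a, b, threshold):
    """Approximate string match; a missing value never counts as a match."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def first_author_surname(record):
    """Assumes authors are listed surname first, e.g. 'Doe J' or 'Doe, Jane'."""
    authors = record.get("authors") or []
    return normalize(authors[0]).split(" ")[0] if authors else ""


def is_duplicate(r1, r2):
    """Apply the rules in sequence, erring on the side of keeping records apart."""
    # Only records with the same, known publication year are compared.
    if not r1.get("year") or r1.get("year") != r2.get("year"):
        return False
    # Identifier rules: when both records carry the same kind of ID, it decides.
    for key in ("pmid", "doi"):
        v1, v2 = r1.get(key), r2.get(key)
        if v1 and v2:
            return str(v1).strip().lower() == str(v2).strip().lower()
    # Fallback: journal, title and first author must all agree approximately.
    # The thresholds 0.8 and 0.95 are hypothetical, not taken from the paper.
    surname = first_author_surname(r1)
    return (
        similar(r1.get("journal"), r2.get("journal"), 0.8)
        and similar(r1.get("title"), r2.get("title"), 0.95)
        and surname != ""
        and surname == first_author_surname(r2)
    )


def deduplicate(records):
    """Keep the first record of each duplicate group, preserving input order."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k) for k in kept):
            kept.append(rec)
    return kept
```

Calling `deduplicate(records)` on a merged result list returns the records with later duplicates dropped. A production module would also need to handle journal abbreviations, which a plain string ratio like the one above does not capture.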
