BMC Medical Informatics and Decision Making

Doublet method for very fast autocoding

Abstract

Background: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding.

Methods: An autocoder was written that transforms plain text into intercalated word doublets (e.g., "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Matching doublets are assigned a numeric code specific for each doublet found in the nomenclature. Text doublets that do not match the index of doublets extracted from the nomenclature are not part of valid nomenclature terms. Runs of matching doublets from text are concatenated and matched against nomenclature terms (also represented as runs of doublets).

Results: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder.

Conclusions: The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.
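
The steps described in the Methods can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the implementation provided with the article. The function names (`to_doublets`, `build_indexes`, `autocode`), the toy nomenclature, and its codes (`T1`, `T2`) are hypothetical. The greedy longest-match search inside each run and the skipping of single-word terms are assumptions made to keep the example short.

```python
import re


def to_doublets(text):
    """Return the intercalated word doublets of a text, in order (lowercased)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_indexes(nomenclature):
    """Build the doublet index and the term index from a term -> code mapping."""
    doublet_ids = {}  # doublet string -> numeric doublet code
    term_runs = {}    # tuple of doublet codes -> (term, concept code)
    for term, code in nomenclature.items():
        run = tuple(
            doublet_ids.setdefault(d, len(doublet_ids)) for d in to_doublets(term)
        )
        if run:  # assumption: single-word terms have no doublets and are skipped
            term_runs[run] = (term, code)
    return doublet_ids, term_runs


def autocode(text, doublet_ids, term_runs):
    """Return (term, code) pairs found in text, in order of appearance."""
    # None marks a text doublet that is absent from the nomenclature index.
    ids = [doublet_ids.get(d) for d in to_doublets(text)]
    found = []
    i, n = 0, len(ids)
    while i < n:
        if ids[i] is None:
            i += 1
            continue
        # Extend to the end of the current run of matching doublets.
        j = i
        while j < n and ids[j] is not None:
            j += 1
        # Assumption: greedy longest match of sub-runs against the term index.
        k = i
        while k < j:
            for end in range(j, k, -1):
                hit = term_runs.get(tuple(ids[k:end]))
                if hit:
                    found.append(hit)
                    k = end + 1  # skip the doublet sharing the term's last word
                    break
            else:
                k += 1
        i = j
    return found


# Toy nomenclature with made-up codes, for illustration only.
TOY_NOMENCLATURE = {
    "renal cell carcinoma": "T1",
    "squamous cell carcinoma": "T2",
}

if __name__ == "__main__":
    doublet_index, term_index = build_indexes(TOY_NOMENCLATURE)
    sample = "The biopsy showed renal cell carcinoma and a squamous cell carcinoma."
    print(autocode(sample, doublet_index, term_index))
```

In this sketch, most text doublets (for example "the biopsy") fail the dictionary lookup and are discarded immediately, so only short runs of matching doublets are ever compared against nomenclature terms. A run that matches the doublet index but not a whole term, such as "cell carcinoma" alone, produces no code.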
