Journal: Journal of the American Society for Information Science and Technology

Automatic thesaurus development: Term extraction from title metadata

Abstract

The application of thesauri in networked environments is seriously hampered by the challenges of introducing new concepts and terminology into the formal controlled vocabulary, which is critical for enhancing its retrieval capability. The author describes an automated process of adding new terms to thesauri as entry vocabulary by analyzing the association between words/phrases extracted from bibliographic titles and subject descriptors in the metadata record (subject descriptors are terms assigned from controlled vocabularies of thesauri to describe the subjects of the objects [e.g., books, articles] represented by the metadata records). The investigated approach uses a corpus of metadata for scientific and technical (S&T) publications in which the titles contain substantive words for key topics. The three steps of the method are (a) extracting words and phrases from the title field of the metadata; (b) applying a method to identify and select the specific and meaningful keywords based on the associated controlled vocabulary terms from the thesaurus used to catalog the objects; and (c) inserting selected keywords into the thesaurus as new terms (most of them are in hierarchical relationships with the existing concepts), thereby updating the thesaurus with new terminology that is being used in the literature. The effectiveness of the method was demonstrated by an experiment with the Chinese Classification Thesaurus (CCT) and bibliographic data in China Machine-Readable Cataloging Record (MARC) format (CNMARC) provided by Peking University Library. This approach is equally effective in large-scale collections and in other languages.
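The three steps named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the author's implementation: the abstract does not state which association measure or thresholds were used, so the sketch substitutes a simple co-occurrence ratio (`min_count`, `min_conf`), whitespace tokenization with a small stopword list (Chinese titles such as those in CNMARC records would need a word segmenter instead), a hypothetical `entry_terms` field for the thesaurus, and made-up toy records.

```python
from collections import Counter

# Hypothetical toy records (title, subject descriptors); not data from the paper.
RECORDS = [
    ("neural network models for image retrieval", ["information retrieval"]),
    ("image retrieval with relevance feedback", ["information retrieval"]),
    ("image retrieval in digital libraries", ["information retrieval", "digital libraries"]),
    ("metadata quality in digital libraries", ["digital libraries"]),
]

STOPWORDS = {"a", "and", "for", "in", "of", "the", "with"}


def extract_candidates(title, max_len=2):
    """Step (a): word n-grams (1..max_len) that neither start nor end with a stopword."""
    words = title.lower().split()
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                found.add(" ".join(gram))
    return found


def is_subphrase(short, long):
    """True if `short` occurs in `long` on word boundaries (equal phrases included)."""
    return f" {short} " in f" {long} "


def select_keywords(records, min_count=2, min_conf=0.8):
    """Step (b): keep candidates that co-occur consistently with a descriptor.

    The co-occurrence ratio used here is an assumed stand-in for the
    association analysis described in the abstract.
    """
    term_count, pair_count = Counter(), Counter()
    for title, descriptors in records:
        for term in extract_candidates(title):
            term_count[term] += 1
            for desc in descriptors:
                pair_count[(term, desc)] += 1

    # Candidates strongly tied to a descriptor and not already part of it.
    strong = [
        (term, desc)
        for (term, desc), n in pair_count.items()
        if n >= min_count
        and n / term_count[term] >= min_conf
        and not is_subphrase(term, desc)
    ]
    # Prefer the most specific phrase: drop a term contained in a longer
    # selected term that points at the same descriptor.
    return sorted(
        (term, desc)
        for term, desc in strong
        if not any(d == desc and t != term and is_subphrase(term, t) for t, d in strong)
    )


def add_entry_terms(thesaurus, selected):
    """Step (c): attach each selected keyword to its descriptor as an entry term."""
    for term, desc in selected:
        terms = thesaurus.setdefault(desc, {}).setdefault("entry_terms", [])
        if term not in terms:
            terms.append(term)
    return thesaurus


if __name__ == "__main__":
    selected = select_keywords(RECORDS)
    print(add_entry_terms({"information retrieval": {}, "digital libraries": {}}, selected))
```

On the toy records, only "image retrieval" survives selection and is attached to "information retrieval": single words and phrases that merely repeat part of an existing descriptor are filtered out. A real run would also have to decide the relationship type for each new term, since the abstract notes that most inserted terms stand in hierarchical relationships with existing concepts.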