
Measuring Termhood in Automatic Terminology Extraction



Abstract

Automatic terminology extraction (ATE) can be divided into two tasks. The first measures unithood, which is used to identify a string as a lexical unit. The second measures so-called termhood, which is used to identify a lexical unit as a domain-specific term. This paper proposes a method for measuring termhood in Chinese ATE. It considers the domain specificity of the components of a candidate term, together with statistical and other contextual information across different domains, and applies these features to a support vector machine model for terminology extraction. The experiments are based on a Chinese corpus in the IT domain, with cross-validation on data from outside the IT domain. Results show that the precision of the open tests can reach over 80% for the top 2,000 candidates and around 50% for the top 20,000 candidates. Furthermore, experiments with different lexicon sizes show that the algorithm does not require a large, comprehensive domain lexicon: a few thousand basic domain terms are sufficient to achieve the performance reported above.
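To make the evaluation setup more concrete, the following is a minimal Python sketch under stated assumptions, not the paper's implementation. It ranks candidates by a single smoothed frequency ratio between an in-domain corpus and an out-of-domain corpus, then computes precision over the top k candidates, which is how figures such as "80% for the top 2,000" are normally obtained. The paper itself feeds several features into a support vector machine; the ratio used here is a simplified stand-in for one kind of domain-specificity signal. The function names, the smoothing constant, and the toy counts are all hypothetical.

```python
from collections import Counter


def domain_specificity(candidate, domain_counts, general_counts, smoothing=1.0):
    """Smoothed ratio of in-domain to out-of-domain relative frequency.

    Assumption: this ratio is an illustrative feature, not the paper's formula.
    """
    domain_total = sum(domain_counts.values())
    general_total = sum(general_counts.values())
    p_domain = (domain_counts.get(candidate, 0) + smoothing) / (domain_total + smoothing)
    p_general = (general_counts.get(candidate, 0) + smoothing) / (general_total + smoothing)
    return p_domain / p_general


def precision_at_k(ranked_candidates, gold_terms, k):
    """Fraction of the top-k ranked candidates that are true domain terms."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    return sum(1 for c in top if c in gold_terms) / len(top)


# Toy counts (hypothetical): two IT terms and one common function word.
domain_counts = Counter({"服务器": 8, "数据库": 6, "我们": 6})
general_counts = Counter({"服务器": 1, "数据库": 1, "我们": 18})
gold_terms = {"服务器", "数据库"}

# Rank candidates from most to least domain-specific.
ranked = sorted(
    domain_counts,
    key=lambda c: domain_specificity(c, domain_counts, general_counts),
    reverse=True,
)

print(ranked)
print(precision_at_k(ranked, gold_terms, 2))
print(precision_at_k(ranked, gold_terms, 3))
```

In a setup closer to the one described, the score used for ranking would come from a trained classifier's decision value rather than from one ratio, while the precision-at-k computation would stay the same.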
