Conference: International Conference on Computer Technology and Development

RESEARCH ON THEMATIC WORD EXTRACTION BASED ON HIGH QUALITY DATA SOURCES ON THE WEB

Abstract

Data source selection is one of the most important steps in domain thematic word extraction. Most previous work has focused on how to extract keywords from an existing corpus with good algorithms, while very little research has examined how to find good data sources for collecting a text corpus. This paper studies how to use online web tools to identify high-quality data sources. Then, considering the characteristics of subject keywords, we propose an improved TF-IDF weight calculation formula for keyword ranking, and extract domain keywords from the documents by recalculating the weights of candidate words with the improved method. Finally, taking the field of Chinese herbal medicine as an example, our results show that the method achieves much higher accuracy and recall at much lower cost.
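
For orientation, the standard TF-IDF weighting that the abstract's improved formula starts from can be sketched as follows. This is only a baseline illustration: the abstract does not state the improved formula, so none is reproduced here. The function names `tfidf_scores` and `top_keywords`, the corpus-level ranking by summed weight, and the toy herbal-medicine corpus mentioned afterwards are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: weight} dict per tokenized document (standard TF-IDF).

    Baseline only; this is NOT the paper's improved formula.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # tf = relative frequency in the document, idf = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(docs, k=3):
    """Rank terms by summed TF-IDF weight over the corpus (an assumed ranking rule)."""
    totals = Counter()
    for doc_scores in tfidf_scores(docs):
        totals.update(doc_scores)  # adds each term's weight to its running total
    return [term for term, _ in totals.most_common(k)]
```

On a hypothetical three-document corpus such as `[["ginseng", "root", "tonic", "herb"], ["licorice", "root", "herb"], ["ginseng", "tonic", "herb"]]`, the term "herb" occurs in every document and therefore receives weight zero, which is the kind of behaviour a domain-aware reweighting would presumably need to adjust for subject keywords.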
