Conference: Australasian joint conference on artificial intelligence
Enhanced N-Gram Extraction Using Relevance Feature Discovery

Abstract

Guaranteeing the quality of extracted features that describe relevant knowledge to users or topics is a challenge because of the large number of extracted features. Most popular term-based feature selection methods suffer from noisy feature extraction: many of the features they extract are irrelevant to the user's needs. One popular alternative is to extract phrases or n-grams to describe the relevant knowledge. However, extracted n-grams and phrases usually contain a lot of noise. This paper proposes a method for reducing the noise in n-grams. The method first extracts more specific features (terms) to remove noisy features. It then uses an extended random set to accurately weight n-grams based on their distribution in the documents and the distribution of their terms within the n-grams. The proposed approach not only reduces the number of extracted n-grams but also improves performance. Experimental results on the Reuters Corpus Volume 1 (RCV1) data collection and TREC topics show that the proposed method significantly outperforms state-of-the-art methods underpinned by Okapi BM25, tf*idf and Rocchio.
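
The two-stage idea in the abstract (select specific terms first, then weight n-grams by their distribution across documents and by the terms they contain) can be illustrated with a rough sketch. This is not the paper's extended random set model, which the abstract does not specify. The function names, the `term_weights` input, and the scoring formula (document fraction multiplied by the sum of term weights) are all assumptions chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weight_ngrams(docs, term_weights, n=2):
    """Score n-grams built only from previously selected terms.

    docs: list of token lists (assumed to be the relevant documents).
    term_weights: dict mapping each selected term to a weight (assumed given).
    Assumed score: (fraction of documents containing the n-gram)
                   * (sum of the weights of its terms).
    """
    doc_freq = Counter()
    for tokens in docs:
        # Count each n-gram at most once per document.
        doc_freq.update(set(ngrams(tokens, n)))

    scores = {}
    for gram, df in doc_freq.items():
        # Discard n-grams that contain any term outside the selected set.
        if all(term in term_weights for term in gram):
            scores[gram] = (df / len(docs)) * sum(term_weights[t] for t in gram)
    return scores


# Hypothetical toy data, not taken from RCV1.
docs = [
    ["oil", "price", "rise"],
    ["oil", "price", "fall"],
    ["the", "oil", "market"],
]
term_weights = {"oil": 2.0, "price": 1.0, "market": 1.0}
scores = weight_ngrams(docs, term_weights)
```

In this toy run only the bigrams made entirely of selected terms survive, so filtering by terms both shrinks the candidate set and removes n-grams such as `("the", "oil")` that carry a non-selected word.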