Published in: Journal of Information Science and Engineering

A Text Categorization Method using Extended Vector Space Model by Frequent Term Sets

Abstract

Text categorization is one of the most important research topics in Natural Language Processing and Information Retrieval, owing to the ever-increasing volume of electronic documents. This paper presents a new text categorization method using frequent term sets. A novel constraint measure, AD-Sup, is introduced to extract discriminative features from frequent term sets for the classification task. Text documents are then represented in a global feature space that contains both single terms and frequent term sets. To address the sparse instance problem, a term weighting strategy assigns estimated weights based on feature similarity, which substantially reduces the sparsity rate. Through extensive experiments, the optimal proportion of single-term features to frequent term set features is determined empirically. Classification results on the Reuters-21578 and WebKB corpora demonstrate that the AD-Sup constraint is effective at extracting useful frequent features, and that the combination strategy builds a better feature space and improves the SVM classifier.
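
The abstract does not define AD-Sup or the similarity-based weighting, so neither is reproduced here. As a rough illustration of the general idea only (mining frequent term sets and appending them to the single-term features), the following minimal sketch uses a plain minimum-support threshold in the Apriori style. The function names, the toy documents, and the binary encoding of term-set features are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Level-wise (Apriori-style) mining of term sets that occur in at
    least `min_support` documents. Returns {frozenset(terms): support}.
    Plain document support is used here; AD-Sup is NOT implemented."""
    doc_sets = [set(d) for d in docs]
    counts = Counter(t for d in doc_sets for t in d)
    current = {frozenset([t]): c for t, c in counts.items() if c >= min_support}
    result = dict(current)
    size = 1
    while current and size < max_size:
        size += 1
        frequent_terms = sorted({t for s in current for t in s})
        candidates = Counter()
        for d in doc_sets:
            kept = [t for t in frequent_terms if t in d]
            for combo in combinations(kept, size):
                # Apriori pruning: every (size - 1)-subset must be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(combo, size - 1)):
                    candidates[frozenset(combo)] += 1
        current = {s: c for s, c in candidates.items() if c >= min_support}
        result.update(current)
    return result


def extended_vector(doc, single_terms, term_sets):
    """Concatenate term-frequency features for single terms with binary
    features for term sets (1 if all terms of the set occur in the doc)."""
    tokens = Counter(doc)
    present = set(doc)
    single_part = [tokens[t] for t in single_terms]
    set_part = [1 if s <= present else 0 for s in term_sets]
    return single_part + set_part


# Toy corpus (hypothetical), already tokenized.
docs = [
    ["oil", "price", "rise"],
    ["oil", "price", "fall"],
    ["wheat", "price", "rise"],
]
mined = frequent_term_sets(docs, min_support=2)
single_terms = sorted(t for s in mined if len(s) == 1 for t in s)
term_sets = sorted((s for s in mined if len(s) > 1), key=sorted)
vectors = [extended_vector(d, single_terms, term_sets) for d in docs]
```

The resulting vectors could be passed to any SVM implementation. How many term-set features to keep relative to single terms is the proportion the abstract says was tuned empirically, and it is left open in this sketch.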

Bibliographic details

  • Source
  • Author affiliations

    School of Computer Science and Technology, Beihang University, Beijing 100191, P.R. China; Research Institute of Beihang University in Shenzhen, VU Park, High-tech Industrial Estate, Shenzhen 518057, P.R. China

  • Indexing information
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords

    text categorization; text representation; frequent term sets; Apriori; SVM
