Experiments in High-Dimensional Text Categorization

Abstract

We present results for automated text categorization of the Reuters-810000 collection of news stories. Our experiments use the entire one-year collection of 810,000 stories and the entire subject index. We divide the data into monthly groups and provide an initial benchmark of text categorization performance on the complete collection. Experimental results show that efficient sparse-feature implementations of linear methods and decision trees, using a global unstemmed dictionary, can readily handle applications of this size. Predictive performance is approximately as strong as the best results for the much smaller, older Reuters collections. Detailed results are provided over time periods. A smaller time horizon is shown not to diminish predictive quality, implying reduced demands for retraining when the sample size is large.
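
As a rough illustration of what a sparse-feature linear method over a global unstemmed dictionary can look like, here is a minimal sketch. It is not the authors' implementation: the perceptron update is swapped in as a simple stand-in for the linear methods the abstract mentions, and the function names (`build_dictionary`, `to_sparse`, `train_perceptron`) and the four toy documents are hypothetical.

```python
from collections import Counter


def build_dictionary(texts):
    """Map every distinct (unstemmed, lowercased) token to an integer index."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_sparse(text, vocab):
    """Represent a document as {feature_index: count}; zero entries are never stored."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    return {vocab[tok]: c for tok, c in counts.items()}


def score(weights, bias, x):
    """Linear score that touches only the nonzero features of x."""
    return bias + sum(weights.get(i, 0.0) * v for i, v in x.items())


def train_perceptron(examples, epochs=10):
    """examples: list of (sparse_vector, label) pairs with label in {+1, -1}."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in examples:
            if y * score(weights, bias, x) <= 0:  # misclassified or on the boundary
                for i, v in x.items():
                    weights[i] = weights.get(i, 0.0) + y * v
                bias += y
    return weights, bias


# Toy data (hypothetical): +1 = "energy" topic, -1 = "sports" topic.
docs = [
    ("oil prices rise sharply", 1),
    ("crude oil output falls", 1),
    ("team wins final match", -1),
    ("player scores late goal", -1),
]
vocab = build_dictionary(text for text, _ in docs)
train = [(to_sparse(text, vocab), label) for text, label in docs]
weights, bias = train_perceptron(train)
```

Because each document stores only its nonzero features, the cost of scoring or updating depends on the document's length rather than on the dictionary size, which is the property that makes a large global dictionary workable in a sketch like this.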