Cross-Comparison for Two-Dimensional Text Categorization

Abstract

The organization of large text collections is the main goal of automated text categorization. In particular, the final aim is to classify documents into a certain number of pre-defined categories efficiently and as accurately as possible. On-line and run-time services, such as personalization and information filtering services, have increased the importance of effective and efficient document categorization techniques. In recent years, a wide range of supervised learning algorithms has been applied to this problem. Recently, a new approach that exploits a two-dimensional summarization of the data for text classification was presented. This method does not go through a word-selection phase; instead, it uses the whole dictionary to present data in an intuitive way on two-dimensional graphs. Although successful in terms of classification effectiveness and efficiency (as recently shown in [3]), this method leaves some key issues unsolved: the design of the training algorithm seems to be ad hoc for the Reuters-21578 collection; the evaluation has been done only on the 10 most frequent classes of the Reuters-21578 dataset; most of the evaluation lacks measures of significance; and the method lacks a mathematical justification. We focus on the first three aspects, leaving the fourth as future work.