Conference: Annual Meeting of the Association for Computational Linguistics

Towards Unsupervised Text Classification Leveraging Experts and Word Embeddings


Abstract

Text classification aims at mapping documents into a set of predefined categories. Supervised machine learning models have shown great success in this area, but they require a large number of labeled documents to reach adequate accuracy. This is particularly true when the number of target categories is in the tens or the hundreds. In this work, we explore an unsupervised approach to classify documents into categories simply described by a label. The proposed method is inspired by the way a human proceeds in this situation: it draws on textual similarity between the most relevant words in each document and a dictionary of keywords for each category reflecting its semantics and lexical field. The novelty of our method hinges on the enrichment of the category labels through a combination of human expertise and language models, both generic and domain specific. Our experiments on five standard corpora show that the proposed method increases F1-score over relying solely on human expertise and can also be on par with simple supervised approaches. It thus provides a practical alternative for situations where low-cost text categorization is needed, as we illustrate with our application to the classification of operational risk incidents.
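
The idea in the abstract (enrich each category's keyword dictionary, then compare a document's words with it) can be illustrated with a minimal sketch. This is not the authors' implementation. The 3-dimensional vectors, the category names, the `enrich` and `classify` helpers, and the 0.95 threshold are all hypothetical and invented for illustration. The sketch also averages every known token in the document rather than selecting the most relevant words, and it uses a hand-written vector table where a real setup would use trained word embeddings.

```python
import math

# Toy 3-d "embeddings", invented purely for illustration (not real model output).
TOY_VECTORS = {
    "fraud": (0.9, 0.1, 0.0),
    "theft": (0.8, 0.2, 0.0),
    "scam": (0.85, 0.15, 0.05),
    "outage": (0.0, 0.9, 0.1),
    "server": (0.1, 0.8, 0.2),
    "crash": (0.05, 0.85, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def enrich(seed_keywords, vectors, threshold=0.95):
    """Extend expert-provided seed keywords with close neighbours in the vector table."""
    enriched = set(seed_keywords)
    for seed in seed_keywords:
        if seed not in vectors:
            continue
        for word, vec in vectors.items():
            if cosine(vectors[seed], vec) >= threshold:
                enriched.add(word)
    return enriched


def mean_vector(words, vectors):
    """Average the vectors of the words found in the table; None if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))


def classify(document, category_keywords, vectors):
    """Return the category whose keyword dictionary is most similar to the document."""
    doc_vec = mean_vector(document.lower().split(), vectors)
    if doc_vec is None:
        return None  # no known word in the document, so no decision is made
    scores = {}
    for label, keywords in category_keywords.items():
        cat_vec = mean_vector(keywords, vectors)
        scores[label] = cosine(doc_vec, cat_vec) if cat_vec is not None else 0.0
    return max(scores, key=scores.get)


# Hypothetical categories, each described by a single expert-chosen seed word.
CATEGORIES = {
    "external_fraud": enrich(["fraud"], TOY_VECTORS),
    "it_failure": enrich(["outage"], TOY_VECTORS),
}

print(classify("the server had a crash", CATEGORIES, TOY_VECTORS))
print(classify("customer reported a scam", CATEGORIES, TOY_VECTORS))
```

No labeled documents are used at any point: the only supervision is the seed keyword attached to each category label, which is the property the abstract emphasizes.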
