
A Fast Corpus-Based Stemmer



Abstract

Stemming is a word-form normalization technique that maps variant word forms to their common root. In an Information Retrieval system, it is used to improve performance, specifically recall and, ideally, precision. Although its usefulness has been shown to be mixed for languages such as English, for morphologically complex languages stemming produces a significant performance improvement. A number of linguistic rule-based stemmers, which apply a set of rules to recover the root word from its variants, are available for most European languages. But for Indian languages, which are highly inflectional in nature, devising a linguistic rule-based stemmer requires additional resources that are not available. We present an approach that is purely corpus-based and finds the equivalence classes of variant words in an unsupervised manner. A set of experiments on four languages using FIRE, CLEF, and TREC test collections shows that our approach provides results comparable to those of linguistic rule-based stemmers for some languages and gives a significant performance improvement for resource-constrained languages such as Bengali and Marathi.
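To make the idea of "equivalence classes of variant words" concrete, here is a minimal Python sketch. It is not the method of the paper, which the abstract does not describe. It assumes a simple swapped-in heuristic: words that share a fixed-length prefix are placed in the same class, and the shortest member is taken as the class representative. The function names `build_classes` and `stem_map` and the `prefix_len` parameter are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_classes(vocabulary, prefix_len=4):
    """Group words that share their first `prefix_len` characters.

    Assumption for illustration only: a shared fixed-length prefix is
    treated as evidence that two words are variants of one root.
    Words shorter than `prefix_len` form their own class.
    """
    classes = defaultdict(set)
    for word in vocabulary:
        key = word[:prefix_len] if len(word) >= prefix_len else word
        classes[key].add(word)
    return classes


def stem_map(vocabulary, prefix_len=4):
    """Map every word to the shortest member of its equivalence class."""
    mapping = {}
    for members in build_classes(vocabulary, prefix_len).values():
        # Ties on length are broken alphabetically for determinism.
        root = min(members, key=lambda w: (len(w), w))
        for word in members:
            mapping[word] = root
    return mapping


if __name__ == "__main__":
    vocab = ["connect", "connected", "connecting", "connection",
             "retrieval", "retrieve", "retrieved"]
    for word, root in sorted(stem_map(vocab).items()):
        print(word, "->", root)
```

A prefix heuristic like this needs no dictionary or hand-written rules, which is the appeal of corpus-based approaches for resource-constrained languages, but it will both over-merge and under-merge words. A practical system would refine the classes using corpus statistics.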
