Journal of the American Society for Information Science and Technology

Indexing and Searching Strategies for the Russian Language

Abstract

This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information retrieval (IR) models, including the Okapi, the Divergence from Randomness (DFR), a statistical language model (LM), as well as two vector-space approaches, namely, the classical tf-idf scheme and the dtu-dtn model. We find that the vector-space dtu-dtn and the DFR models tend to result in better retrieval effectiveness than the Okapi, LM, or tf-idf models, while only the latter two IR approaches result in statistically significant performance differences. Ignoring stemming generally reduces the MAP by more than 50%, and these differences are always significant. When applying an n-gram approach, performance differences are usually lower than an approach involving stemming. Finally, our light stemmer tends to perform best, although performance differences between the light, aggressive, and Snowball stemmers are not statistically significant.
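The abstract reports effectiveness as MAP (mean average precision). As a reminder of what that number measures, here is a minimal sketch of the standard definition. The function names, document identifiers, and relevance judgments are hypothetical and are not taken from the paper, and the abstract does not say which evaluation tool the authors used.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision for one query.

    Sums precision@k at every rank k where a relevant document appears,
    then divides by the total number of relevant documents, so relevant
    documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Hypothetical toy data: two queries with made-up rankings and judgments.
example_runs = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d5", "d6"],
}
example_qrels = {
    "q1": {"d1", "d3"},
    "q2": {"d6"},
}
example_map = mean_average_precision(example_runs, example_qrels)
```

In this toy data, q1 has relevant documents at ranks 1 and 3, giving (1/1 + 2/3) / 2, and q2 has its single relevant document at rank 2, giving 1/2. MAP is the mean of those two values. A drop of more than 50% in MAP, as the abstract reports for runs without stemming, means this averaged value is more than halved.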