Journal: Information Processing & Management

Indexing and stemming approaches for the Czech language



Abstract

This paper describes and evaluates various stemming and indexing strategies for the Czech language. Using a Czech test collection, we designed and evaluated two stemming approaches, a light one and a more aggressive one. We compared them with a no-stemming scheme as well as with a language-independent approach (n-gram). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and the classical tf idf vector-space approach. We found that the Divergence from Randomness paradigm tends to yield better retrieval effectiveness than the Okapi, LM or tf idf models; the performance differences were, however, statistically significant only with the last two IR approaches (LM and tf idf). Omitting stemming generally reduces mean average precision (MAP) by more than 40%, and these differences are always significant. Finally, although our more aggressive stemmer tends to show the best performance, its differences in performance from the light stemmer are not statistically significant.
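As a rough illustration of two notions the abstract relies on, the sketch below shows how a word can be split into overlapping character n-grams (the language-independent indexing approach) and how MAP is computed from ranked result lists. The function names, the n-gram length of 4, and the toy document identifiers are assumptions made for this example; the abstract does not state the n-gram length or the evaluation tooling the authors used.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Split a word into overlapping character n-grams.

    n = 4 is an assumed value for illustration only.
    Words no longer than n are kept whole.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper's test collection.
    print(char_ngrams("praha"))
    print(mean_average_precision([
        (["d1", "d2"], {"d1"}),
        (["d2", "d1"], {"d1"}),
    ]))
```

With a helper like this, the relative MAP change between two indexing strategies (for example stemmed versus unstemmed) can be expressed as `(map_a - map_b) / map_a`, which is the kind of percentage difference the abstract reports.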
