Published in: Information Retrieval

Searching strategies for the Bulgarian language


Abstract

This paper reports on the underlying information retrieval (IR) problems encountered when indexing and searching in the Bulgarian language. For this language we propose a general light stemmer and demonstrate that it can be quite effective, producing a significantly better mean average precision (MAP), around +34%, than an approach that applies no stemming. We implement the GL2 model derived from the Divergence from Randomness paradigm and find its retrieval effectiveness better than that of other probabilistic, vector-space and language models. The resulting MAP is about 50% better than that of the classical tf-idf approach. Moreover, increasing the query size from title-only (T) to title-plus-description (TD) queries improves the MAP by around 10%. To assess the retrieval effectiveness of our suggested stopword list and of the light stemmer developed for the Bulgarian language, we conduct a set of experiments with another stopword list and with a more complex and aggressive stemmer. The results tend to indicate that there is no statistically significant difference between these variants and our suggested approach. This paper also evaluates other indexing strategies, such as 4-gram indexing and indexing based on the automatic decompounding of compound words. Finally, we analyze certain queries to discover why we obtained poor results when indexing Bulgarian documents using the suggested word-based approach.
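The percentages quoted in the abstract are relative changes in MAP between two retrieval runs. As a rough illustration of how such a figure is obtained, the sketch below computes average precision per query, averages it into MAP, and reports the relative change between a baseline and a variant. The document ids, query ids, rankings and relevance judgments are invented toy data, and the function names are hypothetical; none of them come from the paper, and the numbers the sketch produces say nothing about the paper's results.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against the relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


def relative_change(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy relevance judgments and rankings (assumed, purely for illustration).
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
no_stem_run = {"q1": ["d2", "d1", "d4", "d3"], "q2": ["d1", "d2"]}
stem_run = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1"]}

map_baseline = mean_average_precision(no_stem_run, qrels)
map_variant = mean_average_precision(stem_run, qrels)
print(f"baseline MAP: {map_baseline:.4f}")
print(f"variant MAP:  {map_variant:.4f}")
print(f"relative change: {relative_change(map_variant, map_baseline):+.1f}%")
```

A relative change computed this way does not by itself show that a difference is statistically significant; the abstract's significance claims rest on separate tests that this sketch does not attempt.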
