Journal: ACM Transactions on Asian Language Information Processing

Sub-Word Indexing and Blind Relevance Feedback for English, Bengali, Hindi, and Marathi IR


Abstract

The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this article: 1) how to create a simple, language-independent corpus-based stemmer, 2) how to identify sub-words and which types of sub-words are suitable as indexing units, and 3) how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and, when combined with query expansion, significantly better than the baseline of indexing words. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length.
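To make two of the sub-word units named above concrete, the following is a minimal sketch of overlapping character n-grams and k-prefixes. The function names (`overlapping_ngrams`, `k_prefix`, `subword_terms`), the whitespace tokenization, and the handling of words shorter than the unit size are assumptions made for illustration; the abstract does not specify how the article defines these details. The sketch also slices by Unicode code point, whereas Bengali, Hindi, and Marathi scripts often combine several code points into one visible character, so a real system may segment differently.

```python
def overlapping_ngrams(word: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of a word.

    Assumption: words no longer than n are kept whole as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_prefix(word: str, k: int) -> str:
    """Return the first k characters of a word (the whole word if shorter)."""
    return word[:k]


def subword_terms(text: str, unit: str, size: int) -> list[str]:
    """Turn whitespace-separated words into sub-word index terms.

    unit is "ngram" (overlapping n-grams) or "prefix" (k-prefixes);
    both labels are hypothetical names used only in this sketch.
    """
    terms = []
    for word in text.lower().split():
        if unit == "ngram":
            terms.extend(overlapping_ngrams(word, size))
        elif unit == "prefix":
            terms.append(k_prefix(word, size))
        else:
            raise ValueError(f"unknown unit: {unit}")
    return terms


if __name__ == "__main__":
    print(subword_terms("blind relevance feedback", "ngram", 3))
    print(subword_terms("blind relevance feedback", "prefix", 4))
```

With overlapping 3-grams, a nine-letter word such as "retrieval" becomes seven index terms of length three, which illustrates the effect described above: more index terms per word form, each shorter on average. A k-prefix, by contrast, keeps one term per word and acts as a crude stemmer.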
The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
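The feedback step can be sketched as follows, under stated assumptions. Blind relevance feedback treats the top-ranked documents of an initial retrieval run as relevant and draws expansion terms from them. The abstract gives the settings (10 documents, 20 terms) but not the term-weighting function or the exact formula that links the number of terms to the vocabulary ratio. The weighting below (feedback document frequency times collection idf) and the proportional scaling in `scaled_term_count` are therefore illustrative choices, not the article's method, and all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def feedback_terms(ranked_docs, doc_freq, num_docs, top_docs=10, num_terms=20):
    """Select expansion terms from the top-ranked documents.

    ranked_docs: documents in rank order, each a list of index terms.
    doc_freq:    term -> number of collection documents containing it.
    num_docs:    total number of documents in the collection.
    """
    feedback = ranked_docs[:top_docs]
    # In how many feedback documents does each term occur?
    fb_df = Counter(term for doc in feedback for term in set(doc))

    def score(term):
        # Illustrative weighting (an assumption, not taken from the article).
        return fb_df[term] * math.log(num_docs / doc_freq[term])

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(fb_df, key=lambda t: (-score(t), t))
    return ranked[:num_terms]


def scaled_term_count(base_terms, word_vocab_size, subword_vocab_size):
    """Hypothetical scaling of the feedback term count by the ratio of
    word vocabulary size to sub-word vocabulary size."""
    return max(1, round(base_terms * word_vocab_size / subword_vocab_size))


if __name__ == "__main__":
    docs = [["fire", "hindi", "ir"], ["fire", "bengali"], ["marathi", "ir"]]
    df = {"fire": 2, "hindi": 1, "ir": 50, "bengali": 1, "marathi": 1}
    print(feedback_terms(docs, df, num_docs=100, top_docs=2, num_terms=2))
    print(scaled_term_count(20, word_vocab_size=100_000, subword_vocab_size=50_000))
```

Under this reading, an indexing unit whose vocabulary is smaller than the word vocabulary receives more feedback terms than the fixed baseline, which is consistent with the finding that feedback on sub-words benefits from a larger number of terms.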
