Journal of the American Society for Information Science and Technology

Information Retrieval From Historical Newspaper Collections in Highly Inflectional Languages: A Query Expansion Approach

Abstract

The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich in compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. The top index word variants were categorized to estimate appropriate query expansion ranges for the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations was measured in a Cranfield-style test. Finally, a detailed topic-level analysis of the test results was conducted. In the index of a historical newspaper collection, the occurrences of a word are typically spread across many linguistic and historical variants as well as optical character recognition (OCR) errors. All query expansion methods improved on the baseline results. Extensive expansion, around 30 variants for each query word, was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
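
To make the idea of query expansion by approximate string matching concrete, the sketch below ranks index words by character-bigram Dice similarity to a query term and keeps the top `k` as expansion variants. The abstract does not say which matching methods the study used, so the similarity measure, the function names, and the tiny index (one query word, one inflected form, one invented OCR-style misspelling, and two unrelated words) are all assumptions made for illustration. Only the default of roughly 30 variants per query word comes from the abstract.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Return the multiset of character n-grams of a padded, lowercased word."""
    pad = "_" * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram multisets (1.0 means identical)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    overlap = sum((grams_a & grams_b).values())  # multiset intersection size
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * overlap / total if total else 0.0


def expand_query_term(term, index_words, k=30, n=2):
    """Return the k index words most similar to the query term.

    k=30 mirrors the expansion size reported in the abstract; the ranking
    method itself is an assumption, not the study's documented procedure.
    Ties are broken alphabetically so the output is deterministic.
    """
    ranked = sorted(index_words, key=lambda w: (-dice_similarity(term, w, n), w))
    return ranked[:k]


# Hypothetical toy index, not data from the study.
toy_index = ["sanomalehden", "sanomalehtl", "kirkko", "sanomalehti", "hevonen"]
variants = expand_query_term("sanomalehti", toy_index, k=3)
print(variants)
```

In a retrieval setting, the returned variants would be combined with the original term, for example as a disjunction, so that documents containing inflected, historical, or OCR-damaged spellings can still match.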
