Indexing and stemming approaches for the Czech language

Ljiljana Dolamic; Jacques Savoy

Abstract

This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we designed and evaluated two stemming approaches, a light one and a more aggressive one. We compared them with a no-stemming scheme as well as with a language-independent approach (n-gram). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and the classical tf idf vector-space approach. We found that the Divergence from Randomness paradigm tends to yield better retrieval effectiveness than the Okapi, LM or tf idf models; the performance differences, however, were statistically significant only with the last two IR approaches. Omitting stemming generally reduces the mean average precision (MAP) by more than 40%, and these differences are always significant. Finally, although our more aggressive stemmer tends to show the best performance, the differences in performance with the light stemmer are not statistically significant.
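As a rough illustration of two notions the abstract relies on, the sketch below shows one common way to split a word into overlapping character n-grams (the language-independent indexing alternative) and one standard way to compute mean average precision (MAP) over a set of queries. It is not the authors' implementation: the function names, the n-gram length of 4, and the toy document identifiers are all assumptions made for the example.

```python
from typing import Dict, List, Set


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Split a word into overlapping character n-grams.

    The default n=4 is an assumption for illustration only.
    Words no longer than n are returned unchanged as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list.

    Precision is sampled at the rank of each relevant document that is
    retrieved, and the sum is divided by the total number of relevant
    documents, so relevant documents that are never retrieved count as 0.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    scores = [
        average_precision(runs.get(query, []), relevant)
        for query, relevant in qrels.items()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data, invented for the example; not taken from the paper.
    print(char_ngrams("praha"))
    runs = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
    print(mean_average_precision(runs, qrels))
```

Comparing two indexing strategies then amounts to computing MAP for each run over the same relevance judgments and looking at the relative difference, which is how a statement such as "reduces the MAP by more than 40%" is usually read.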


Bibliographic details

Source
Information Processing & Management, 2009, Issue 6, pp. 714-720 (7 pages)
Authors
Ljiljana Dolamic; Jacques Savoy
Author affiliations

Computer Science Department, University of Neuchatel, 2009 Neuchatel, Switzerland (both authors)

Original format: PDF
Language: English
Keywords
Czech language; stemming; evaluation; Slavic languages

