Comparison of Stemming and N-gram Matching for Term Conflation in Arabic Text

Hani Abu-Salem

Abstract

A stem in Arabic is a root verb form combined with derivational morphemes but with affixes removed. The first part of this paper repeats the Mixed Stemming experiment, which chooses a word, a stem, or a root for each query term according to which form has the highest average inverse document frequency, this time using more documents and queries. The results were consistent with the earlier experiment. The weighted root method performed best. Mixed stemming improved binary-weighting search results in all cases but did not improve performance over weighted stems or roots.

N-gram matching is widely used for term conflation when searching the World Wide Web for a term without knowing whether it is stored as singular or plural, as a compound or as several words, in old or new spelling, or possibly misspelled. The second part of this paper reports a comparison of stemming and N-gram matching. The Stem-NGram-Stem search outperforms (at recall 0.4 and above) the Digrams method when neither the query words nor the document words are stemmed. The results also suggest that the Stem-NGram-Stem search outperforms (at recall 0.4 and above) the Trigrams method (Stem-NGram-Word), in which the query words are stemmed and the document words are not. The results further suggest that the Trigrams method (Stem-NGram-Word) outperforms the Digrams method when neither the query words nor the document words are stemmed. All of the proposed N-gram methods outperform the Word, Stem, and Root index methods under the binary weighting scheme.
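To make the idea of N-gram conflation concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not say which similarity measure or threshold was used, so the Dice coefficient over sets of character n-grams and the 0.3 cut-off are assumptions, and the names `char_ngrams`, `dice_similarity`, and `conflate` are hypothetical. The words are compared as given; in a Stem-NGram-Stem setup, both the query and the document words would first be passed through a stemmer, which is not shown here.

```python
def char_ngrams(word, n=2):
    """Return the set of contiguous character n-grams of a word.

    n=2 gives digrams, n=3 gives trigrams. Words shorter than n
    yield an empty set.
    """
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words.

    Assumption: the abstract does not name the similarity measure;
    Dice is a common choice for n-gram matching.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    shared = len(grams_a & grams_b)
    return 2 * shared / (len(grams_a) + len(grams_b))


def conflate(query_term, vocabulary, n=2, threshold=0.3):
    """Return the vocabulary words whose similarity to the query term
    meets the threshold (the threshold value is illustrative only)."""
    return [w for w in vocabulary
            if dice_similarity(query_term, w, n) >= threshold]


if __name__ == "__main__":
    # Unstemmed Arabic surface forms: book, books, library, pen.
    vocabulary = ["كتاب", "كتب", "مكتبة", "قلم"]
    for word in vocabulary:
        print(word, round(dice_similarity("كتاب", word, n=2), 3))
    print(conflate("كتاب", vocabulary, n=2, threshold=0.3))
```

Passing `n=3` switches the same functions from digrams to trigrams. Short Arabic words produce very few trigrams, so any threshold would need to be tuned separately for each value of `n`.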

Bibliographic details

Source
International Journal of Computer Processing of Oriental Languages, 2004, No. 2, pp. 61-81 (21 pages)
Author
Hani Abu-Salem
Affiliation
School of Computer Science, Telecommunications, and Information Systems, DePaul University, 243 S. Wabash Ave., Chicago, IL 60604, USA
Original format
PDF
Language
English
Chinese Library Classification
Computing technology, computer technology
Keywords
Arabic stemming; N-grams; Arabic information retrieval; conflation techniques

Similar literature

1. N-Gram: a Method of Conflating Terms An Approach to Text Categorization and Question Answering Systems in the Arabic language [J]. Riyad Al-Shalabi, Ghassan Kannan, Marwan S. Abualrub. International Journal of Applied Science & Computations, 2005, No. 2.
2. Evaluation of N-Gram Conflation Approaches for Arabic Text Retrieval [J]. Farag Ahmed, Andreas Nuernberger. Journal of the American Society for Information Science and Technology, 2009, No. 7.
3. Google N-Gram Viewer does not Include Arabic Corpus! Towards N-Gram Viewer for Arabic Corpus [J]. Alsmadi Izzat, Zarour Mohammad. The International Arab Journal of Information Technology, 2018, No. 5.
4. Stemming for Term Conflation in Malay Texts [C]. N. Idris, S.M.F.D Syed Mustapha. International Conference on Artificial Intelligence IC-AI'2001, Vol. 3, Jun 25-28, 2001, Las Vegas, Nevada, USA. 2001.
5. Optimization and effectiveness of N-grams approach for indexing and retrieval in Arabic information retrieval systems [D]. AlShehri, Abdullah Mohammed. 2002.
6. Tashkeela: Novel corpus of Arabic vocalized texts data for auto-diacritization systems [O]. Taha Zerrouki, Amar Balla. 2017.
7. Evaluation of N-Grams Conflation Approach in Text-Based Information Retrieval [O]. Serhiy Kosinov. 2001.
