Published in: International Journal of Computer Processing of Oriental Languages

Comparison of Stemming and N-gram Matching for Term Conflation in Arabic Text

Abstract

In Arabic, a stem is a root verb form combined with derivational morphemes, with affixes removed. The first part of this paper repeats the Mixed Stemming experiment using more documents and queries. Mixed stemming chooses a word, a stem, or a root for a query term according to which form has the highest average inverse document frequency (IDF) value. Consistent results were obtained: the root method with weighting performed best, and mixed stemming improved binary-weighting search results in all cases but did not improve on weighted stems or weighted roots. N-gram matching is widely used for term conflation when searching the World Wide Web for a term without knowing whether it is stored as singular or plural, as a compound or as several words, in old or new spelling, or possibly misspelled. The second part of this paper compares stemming with N-gram matching. The Stem-NGram-Stem Search outperforms (at recall 0.4 and above) the Digrams method when neither the query words nor the document words are stemmed. The results also suggest that the Stem-NGram-Stem Search outperforms (at recall 0.4 and above) the Trigrams method (Stem-NGram-Word) when the query words are stemmed and the document words are not. They further suggest that the Trigrams method (Stem-NGram-Word) outperforms the Digrams method when neither the query words nor the document words are stemmed. All of the proposed N-gram methods outperform the Word, Stem, and Root index methods under the binary weighting scheme.
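The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: the Dice coefficient as the N-gram similarity measure, `log(N / df)` as the IDF formula, an IDF of zero for unseen terms, and the function names `ngrams`, `dice`, `idf`, and `pick_form`. The sample terms are Latin transliterations chosen only for readability.

```python
import math


def ngrams(term, n=2):
    """Return the set of contiguous character n-grams of a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over the n-gram sets of two terms.

    Assumption: the abstract does not name the similarity measure;
    Dice is used here only as a common choice for n-gram matching.
    """
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def idf(doc_freq, n_docs):
    """Inverse document frequency as log(N / df) (assumed formula)."""
    return math.log(n_docs / doc_freq) if doc_freq else 0.0


def pick_form(candidates, doc_freq, n_docs):
    """Pick the form (word, stem, or root) with the highest mean IDF.

    candidates maps a form name to the index terms that form yields;
    doc_freq maps an index term to its document frequency.
    """
    def mean_idf(terms):
        if not terms:
            return 0.0
        return sum(idf(doc_freq.get(t, 0), n_docs) for t in terms) / len(terms)

    return max(candidates, key=lambda form: mean_idf(candidates[form]))


# Hypothetical data: one query term in its three forms.
candidates = {"word": ["kitabuhum"], "stem": ["kitab"], "root": ["ktb"]}
doc_freq = {"kitabuhum": 2, "kitab": 10, "ktb": 50}

digram_score = dice("kitab", "kitabi", n=2)
trigram_score = dice("kitab", "kitabi", n=3)
chosen = pick_form(candidates, doc_freq, n_docs=100)
```

In this sketch, a query term and a document term would be conflated when their score exceeds some threshold, which is a tuning choice the abstract does not specify. Digrams correspond to `n=2` and trigrams to `n=3`; whether stemming is applied before N-gram extraction on the query side, the document side, or both is what distinguishes the variants the abstract compares.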
