Journal: Journal of Logic and Computation

A corpus-based finite-state morphological toolkit for contemporary Arabic


Abstract

We develop an open-source large-scale finite-state morphological processing toolkit (AraComLex) for Modern Standard Arabic (MSA) distributed under the GPLv3 license. The morphological transducer is based on a lexical database specifically constructed for this purpose. In contrast to previous resources, the database is tuned to MSA, eliminating lexical entries no longer attested in contemporary use. The database is built using a corpus of 1,089,111,204 word tokens, a pre-annotation tool, machine learning techniques and knowledge-based pattern matching to automatically acquire lexical knowledge. Our morphological transducer is evaluated and compared to LDC's SAMA (Standard Arabic Morphological Analyser). We also develop a finite-state morphological guesser as part of a methodology for extracting unknown word forms, lemmatizing them, and giving them a priority weight for inclusion in the lexicon.