Conference: International joint conference on computational intelligence

Evaluating Methods for Building Arabic Semantic Resources with Big Corpora

Abstract

This paper presents detailed data on the workings of a system, introduced in previous work [1], that extracts semantic clusters from a large general Arabic corpus, and proposes a basis for evaluating it more rigorously with Arabic WordNet. In the first experiments, using an evaluation corpus of about 8 million words and GraPaVec, a word vectorization method based on automatically generated frequency patterns, our system clustered word vectors with a Self-Organizing Map neural network model and evaluated the clusters against existing Arabic WordNet synsets. We compared the results with the state-of-the-art Word2Vec and GloVe methods. Because our results were astonishingly high and lacked a clear explanation, we present here a more thorough testing protocol: evaluation on a much larger corpus (1.4 billion words), more refined measures, including a refined definition of multiclass recall and precision, a better account of the specifics of WordNet classification, and the use of NLTK tools. Observations on the corpus are given to help researchers interested in our approach assess methods of implementation and evaluation.
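
To make the idea of scoring clusters against synsets concrete, here is a minimal sketch under stated assumptions. It matches each cluster to the gold synset it overlaps most and reports a precision and recall for that pair. This best-overlap convention is only one common choice and is not the refined multiclass definition the paper introduces, which the abstract does not spell out. The function names `cluster_precision_recall` and `macro_average`, the placeholder words (`w1`, `w2`, ...) and the synset ids (`s1`, `s2`) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def cluster_precision_recall(
    clusters: List[Set[str]],
    synsets: Dict[str, Set[str]],
) -> List[Tuple[str, float, float]]:
    """Return (synset_id, precision, recall) for each non-empty cluster.

    Assumption: a cluster is scored against the single gold synset that
    shares the most words with it (illustrative, not the paper's measure).
    """
    results = []
    for cluster in clusters:
        if not cluster:
            continue
        # Find the gold synset with the largest word overlap.
        best_id, best_overlap = "", 0
        for syn_id, members in synsets.items():
            overlap = len(cluster & members)
            if overlap > best_overlap:
                best_id, best_overlap = syn_id, overlap
        if best_overlap == 0:
            # No gold synset shares a word with this cluster.
            results.append(("", 0.0, 0.0))
            continue
        precision = best_overlap / len(cluster)
        recall = best_overlap / len(synsets[best_id])
        results.append((best_id, precision, recall))
    return results


def macro_average(results: List[Tuple[str, float, float]]) -> Tuple[float, float]:
    """Unweighted mean of precision and recall over all scored clusters."""
    if not results:
        return 0.0, 0.0
    n = len(results)
    return (
        sum(p for _, p, _ in results) / n,
        sum(r for _, _, r in results) / n,
    )


# Toy data with placeholder tokens (not taken from Arabic WordNet).
clusters = [{"w1", "w2", "w3"}, {"w4", "w5"}]
synsets = {"s1": {"w1", "w2"}, "s2": {"w4", "w5", "w6"}}

results = cluster_precision_recall(clusters, synsets)
macro_p, macro_r = macro_average(results)
print(results)
print(macro_p, macro_r)
```

In a real evaluation the gold sets would come from Arabic WordNet rather than toy data. The choice between macro and size-weighted averaging also changes the reported figures, so it should be stated alongside any result.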
