International Conference on Web Research

Designing a Deep Neural Network Model for Finding Semantic Similarity Between Short Persian Texts Using a Parallel Corpus

Abstract

Text processing, one of the main research areas in artificial intelligence, has received a great deal of attention in recent decades. Numerous methods and algorithms have been proposed for semantic textual similarity, which is one of the sub-branches of text processing. Because of the special features of the Persian language and its non-standard writing system, finding semantic similarity is an even more challenging task in Persian. In addition, producing a proper corpus that can be used to train a model for finding semantic similarity is of great importance. The main purpose of this study is to propose a method for measuring the semantic similarity between short Persian texts. To do so, we first build an appropriate corpus and then propose an efficient approach based on neural networks. The proposed method involves three steps. The first step is data collection and the construction of a parallel corpus. In the second step, the pre-processing step, the data is normalized. Finally, semantic similarity recognition is performed by the neural network using vector representations of the words. The suggested model is built upon the produced corpus, which is made of movie and TV show subtitles and contains 35,266 sentence pairs. The F-measure of the proposed approach on PAN2016 is 75.98% with 4 tags and 98.87% with 2 tags. We also achieved an F-measure of 98.86% for our model tested on the parallel corpus with 2 tags.
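
The abstract does not say which normalization rules the pre-processing step applies. As a rough illustration only, the following Python sketch shows one common kind of Persian normalization: mapping Arabic code points to their Persian counterparts, removing diacritics and tatweel, and collapsing whitespace. The function name `normalize_persian` and the specific character mappings are assumptions made for this example, not the authors' method.

```python
import re

# Assumed character mappings (not taken from the paper): Arabic code points
# that often appear in Persian text, mapped to their Persian counterparts.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06CC",  # Arabic alef maksura -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
    "\u0640": None,      # tatweel (kashida) is deleted
})

# Arabic short-vowel marks and related signs, fathatan (U+064B) to sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")
WHITESPACE = re.compile(r"\s+")


def normalize_persian(text: str) -> str:
    """Return a normalized copy of `text` (hypothetical helper for illustration)."""
    text = text.translate(CHAR_MAP)        # unify letter variants, drop tatweel
    text = DIACRITICS.sub("", text)        # strip diacritics
    return WHITESPACE.sub(" ", text).strip()  # collapse runs of whitespace
```

Applying the same function to both sentences of a pair before looking up word vectors means that spelling variants of one word are more likely to resolve to the same vector. A real pipeline would probably also need rules for the zero-width non-joiner (U+200C) and for digits, which this sketch leaves untouched.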