Published in: Procedia Computer Science

Word Embedding based Textual Semantic Similarity Measure in Bengali


Abstract

Textual semantic similarity is a crucial component of many NLP tasks, such as information retrieval, machine translation, and textual forgery detection. Rule-based techniques struggle to measure semantic similarity in low-resource languages because of their complex morphological structure and the scarcity of linguistic resources. This paper investigates several word embedding techniques (Word2Vec, GloVe, FastText) for estimating the semantic similarity of Bengali sentences. Because no standard dataset is available, this work developed a Bengali dataset containing 187,031 text documents with 400,824 unique words. Moreover, this work computes the similarity between word vectors using cosine similarity under three weighting schemes: no weighting, term frequency weighting, and Part-of-Speech weighting. The performance of the proposed approach is evaluated on a developed dataset containing 50 pairs of Bengali sentences. The evaluation results show that FastText with continuous bag-of-words and a vector size of 100 achieved the highest Pearson's correlation (ρ) score of 77.28%.
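The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how sentence vectors are built or how the weights are defined, so the weighted averaging, the per-sentence term frequency weights, the toy two-dimensional embeddings, and all function names below are assumptions chosen for illustration. In practice the embeddings would come from a trained Word2Vec, GloVe, or FastText model.

```python
import math
from collections import Counter


def sentence_vector(tokens, embeddings, weights=None):
    """Weighted average of word vectors; tokens without an embedding are skipped.

    `weights` maps token -> weight (assumed scheme); None means unweighted.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in embeddings:
            continue
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be embedded
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_frequency_weights(tokens):
    """Relative frequency of each token within one sentence (assumed TF scheme)."""
    counts = Counter(tokens)
    return {tok: c / len(tokens) for tok, c in counts.items()}


def pearson(xs, ys):
    """Pearson correlation between model scores and human similarity ratings."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy embeddings; English placeholders stand in for Bengali tokens.
toy_embeddings = {"river": [1.0, 0.0], "bank": [0.8, 0.6], "money": [0.0, 1.0]}
s1 = ["river", "bank"]
s2 = ["money", "bank"]

unweighted = cosine(sentence_vector(s1, toy_embeddings),
                    sentence_vector(s2, toy_embeddings))
tf_weighted = cosine(
    sentence_vector(s1, toy_embeddings, term_frequency_weights(s1)),
    sentence_vector(s2, toy_embeddings, term_frequency_weights(s2)),
)
print(unweighted, tf_weighted)
```

A Part-of-Speech weighting would fit the same `weights` argument by mapping each token to a weight chosen from its tag. To evaluate, one would collect the similarity scores for all sentence pairs and pass them to `pearson` together with the human-annotated ratings.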
