Word Embedding based Textual Semantic Similarity Measure in Bengali

MD. Asif Iqbal; Omar Sharif; Mohammed Moshiul Hoque; Iqbal H. Sarkar

Abstract

Textual semantic similarity is a crucial component of many NLP tasks such as information retrieval, machine translation, and textual forgery detection. Rule-based techniques struggle to measure semantic similarity in low-resource languages because of their complex morphological structure and the scarcity of linguistic resources. This paper investigates several word embedding techniques (Word2Vec, GloVe, FastText) to estimate the semantic similarity of Bengali sentences. Because no standard dataset is available, this work developed a Bengali dataset containing 187,031 text documents with 400,824 unique words. Moreover, this work computes the similarity between word vectors using cosine similarity under three schemes: no weighting, term-frequency weighting, and part-of-speech weighting. The performance of the proposed approach is evaluated on the developed dataset containing 50 pairs of Bengali sentences. The evaluation shows that FastText with the continuous bag-of-words (CBOW) model and a vector size of 100 achieved the highest Pearson's correlation (ρ) score of 77.28%.
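The similarity computation and the evaluation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how word vectors are combined into a sentence score, so the weighted averaging below is an assumption, and the function names, toy vectors, weights, and human scores are all hypothetical placeholders. Passing no weights corresponds to the unweighted scheme; a dictionary of term-frequency or part-of-speech weights stands in for the two weighted schemes.

```python
import math


def sentence_vector(tokens, embeddings, weights=None):
    """Weighted average of the vectors of all in-vocabulary tokens.

    Assumption: sentences are represented by averaging word vectors.
    `weights` maps token -> weight (e.g. TF or POS based); None means
    every token has weight 1.0.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing was found
    return [x / weight_sum for x in total]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Hypothetical 3-dimensional toy vectors; a real run would load
    # trained Word2Vec, GloVe, or FastText vectors instead.
    toy_embeddings = {
        "river": [0.9, 0.1, 0.0],
        "water": [0.8, 0.2, 0.1],
        "bank": [0.5, 0.5, 0.2],
        "money": [0.1, 0.9, 0.3],
    }
    # Hypothetical sentence pairs with made-up human similarity ratings.
    pairs = [
        (["river", "water"], ["water", "bank"], 0.8),
        (["river", "water"], ["money", "bank"], 0.3),
        (["money"], ["money", "bank"], 0.9),
    ]
    system_scores = [
        cosine_similarity(
            sentence_vector(a, toy_embeddings),
            sentence_vector(b, toy_embeddings),
        )
        for a, b, _ in pairs
    ]
    human_scores = [h for _, _, h in pairs]
    print("system scores:", system_scores)
    print("Pearson correlation:", pearson(system_scores, human_scores))
```

Swapping the embedding table or the weight dictionary while keeping the rest fixed is one way to compare embedding models and weighting schemes against the same human ratings.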

Bibliographic details

Source: Procedia Computer Science, 2021, Issue a, 10 pages
Authors: MD. Asif Iqbal; Omar Sharif; Mohammed Moshiul Hoque; Iqbal H. Sarkar
Original format: PDF
Keywords: Natural language processing; Textual semantic similarity; Word embedding; Cosine similarity; Part-of-speech weighting
