Efficient Weighted Semantic Score Based on the Huffman Coding Algorithm and Knowledge Bases for Word Sequences Embedding

Ben-Lhachemi Nada; Nfaoui El Habib

Abstract

Learning text representations is at the core of numerous natural language processing applications. Word embedding is a type of text representation in which words with similar meanings have similar representations. Word embedding techniques categorize semantic similarities between linguistic items based on their distributional properties in large samples of text data. Although these techniques are very efficient, handling semantic and pragmatic ambiguity with high accuracy remains a challenging research task. In this article, we propose a new feature, a semantic score, that handles ambiguities between words. We use external knowledge bases and the Huffman coding algorithm to compute this score, which captures the semantic relatedness between all fragments composing a given text. We combine this feature with word embedding methods to improve text representation. We evaluate our method on a hashtag recommendation system for Twitter, where text is noisy and short. The experimental results show that our method achieves good results compared with state-of-the-art algorithms.
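For readers unfamiliar with the Huffman coding algorithm named in the abstract, the following is a minimal sketch of the standard algorithm only. The abstract does not say how the article derives the input weights from knowledge bases, how the resulting code is turned into a semantic score, or how that score is combined with word embeddings, so none of that is reproduced here. The function name `huffman_code_lengths` and the example weights are hypothetical and chosen purely for illustration.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} for a Huffman code over `weights`.

    `weights` maps each symbol to a positive weight. Here the weights are
    hypothetical counts; the article's weights are not given in the abstract.
    """
    if not weights:
        return {}
    if len(weights) == 1:
        # A single symbol still needs one bit to be encoded.
        return {next(iter(weights)): 1}

    tie = count()  # tie-breaker so the heap never has to compare symbol lists
    heap = [(w, next(tie), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in weights}

    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside them
        # moves one level deeper, so its code grows by one bit.
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        merged = syms1 + syms2
        for sym in merged:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), merged))

    return lengths


# Hypothetical weights: heavier symbols receive shorter codes.
print(huffman_code_lengths({"a": 5, "b": 2, "c": 1, "d": 1}))
# {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

The property the sketch illustrates is that items with larger weights end up with shorter codes. Whether the article uses code lengths, the codes themselves, or the tree structure for its score cannot be determined from the abstract.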


Bibliographic record

Source: International journal on Semantic Web and information systems, 2020, Issue 2, 17 pages
Authors: Ben-Lhachemi Nada; Nfaoui El Habib
Affiliations: Sidi Mohamed Ben Abdellah Univ Fes Morocco; Sidi Mohamed Ben Abdellah Univ LIIAN Lab Fes Morocco
Original format: PDF
Language: English
Chinese Library Classification: Computer networks
Keywords: DBpedia; Huffman Algorithm; Knowledge Bases; Semantic Representation; Word Embedding
