Conference: 3rd Workshop on Representation Learning for NLP 2018

Corpus specificity in LSA and word2vec: the role of out-of-domain documents


Abstract

Despite the popularity of word embeddings, the precise way by which they acquire semantic relations between words remains unclear. In the present article, we investigate whether the capacity of LSA and word2vec to identify relevant semantic relations increases with corpus size. One intuitive hypothesis is that the capacity to identify relevant associations should increase as the amount of data increases. However, if the corpus grows in topics that are not specific to the domain of interest, the signal-to-noise ratio may weaken. Here we investigate the effect of corpus specificity and size on word embeddings, and to this end we study two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of documents unrelated to a specific task. We show that word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus. In contrast, the specialization (removal of out-of-domain documents) of the training corpus, accompanied by a decrease in dimensionality, can increase LSA word-representation quality while speeding up the processing time. From a cognitive-modeling point of view, we point out that LSA's word-knowledge acquisition may not be efficiently exploiting higher-order co-occurrences and global relations, whereas word2vec does.
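The two elimination regimes contrasted in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how documents are judged "unrelated to a specific task", so the criterion used here (a document is out-of-domain if it shares no word with a task vocabulary) is an assumption, and the function names `remove_random` and `remove_out_of_domain` and the toy corpus are hypothetical.

```python
import random


def remove_random(docs, n_remove, seed=0):
    """Drop n_remove documents chosen uniformly at random."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(docs)), len(docs) - n_remove))
    return [d for i, d in enumerate(docs) if i in keep]


def remove_out_of_domain(docs, task_words, n_remove):
    """Drop up to n_remove documents that share no word with task_words.

    ASSUMPTION: "unrelated to the task" is approximated by having no
    overlap with the task vocabulary; the paper's criterion may differ.
    """
    task_words = set(task_words)
    unrelated = [
        i for i, d in enumerate(docs)
        if task_words.isdisjoint(d.lower().split())
    ]
    drop = set(unrelated[:n_remove])
    return [d for i, d in enumerate(docs) if i not in drop]


# Toy corpus: two in-domain (animal) and two out-of-domain (finance) documents.
docs = [
    "the cat chased the mouse",
    "stock prices fell sharply today",
    "a dog barked at the cat",
    "the central bank raised interest rates",
]
task_words = {"cat", "dog", "mouse"}

random_subset = remove_random(docs, n_remove=2)
domain_subset = remove_out_of_domain(docs, task_words, n_remove=2)
```

Both subsets have the same size, but only `domain_subset` is guaranteed to retain the task-related documents. Training an embedding model on each subset and comparing the results on the task would then separate the effect of corpus size from that of corpus specificity.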
