Journal: Intelligent Data Analysis

On the use of word embedding for cross language plagiarism detection


Abstract

Cross language plagiarism is the unacknowledged reuse of text across language pairs. It occurs when a passage of text is translated from a source language into a target language and no proper citation is provided. Although various methods have been developed for detecting cross language plagiarism, less attention has been paid to measuring and comparing their performance, especially when dealing with different types of paraphrasing through translation. In this paper, we investigate various approaches to cross language plagiarism detection. Moreover, we present a novel approach to cross language plagiarism detection using word embedding methods and compare its performance against other state-of-the-art plagiarism detection algorithms. To evaluate the methods, we have constructed an English-Persian bilingual plagiarism detection corpus (referred to as HAMTA-CL) comprising seven types of obfuscation. The results show that the word embedding approach outperforms the other approaches with respect to recall when encountering heavily paraphrased passages. On the other hand, the translation-based approach performs well when precision is the main consideration of the cross language plagiarism detection system.
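The abstract does not spell out how the word embedding approach scores passage pairs, so the following is only a minimal sketch of one common way such a comparison could work, not the authors' method. It assumes a shared English-Persian embedding space, represents each passage by the average of its word vectors, and flags a pair when the cosine similarity exceeds a threshold. The toy vectors, the names `passage_vector`, `cosine` and `is_candidate`, and the threshold of 0.9 are all hypothetical choices made for illustration.

```python
import math

# ASSUMPTION: toy 3-dimensional vectors standing in for a shared
# English-Persian embedding space. A real system would load pretrained
# cross-lingual embeddings instead of this hand-written table.
EMBEDDINGS = {
    "book": [0.90, 0.10, 0.00],
    # Persian word for "book"
    "کتاب": [0.88, 0.12, 0.02],
    "read": [0.10, 0.90, 0.10],
    # Persian word for "to read"
    "خواندن": [0.12, 0.85, 0.10],
    "sky": [0.00, 0.10, 0.95],
    # Persian word for "sky"
    "آسمان": [0.02, 0.08, 0.90],
}


def passage_vector(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_candidate(source_tokens, suspicious_tokens, threshold=0.9):
    """Flag a passage pair whose averaged vectors are close in the shared space.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    u = passage_vector(source_tokens)
    v = passage_vector(suspicious_tokens)
    if u is None or v is None:
        return False
    return cosine(u, v) >= threshold
```

Under these assumptions, lowering the threshold would tend to flag more heavily paraphrased pairs at the cost of more false alarms, which mirrors the recall versus precision trade-off the abstract describes.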
