Near-Duplicate Mail Detection Based on URL Information for Spam Filtering


Abstract

Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayes, SVM, and k-NN, but only a few works use a strategy based on detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails collected as test data, our experimental results show that the proposed strategy outperforms the others. Without considering compulsory misses, over 97% of near-duplicate mails can be detected correctly.
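
The abstract does not specify how URL information is compared, so the following is only a minimal sketch of the general idea, not the authors' scheme. It assumes that the host names of the URLs in a mail body serve as the feature set and that Jaccard similarity with a fixed threshold decides near-duplication. The names `url_features`, `jaccard`, `is_near_duplicate`, and the threshold of 0.5 are hypothetical choices made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: only http(s) URLs are considered; the paper may use richer URL context.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_features(mail_body: str) -> set:
    """Return the set of lowercased host names of URLs found in a mail body."""
    hosts = set()
    for raw in URL_PATTERN.findall(mail_body):
        # Drop punctuation that commonly trails a URL in running text.
        candidate = raw.rstrip(".,;:!?)")
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal): skip it.
            continue
        if host:
            hosts.add(host)
    return hosts


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a: str, mail_b: str, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: compare URL host sets against a fixed threshold."""
    return jaccard(url_features(mail_a), url_features(mail_b)) >= threshold
```

In this sketch, two mails whose visible text has been obfuscated differently but which advertise the same host would still be flagged, which is the intuition behind relying on URL information rather than on the full message text.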
