Journal: Knowledge and Information Systems

Highly discriminative statistical features for email classification


Abstract

This paper reports on email classification and filtering based on content features, specifically spam versus ham and phishing versus spam classification. We test the validity of several novel statistical feature extraction methods, which rely on dimensionality reduction to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora: four ground-truth standard corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods for email classification on different benchmarking corpora, and evidence that biased discriminant analysis in particular yields more discriminative features for classification, gives stable classification results regardless of the number of features chosen, and retains this discriminative value robustly over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules for filtering emails are built from a limited number of features.
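The two evaluation schemas can be illustrated with a minimal sketch. It is not the authors' code. The function names `k_fold_indices` and `chronological_split` are hypothetical, and the `train_fraction` parameter is an assumption, since the abstract does not state how the date-sorted data were divided. The first function produces the index splits for k-fold cross-validation (k = 10 in the paper). The second trains on the oldest messages and tests on the newest, which is one plausible way to probe the "anticipatory" behaviour of a model.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_items) into k (train, test) index pairs.

    Fold i holds out every k-th item starting at offset i, so fold sizes
    differ by at most one. Shuffle the data beforehand if its order is
    not random.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    folds = []
    for i in range(k):
        test = list(range(i, n_items, k))
        held_out = set(test)
        train = [j for j in range(n_items) if j not in held_out]
        folds.append((train, test))
    return folds


def chronological_split(
    dates: Sequence, train_fraction: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Return (train, test) indices: oldest messages first, newest last.

    train_fraction is an illustrative assumption, not a value from the paper.
    """
    order = sorted(range(len(dates)), key=lambda j: dates[j])
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]
```

In the cross-validation schema every message is tested exactly once. In the chronological schema no test message is older than any training message, so the result reflects how well features learned from past mail carry over to later mail.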
