Highly discriminative statistical features for email classification

Juan Carlos Gomez; Erik Boiy; Marie-Francine Moens

Abstract

This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora: four standard ground-truth corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods for email classification on different benchmark corpora, and evidence that biased discriminant analysis in particular offers more discriminative features for classification, gives stable classification results regardless of the number of features chosen, and produces features that robustly retain their discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules are built from a limited number of features for filtering emails.
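
The two evaluation schemas named in the abstract can be illustrated with a minimal sketch. It only shows how the data could be partitioned: shuffled 10-fold cross-validation for the first schema, and a train-on-older, test-on-newer split for the date-sorted second schema. The function names, the `train_fraction` parameter, and the ISO date strings are assumptions made for illustration; the abstract does not state how the authors split their data by date, and no feature extraction or classifier is shown here.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; each item is in exactly one test fold."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    order = list(range(n_items))
    random.Random(seed).shuffle(order)  # shuffle once, reproducibly
    folds = [order[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i, test in enumerate(folds):
        # Training set is every fold except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def chronological_split(dates: Sequence[str], train_fraction: float = 0.5) -> Tuple[List[int], List[int]]:
    """Train on the oldest emails and test on the newest ones.

    Assumes ISO-formatted dates (YYYY-MM-DD), which sort correctly as strings.
    The 0.5 default is an arbitrary illustrative choice, not taken from the paper.
    """
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]


if __name__ == "__main__":
    splits = k_fold_indices(25, k=10)
    print(len(splits), "folds; first test fold:", splits[0][1])
    train, test = chronological_split(["2007-05-02", "2007-04-01", "2007-06-10", "2007-04-15"])
    print("train on:", train, "test on:", test)
```

In the chronological split the model never sees emails newer than those it is tested on, which is what makes it a test of anticipatory behaviour, whereas shuffled cross-validation mixes old and new messages in every fold.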

Bibliographic record

Source: Knowledge and information systems, 2012, Issue 1, 31 pages
Authors: Juan Carlos Gomez; Erik Boiy; Marie-Francine Moens
Original format: PDF
Language: English
Chinese Library Classification: Automation system theory
Keywords: Data mining; Dimensionality reduction; Email classification; Feature extraction; Feature selection

Similar literature

1. Highly discriminative statistical features for email classification [J]. Juan Carlos Gomez, Erik Boiy, Marie-Francine Moens. Knowledge and information systems. 2012, Issue 1.
2. Social feature-based enterprise email classification without examining email contents [J]. Min-Feng Wang, Meng-Feng Tsai, Sie-Long Jheng. Journal of network and computer applications. 2012, Issue 2.
3. Feature Extraction or Feature Selection for Text Classification: A Case Study on Phishing Email Detection [J]. Masoumeh Zareapoor, Seeja K. R. International Journal of Information Engineering and Electronic Business. 2015, Issue 2.
4. Highly Discriminative Features for Phishing Email Classification by SVD [C]. Masoumeh Zareapoor, Pourya Shamsolmoali, M. Afshar Alam. International Conference on Information Systems Design and Intelligent Applications. 2015.
5. Detecting targeted malicious email through supervised classification of persistent threat and recipient oriented features [D]. Amin, Rohan Mahesh. 2010.
6. Lung nodule malignancy classification using only radiologist-quantified image features as inputs to statistical learning algorithms: probing the Lung Image Database Consortium dataset with two statistical learning methods [O]. Matthew C. Hancock, Jerry F. Magnan. 2016.
7. A new variance-based approach for discriminative feature extraction in machine hearing classification using spectrogram features [O]. Xie, Zhi-Peng, McLoughlin, Ian Vince, Zhang, Hao-min. 2016.
