Journal: IEEE Transactions on Knowledge and Data Engineering

A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning


Abstract

Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones by associating categorical values with intervals, thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied to continuous data, and the representation of information is simplified, making it more concise and specific. The literature provides numerous discretization proposals, along with some attempts to categorize them into a taxonomy. However, previous papers show no consensus on the definition of the properties, and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers has been widely considered, while many other methods have gone unnoticed. To alleviate these problems, this paper provides a survey of the discretization methods proposed in the literature from a theoretical and an empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all methods known to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers, and a large number of data sets. Their performance, measured in terms of accuracy, number of intervals, and inconsistency, has been verified by means of nonparametric statistical tests. Additionally, a set of discretizers is highlighted as the best-performing ones.
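
To make the idea of mapping intervals to categorical values concrete, the following is a minimal sketch in Python, not a method taken from the paper. It assumes equal-width binning as the discretizer (one of the simplest unsupervised schemes) and assumes the common definition of inconsistency: for each distinct discretized pattern, the number of instances that do not belong to that pattern's majority class, summed and divided by the total number of instances. The function names equal_width_cut_points, discretize, and inconsistency_rate are hypothetical and chosen for illustration only.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points of equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in.

    A value equal to a cut point is assigned to the interval on its right.
    """
    return [bisect_right(cut_points, v) for v in values]


def inconsistency_rate(discrete_rows, labels):
    """Assumed definition: share of instances outside the majority class
    of their discretized pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(discrete_rows, labels):
        groups[tuple(row)][label] += 1
    inconsistent = sum(
        sum(counts.values()) - max(counts.values())
        for counts in groups.values()
    )
    return inconsistent / len(labels)


# Toy data (invented for illustration): one continuous attribute, two classes.
values = [0.0, 1.5, 3.0, 4.5, 6.0, 9.0]
labels = ["a", "a", "a", "b", "b", "b"]

cuts = equal_width_cut_points(values, n_bins=3)
bins = discretize(values, cuts)
rate = inconsistency_rate([(b,) for b in bins], labels)
print(cuts, bins, rate)
```

The sketch also hints at the trade-off among the measures named in the abstract: fewer intervals give a simpler representation, but merging values from different classes into one interval raises the inconsistency.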
