
Feature Engineering for Data Analytics


Abstract

Data plays a fundamental role in modern science, engineering, and business applications. We investigate two important problems in data analytics. The first is feature selection, where we consider both the unsupervised and the supervised case. The second is data privacy, where we propose a new model and describe algorithms that improve data privacy under that model.

Feature selection is the process of removing redundant and irrelevant features from the data. There are two general classes of feature selection: the unsupervised case and the supervised case. In the unsupervised case, features are selected to approximate the entire data matrix; in the supervised case, features are selected to predict a set of labels from the data matrix.

For the unsupervised case we describe several new algorithms that are closely related to the A* heuristic search algorithm. These algorithms can effectively select features from large datasets and are shown experimentally to be more accurate than the current state of the art. The evaluation criterion for feature selection is typically an error measured in the Frobenius norm. We generalize this criterion to a large family of unitarily invariant norms, which includes, among others, the spectral norm, the nuclear norm, and the Schatten p-norms.

For the supervised case we propose several algorithms that reduce the running time and improve the accuracy of the current state of the art. A common approach for reducing the running time is to perform the selection in two stages. In the first stage a fast filter is applied to select good candidates; in the second stage the number of candidates is further reduced by an accurate algorithm that may run significantly slower. We describe a general framework that can use an arbitrary off-the-shelf unsupervised algorithm for the second stage, applied to the candidates obtained in the first stage, weighted appropriately.

Another common approach for accelerating feature selection is a greedy technique called "forward selection". We show how to use this technique to address the multi-label classification problem. Experimental results on real-world data demonstrate the effectiveness of the proposed approach.

We generalize the error criteria of forward selection to unitarily invariant functions. In particular, we show how to minimize Schatten p-norms to solve the outlier-robust PCA problem. The resulting algorithm is very efficient, and experimental results show that it outperforms a well-known outlier-robust PCA method based on convex optimization.

In standard machine learning and regression, feature values are used to predict some desired information from the data. One privacy concern is that the feature values may also expose information that one wishes to keep confidential. We propose a model that formulates this concern and show that such privacy can be achieved with almost no effect on the quality of predicting the desired information. We describe two algorithms for the case in which the prediction model starts with a linear operator; the desired effect is achieved by zeroing out the feature components that lie in the approximate null space of that operator.
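To make the forward-selection technique mentioned above concrete, the following is a minimal sketch (not the dissertation's algorithm) of greedy forward selection under a Frobenius-norm least-squares criterion. The function name forward_selection, the fixed budget k, and the toy data are illustrative assumptions only.

    import numpy as np

    def forward_selection(X, Y, k):
        """Greedy forward selection: repeatedly add the column of X that most
        reduces the least-squares error of predicting the targets Y.
        X: (n, d) data matrix, Y: (n, m) label matrix, k: number of features.
        Returns the list of selected column indices."""
        n, d = X.shape
        selected, remaining = [], set(range(d))
        for _ in range(k):
            best_j, best_err = None, np.inf
            for j in remaining:
                S = X[:, selected + [j]]
                # Least-squares fit of Y on the candidate subset; the residual
                # is measured in the Frobenius norm.
                coef, *_ = np.linalg.lstsq(S, Y, rcond=None)
                err = np.linalg.norm(Y - S @ coef, "fro")
                if err < best_err:
                    best_j, best_err = j, err
            selected.append(best_j)
            remaining.remove(best_j)
        return selected

    # Toy usage: 200 samples, 30 features, 3 targets; the labels depend on
    # columns 2, 7, and 11, which forward selection should recover.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 30))
    Y = X[:, [2, 7, 11]] @ rng.standard_normal((3, 3)) + 0.1 * rng.standard_normal((200, 3))
    print(forward_selection(X, Y, 5))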
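The privacy mechanism in the final paragraph can likewise be illustrated with a short sketch, under the assumption that the prediction model is a linear operator W applied as X @ W.T. Components of a feature vector that lie in the approximate null space of W contribute almost nothing to the prediction, so projecting each sample onto the row space of W zeroes them out while leaving the predictions essentially unchanged. The helper name sanitize_features and the threshold tol are hypothetical choices for this example, not part of the dissertation.

    import numpy as np

    def sanitize_features(X, W, tol=1e-3):
        """Zero out the components of each row of X that lie in the approximate
        null space of the linear predictor W (predictions = X @ W.T)."""
        # Right singular vectors with non-negligible singular values span the
        # effective row space of W; the rest span its approximate null space.
        _, s, Vt = np.linalg.svd(W, full_matrices=True)
        V_row = Vt[: len(s)][s > tol]
        # Project every sample onto the row space, discarding the
        # approximate-null-space component.
        return X @ V_row.T @ V_row

    # Toy usage: a 3x10 linear predictor has a 7-dimensional null space, so most
    # of each feature vector can be removed without changing the predictions.
    rng = np.random.default_rng(1)
    W = rng.standard_normal((3, 10))
    X = rng.standard_normal((50, 10))
    X_priv = sanitize_features(X, W)
    print(np.allclose(X @ W.T, X_priv @ W.T))   # True: predictions preserved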

Bibliographic record

  • Author

    Xu, Ke.

  • Affiliation

    The University of Texas at Dallas.

  • Degree-granting institution: The University of Texas at Dallas.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 134 p.
  • Total pages: 134
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification: Rehabilitation medicine
  • Keywords:

  • Date added: 2022-08-17 11:36:46

