Learning from partially labeled data: Unsupervised and semi-supervised learning on graphs and learning with distribution shifting.

Abstract

This thesis focuses on two fundamental machine learning problems: unsupervised learning, where no label information is available, and semi-supervised learning, where a small number of labels is given in addition to unlabeled data. These problems arise in many real-world applications, such as Web analysis and bioinformatics, where a large amount of data is available but little or no labeled data exists. Obtaining classification labels in these domains is usually quite difficult because it involves either manual labeling or physical experimentation. This thesis approaches these problems from two perspectives: graph based and distribution based.

First, I investigate a series of graph-based learning algorithms that are able to exploit information embedded in different types of graph structures. These algorithms allow label information to be shared between nodes in the graph, ultimately communicating information globally to yield effective unsupervised and semi-supervised learning. In particular, I extend existing graph-based learning algorithms, currently based on undirected graphs, to more general graph types, including directed graphs, hypergraphs and complex networks. These richer graph representations allow one to capture more naturally the intrinsic data relationships that exist, for example, in Web data, relational data, bioinformatics and social networks. For each of these generalized graph structures I show how information propagation can be characterized by a distinct random walk model, and then use this characterization to develop new unsupervised and semi-supervised learning algorithms.

Second, I investigate a more statistically oriented approach that explicitly models a learning scenario in which the training and test examples come from different distributions. This is a difficult situation for standard statistical learning approaches, since they typically assume that the training and test distributions are similar, if not identical. To achieve good performance in this scenario, I utilize unlabeled data to correct the bias between the training and test distributions. A key idea is to produce resampling weights for bias correction by working directly in a feature space, bypassing the problem of explicit density estimation. The technique can easily be applied to many different supervised learning algorithms, automatically adapting their behavior to cope with distribution shift between training and test data.
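As an illustration of the graph-based theme, the sketch below shows a standard label-propagation scheme on an undirected graph, in which label information spreads between neighbouring nodes through a normalized random-walk operator until the predictions stabilize. This is a minimal sketch of the general idea only, not the thesis's algorithms for directed graphs or hypergraphs; the symmetric normalization, the mixing parameter alpha, and the function names are assumptions made for illustration.

import numpy as np

def propagate_labels(W, Y, labeled_mask, alpha=0.9, n_iter=100):
    """W: (n, n) symmetric affinity matrix of an undirected graph.
    Y: (n, k) one-hot label matrix; rows for unlabeled nodes are zero.
    labeled_mask: boolean array of shape (n,) marking the labeled nodes."""
    d = W.sum(axis=1)
    d[d == 0] = 1.0                       # guard against isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt       # normalized propagation operator
    Y0 = np.where(labeled_mask[:, None], Y, 0.0)
    F = Y0.copy()
    for _ in range(n_iter):
        # spread labels along edges while keeping the labeled anchors fixed
        F = alpha * (S @ F) + (1 - alpha) * Y0
    return F.argmax(axis=1)               # predicted class index per node

# toy usage: a 4-node chain where only the two end nodes are labeled
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.zeros((4, 2))
Y[0, 0] = 1.0                             # node 0 labeled as class 0
Y[3, 1] = 1.0                             # node 3 labeled as class 1
mask = np.array([True, False, False, True])
print(propagate_labels(W, Y, mask))       # expected: [0 0 1 1]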
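For the distribution-based theme, the following is a minimal sketch of the idea of correcting training/test bias by reweighting training points directly in a feature space, so that their weighted feature mean matches the feature mean of the test set, without explicit density estimation. The explicit quadratic feature map, the ridge-regularized least-squares solve, and the clipping step are simplifying assumptions for illustration; a kernelized, constrained formulation would normally be solved instead.

import numpy as np

def feature_map(X):
    # explicit feature map: raw inputs and their squares (an assumption
    # standing in for a kernel-induced feature space)
    return np.hstack([X, X ** 2])

def covariate_shift_weights(X_train, X_test, ridge=1e-3):
    """Return one nonnegative resampling weight per training example."""
    Phi = feature_map(X_train)                    # (n_tr, d)
    mu_test = feature_map(X_test).mean(axis=0)    # test-set feature mean
    n_tr = Phi.shape[0]
    A = Phi / n_tr                                # want A.T @ beta ≈ mu_test
    # ridge-regularized least-squares solve for the weights beta
    beta = np.linalg.solve(A @ A.T + ridge * np.eye(n_tr), A @ mu_test)
    beta = np.clip(beta, 0.0, None)               # weights must stay nonnegative
    return beta * n_tr / max(beta.sum(), 1e-12)   # rescale so the mean weight is 1

The resulting weights can then be passed as per-example weights to any supervised learner that accepts them (for instance, a sample_weight argument, which many scikit-learn estimators support), so the same learning algorithm is reused unchanged while adapting to the shifted test distribution.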

Bibliographic details

  • Author              Huang, Jiayuan
  • Affiliation         University of Waterloo (Canada)
  • Degree grantor      University of Waterloo (Canada)
  • Subject             Computer Science
  • Degree              Ph.D.
  • Year                2007
  • Pages               169 p.
  • Total pages         169
  • Original format     PDF
  • Language            eng
  • CLC classification  Automation technology, computer technology
  • Keywords
