...
首页> 外文期刊>Knowledge and information systems >Overlapping correlation clustering
【24h】

Overlapping correlation clustering

机译:重叠相关聚类

获取原文
获取原文并翻译 | 示例
           

摘要

We introduce a new approach for finding overlapping clusters given pairwise similarities of objects. In particular, we relax the problem of correlation clustering by allowing an object to be assigned to more than one cluster. At the core of our approach is an optimization problem in which each data point is mapped to a small set of labels, representing membership in different clusters. The objective is to find a mapping so that the given similarities between objects agree as much as possible with similarities taken over their label sets. The number of labels can vary across objects. To define a similarity between label sets, we consider two measures: (i) a 0-1 function indicating whether the two label sets have nonzero intersection and (ii) the Jaccard coefficient between the two label sets. The algorithm we propose is an iterative local-search method. The definitions of label set similarity give rise to two non-trivial optimization problems, which, for the measures of set-intersection and Jaccard, we solve using a greedy strategy and non-negative least squares, respectively. We also develop a distributed version of our algorithm based on the BSP model and implement it using a Pregel framework. Our algorithm uses as input pairwise similarities of objects and can thus be applied when clustering structured objects for which feature vectors are not available. As a proof of concept, we apply our algorithms on three different and complex application domains: trajectories, amino-acid sequences, and textual documents.
机译:我们介绍了一种新方法,该方法可在给定对象的成对相似性的情况下找到重叠的簇。特别是,我们通过允许将一个对象分配给多个群集来缓解相关群集的问题。我们方法的核心是一个优化问题,其中每个数据点都映射到一小组标签,代表不同集群中的成员身份。目的是找到一种映射,以使对象之间的给定相似性与它们的标签集所获得的相似性尽可能一致。标签的数量可能会因对象而异。为了定义标签集之间的相似性,我们考虑两种方法:(i)0-1函数,指示两个标签集是否具有非零交集;(ii)两个标签集之间的雅卡系数。我们提出的算法是一种迭代的局部搜索方法。标签集相似性的定义引起了两个非平凡的优化问题,对于集交和Jaccard的度量,我们分别使用贪婪策略和非负最小二乘法进行求解。我们还基于BSP模型开发了算法的分布式版本,并使用Pregel框架实现了该算法。我们的算法将对象的成对相似性用作输入,因此可以在对特征向量不可用的结构化对象进行聚类时应用。作为概念证明,我们将算法应用于三个不同且复杂的应用域:轨迹,氨基酸序列和文本文档。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号