首页> 外文学位 >Efficient algorithms for mining arbitrary shaped clusters.
【24h】

Efficient algorithms for mining arbitrary shaped clusters.

机译:用于挖掘任意形状簇的高效算法。

获取原文
获取原文并翻译 | 示例

摘要

Clustering is one of the fundamental data mining tasks. Many different clustering paradigms have been developed over the years, which include partitional, hierarchical, mixture model based, density-based, spectral, subspace, and so on. Traditional algorithms approach clustering as an optimization problem, wherein the objective is to minimize certain quality metrics such as the squared error. The resulting clusters are convex polytopes in d-dimensional metric space. For clusters that have arbitrary shapes, such a strategy does not work well. Clusters with arbitrary shapes are observed in many areas of science. For instance, spatial data gathered from Geographic Information Systems, data from weather satellites, data from studies on epidemiology and sensor data rarely possess regular shaped clusters. Image segmentation is an area of technology that deals extensively with arbitrary shaped regions and boundaries. In addition to the complex shapes some of the above applications generate large volumes of data. The set of clustering algorithms that identify irregular shaped clusters are referred to as shape-based clustering algorithms. These algorithms are the focus of this thesis.;Existing methods for identifying arbitrary shaped clusters include density-based, hierarchical and spectral algorithms. These methods suffer either in terms of the memory or time complexity, which can be quadratic or even cubic. This shortcoming has restricted these algorithms to datasets of moderate sizes. In this thesis we propose SPARCL, a simple and scalable algorithm for finding clusters with arbitrary shapes and sizes. SPARCL has a linear space and time complexity. SPARCL consists of two stages---the first stage runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters. The second stage iteratively merges the generated clusters to obtain the final shape-based clusters. The merging stage is guided by a similarity metric between the seed clusters. Experiments conducted on a variety of datasets highlight the effectiveness, efficiency, and scalability of our approach. On large datasets SPARCL is an order of magnitude faster than the best existing approaches. SPARCL can identify irregular shaped clusters that are full-dimensional, i.e., the clusters span all the input dimensions.;We also propose an alternate algorithm for shape-based clustering. In prior clustering algorithms the objects remain static whereas the cluster representatives are modified iteratively. We propose an algorithm based on the movement of objects under a systematic process. On convergence, the core structure (or the backbone) of each cluster is identified. From the core, we can identify the shape-based clusters more easily. The algorithm operates in an iterative manner. During each iteration, a point can either be subsumed (the term "globbing" is used in this text) by another representative point and/or it moves towards a dense neighborhood. The stopping condition for this iterative process is formulated as a MDL model selection criterion. Experiments on large datasets indicate that the new approach can be an order of magnitude faster, while maintaining clustering quality comparable with SPARCL. In the future, we plan to extend our work to identify subspace clusters. A subspace cluster spans a subset of the dimensions in the input space. The task of subspace clustering thus involves not only identifying the cluster members, but also the relevant dimensions for each cluster. Indexing spatial objects using the seed selection approach proposed in SPARCL is another line of work we intend to explore.
机译:群集是基本的数据挖掘任务之一。这些年来,已经开发了许多不同的聚类范例,包括分区,分层,基于混合模型,基于密度,光谱,子空间等。传统算法将聚类作为优化问题,其中目标是最小化某些质量指标,例如平方误差。所得的簇是d维度量空间中的凸多面体。对于形状任意的群集,这种策略效果不佳。在科学的许多领域都可以观察到具有任意形状的簇。例如,从地理信息系统收集的空间数据,气象卫星的数据,流行病学研究的数据和传感器数据很少具有规则形状的簇。图像分割是广泛涉及任意形状的区域和边界的技术领域。除了复杂的形状外,上述某些应用程序还会生成大量数据。标识不规则形状的聚类的聚类算法集称为基于形状的聚类算法。这些算法是本文的研究重点。现有的用于识别任意形状簇的方法包括基于密度的算法,分层算法和光谱算法。这些方法的内存或时间复杂度可能是二次的,甚至是三次的。该缺点将这些算法限制为中等大小的数据集。在本文中,我们提出了SPARCL,这是一种用于查找具有任意形状和大小的聚类的简单且可扩展的算法。 SPARCL具有线性的空间和时间复杂度。 SPARCL包含两个阶段-第一阶段运行经过仔细初始化的Kmeans算法版本,以生成许多小的种子簇。第二阶段迭代合并生成的聚类以获得最终的基于形状的聚类。合并阶段由种子簇之间的相似性度量指导。在各种数据集上进行的实验突出了我们方法的有效性,效率和可扩展性。在大型数据集上,SPARCL比现有的最佳方法快一个数量级。 SPARCL可以识别不规则形状的,全维的聚类,即该聚类跨越所有输入维。;我们还提出了一种基于形状的聚类的替代算法。在现有的聚类算法中,对象保持静态,而聚类代表被迭代地修改。我们提出了一种基于对象在系统过程中运动的算法。收敛时,将确定每个群集的核心结构(或主干)。从核心来看,我们可以更轻松地识别基于形状的聚类。该算法以迭代方式运行。在每次迭代期间,一个点可以被另一个代表点包含(在本文中使用术语“ globbing”),和/或它会向密集邻域移动。将该迭代过程的停止条件公式化为MDL模型选择标准。在大型数据集上进行的实验表明,新方法可以快一个数量级,同时保持与SPARCL相当的聚类质量。将来,我们计划扩展我们的工作以识别子空间集群。子空间集群跨越输入空间中维度的子集。因此,子空间聚类的任务不仅涉及识别聚类成员,而且还涉及每个聚类的相关维度。使用SPARCL中提出的种子选择方法对空间对象进行索引是我们打算探索的另一项工作。

著录项

  • 作者

    Chaoji, Vineet.;

  • 作者单位

    Rensselaer Polytechnic Institute.;

  • 授予单位 Rensselaer Polytechnic Institute.;
  • 学科 Computer Science.
  • 学位 Ph.D.
  • 年度 2009
  • 页码 116 p.
  • 总页数 116
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

  • 入库时间 2022-08-17 11:38:24

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号