首页> 外文学位 >Scalable model-based clustering algorithms for large databases and their applications.
【24h】

Scalable model-based clustering algorithms for large databases and their applications.

机译:适用于大型数据库及其应用程序的基于模型的可伸缩群集算法。

获取原文
获取原文并翻译 | 示例

摘要

With the unabated growth of data amassed from business, scientific and engineering disciplines, cluster analysis and other data mining functionalities, play a more and more important role. They can reveal previously unknown and potentially useful patterns and relations in large databases. One of the most significant challenges in data mining is scalability—effectively handling large databases with linear computational complexity and limited main memory.; This thesis addresses the scalability problem of the mixture model-based clustering algorithms for databases with large number of data items. It proposes a scalable model-based clustering framework. The basic idea is first to partition a large data set into subclusters roughly, and then to generate clusters from summary statistics of these subclusters effectively. The procedure may be manipulated by a controller. The framework consists of three modules: data summarization, in-memory clustering, and optional controller.; The data summarization module sums up data sets by only storing appropriate summary statistics. Besides the BIRCH'S data summarization procedure, there exist many approaches. Our adaptive grid-based data summarization procedure simply partitions data space and sum up the data items within the same cells into summary statistics. Based on the principle of Self-Organizing Map (SOM), we establish an expanding SOM and an integrated SOM for data summarization and projection. The two SOMs can generate better mappings than the traditional SOM in terms of both the quantization and the topological errors. It is also substantiated by the experiments where they can generate most accurate solutions of the travelling salesman problem so far in the neural network literature.; The in-memory clustering module generates mixture models from summary statistics of subclusters. We establish two model-based clustering algorithms based on the general Expectation-Maximization (EM) algorithm. If attributes are statistically independent, we use a clustering feature to represent a subcluster, and employ EMACF to generate clusters. Otherwise, our EMADS handles data summaries where the correlations are embedded. We prove the convergence of both algorithms. Combining with the grid-based data summarization procedure, we have two scalable clustering systems: the gEMACF and gEMADS algorithms. Similarly, the bEMACF and bEMADS algorithms work on the subclusters generated by the BIRCH's data summarization procedure. They can run one or two orders of magnitude faster than the classical model-based clustering algorithm and generate results with no or little loss of accuracy. Their effectiveness and efficiency are confirmed by experiments on both synthetic and real world data.; The last optional module provides a controller to our clustering system. We introduce genetic algorithms to guide the model-based clustering techniques and establish the GAXEM algorithm. It determines the optimal number of clusters automatically. Some preliminary experimental results substantiate that the GAXEM algorithm performs well on both synthetic and real world data sets.; Besides the clustering application of our model-based techniques, we also briefly discuss two other applications.
机译:随着业务,科学和工程学科,集群分析和其他数据挖掘功能的不断增长的数据增长,其作用越来越重要。它们可以揭示大型数据库中以前未知且可能有用的模式和关系。数据挖掘中最重大的挑战之一是可伸缩性-有效地处理具有线性计算复杂性和有限主内存的大型数据库。本文针对具有大量数据项的数据库,解决了基于混合模型的聚类算法的可扩展性问题。它提出了一个基于模型的可伸缩的集群框架。基本思想是首先将大数据集粗略地划分为子集群,然后根据这些子集群的摘要统计信息有效地生成集群。该过程可以由控制器来操纵。该框架由三个模块组成:数据汇总,内存集群和可选控制器。数据汇总模块仅通过存储适当的汇总统计信息来汇总数据集。除了BIRCH的数据汇总程序外,还有许多方法。我们的基于网格的自适应数据汇总过程仅对数据空间进行分区,并将同一单元格中的数据项汇总为汇总统计信息。基于自组织图(SOM)的原理,我们建立了一个扩展的SOM和一个集成的SOM,用于数据汇总和投影。就量化和拓扑误差而言,这两个SOM可以比传统SOM生成更好的映射。实验也证实了这一点,在神经网络文献中,它们可以生成迄今为止旅行商问题最准确的解决方案。内存中的群集模块根据子群集的摘要统计信息生成混合模型。我们基于一般期望最大化(EM)算法建立了两种基于模型的聚类算法。如果属性在统计上是独立的,则我们使用聚类功能来表示子聚类,并使用EMACF生成聚类。否则,我们的EMADS会处理嵌入关联的数据摘要。我们证明了这两种算法的收敛性。结合基于网格的数据汇总过程,我们有两个可扩展的集群系统:gEMACF和gEMADS算法。同样,bEMACF和bEMADS算法可处理BIRCH数据汇总过程生成的子群集。它们可以比传统的基于模型的聚类算法运行速度快一个或两个数量级,并且生成的结果几乎没有准确性损失。它们的有效性和效率已通过对合成数据和真实数据的实验证实。最后一个可选模块为我们的集群系统提供了一个控制器。我们引入遗传算法来指导基于模型的聚类技术并建立GAXEM算法。它会自动确定最佳群集数。一些初步的实验结果证实,GAXEM算法在合成数据集和现实世界数据集上均表现良好。除了基于模型的技术的聚类应用程序外,我们还将简要讨论另外两个应用程序。

著录项

  • 作者

    Jin, Huidong.;

  • 作者单位

    Chinese University of Hong Kong (People's Republic of China).;

  • 授予单位 Chinese University of Hong Kong (People's Republic of China).;
  • 学科 Computer Science.
  • 学位 Ph.D.
  • 年度 2002
  • 页码 204 p.
  • 总页数 204
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类 自动化技术、计算机技术;
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号