Journal: Progress in Artificial Intelligence

A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures



Abstract

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchical structure that, combined with the produced clusters, can be useful in managing documents: it makes browsing and navigation easier and quicker, and it leverages the structural relationships to return only information relevant to users' queries. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to gain insight into the connections among the leaf clusters. The feature vectors representing each document are derived from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving a significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups, and the results are promising.
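The abstract names an Entropy metric computed over topic-based feature vectors but does not define it. As a rough illustration only, the sketch below assumes the metric resembles Shannon entropy over a document's topic distribution. The function names `topic_entropy` and `dominant_topic` and the sample vectors are hypothetical and are not taken from the paper.

```python
import math


def topic_entropy(topic_weights):
    """Shannon entropy (in bits) of one document's topic distribution.

    Assumption: this is a generic entropy measure, not necessarily the
    Entropy metric defined in the paper.
    """
    total = sum(topic_weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive value")
    entropy = 0.0
    for weight in topic_weights:
        p = weight / total  # normalize so the weights form a distribution
        if p > 0:  # 0 * log(0) is treated as 0
            entropy -= p * math.log2(p)
    return entropy


def dominant_topic(topic_weights):
    """Index of the most frequent topic in the document (hypothetical helper)."""
    return max(range(len(topic_weights)), key=lambda i: topic_weights[i])


# Hypothetical topic vectors, e.g. as produced by a topic model.
focused = [0.97, 0.01, 0.01, 0.01]  # document dominated by one topic
mixed = [0.25, 0.25, 0.25, 0.25]    # document spread evenly over topics

print(dominant_topic(focused), topic_entropy(focused))
print(dominant_topic(mixed), topic_entropy(mixed))
```

A document concentrated on one topic yields low entropy, while a document spread evenly across topics yields the maximum for that number of topics. A frequency-based hierarchy could plausibly use such a signal when deciding how to split documents, though the paper's actual criteria may differ.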
