A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures

Abstract

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchical structure that, combined with the produced clusters, is useful for managing documents: it makes browsing and navigation easier and quicker, and it allows only the information relevant to users' queries to be returned by leveraging the structural relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm, which creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm, which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module, which uses graph theory to obtain insights into the connections between the leaf clusters. The feature vectors representing each document are derived from topic modeling. At the implementation level, the clustering method has been dockerized to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving a significant reduction in both memory consumption and computational time compared with existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups, and the results are promising.
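The abstract names the Identity, Entropy and Bin Similarity metrics but does not define them, so the following is only an illustrative sketch and not the authors' algorithm. It assumes that each document is represented by the relative frequency of its topics and that "Entropy" can be approximated by the Shannon entropy of that frequency vector. The function names (`topic_frequencies`, `shannon_entropy`, `split_by_entropy`) and the threshold value are hypothetical and chosen only for illustration. The sketch shows one binary split of the kind a threshold-based tree could perform.

```python
import math
from collections import Counter


def topic_frequencies(topic_ids, num_topics):
    """Relative frequency of each topic id in one document.

    `topic_ids` is assumed to be a list of topic assignments (one per
    token or sentence) produced by some topic model; this input format
    is an assumption, not taken from the paper.
    """
    if not topic_ids:
        raise ValueError("document has no topic assignments")
    counts = Counter(topic_ids)
    total = len(topic_ids)
    return [counts.get(t, 0) / total for t in range(num_topics)]


def shannon_entropy(freqs):
    """Shannon entropy (in bits) of a frequency vector; zero entries are skipped."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


def split_by_entropy(doc_freqs, threshold):
    """Split documents into two groups with a single entropy threshold.

    Returns (focused, mixed): names of documents whose entropy is at most
    `threshold`, and names of the remaining documents. This is a
    hypothetical stand-in for one node of a binary tree.
    """
    focused, mixed = [], []
    for name, freqs in doc_freqs.items():
        target = focused if shannon_entropy(freqs) <= threshold else mixed
        target.append(name)
    return focused, mixed


if __name__ == "__main__":
    docs = {
        "doc_a": topic_frequencies([0, 0, 0, 0], num_topics=2),
        "doc_b": topic_frequencies([0, 0, 1, 1], num_topics=2),
    }
    print(split_by_entropy(docs, threshold=0.5))
```

A document dominated by one topic has entropy close to zero, while a document spread evenly over topics has high entropy, which is why a single threshold can separate the two groups in this toy setting.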

Bibliographic details

Source
Progress in Artificial Intelligence | 2020, Issue 1 | 17 pages
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords
Hierarchical document clustering; Topic modeling; Docker; Performance testing