HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

Sethi Krishan Kumar; Ramesh Dharavath

Abstract

Frequent itemset mining is a data mining technique for discovering frequent patterns, which are used in prediction, association rule mining, classification, and similar tasks. The Apriori algorithm is an iterative algorithm for finding frequent itemsets in a transactional dataset. It scans the complete dataset in each iteration to generate frequent itemsets of different cardinality, which is acceptable for small data but not feasible for big data. The MapReduce framework provides a distributed environment for running Apriori on big transactional data. However, MapReduce is poorly suited to iterative processing, and performance degrades as a result. We introduce a novel algorithm named Hybrid Frequent Itemset Mining (HFIM), which uses the vertical layout of the dataset to avoid scanning the dataset in each iteration. The vertical dataset carries the information needed to find the support of each itemset. We also include some enhancements to reduce the number of candidate itemsets. The proposed algorithm is implemented on the Spark framework, which incorporates the concept of resilient distributed datasets and performs in-memory processing to reduce execution time. We compare the performance of HFIM with another Spark-based implementation of the Apriori algorithm on various datasets. Experimental results show that HFIM performs better in terms of execution time and space consumption.
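
The idea of using a vertical layout to avoid rescanning the dataset can be illustrated with a minimal single-machine sketch. This is not the authors' HFIM implementation and does not use Spark; it is a generic level-wise (Apriori-style) miner in which each item is mapped to the set of transaction ids containing it, so the support of a candidate is the size of a set intersection. The function names `to_vertical` and `frequent_itemsets`, and the use of an absolute `min_support` count, are assumptions made for illustration.

```python
from itertools import combinations


def to_vertical(transactions):
    """Map each item to the set of transaction ids (tidset) that contain it."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def frequent_itemsets(transactions, min_support):
    """Level-wise mining; support comes from tidset intersections.

    Hypothetical helper: min_support is an absolute transaction count.
    Returns a dict mapping each frequent itemset (sorted tuple) to its support.
    """
    vertical = to_vertical(transactions)  # the only pass over the raw data
    # Frequent 1-itemsets, each keyed by a one-element tuple.
    level = {(item,): tids for item, tids in vertical.items()
             if len(tids) >= min_support}
    result = {}
    while level:
        result.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        # Join pairs of frequent k-itemsets that share their first k-1 items.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue
            candidate = a + (b[-1],)
            # Apriori pruning: every k-subset of the candidate must be frequent.
            if any(sub not in level
                   for sub in combinations(candidate, len(a))):
                continue
            tids = level[a] & level[b]  # support without rescanning the data
            if len(tids) >= min_support:
                next_level[candidate] = tids
        level = next_level
    return result
```

For example, calling `frequent_itemsets([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}], 3)` reports the three single items with support 4 and the three pairs with support 3, while `("a", "b", "c")` is dropped because only two transactions contain all three items. In a distributed setting the tidsets would presumably be partitioned across workers, but that part is beyond what the abstract specifies.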


Bibliographic details

Source
Journal of Supercomputing | 2017, Issue 8 | pp. 3652-3668 | 17 pages
Authors
Sethi Krishan Kumar; Ramesh Dharavath
Author affiliations

Indian Inst Technol ISM, Dept Comp Sci & Engn, Dhanbad 826004, Jharkhand, India;

Indian Inst Technol ISM, Dept Comp Sci & Engn, Dhanbad 826004, Jharkhand, India;

Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Language: English
Keywords
Frequent pattern mining; Big data; Apache Spark; Apriori algorithm;


Similar literature

1. Efficient Subset-Lattice Algorithms for Mining Closed Frequent Itemsets and Maximal Frequent Itemsets in Data Streams [J]. Ye-In Chang, Chia-En Li, Wei-Hau Peng. International Journal of Electrical Engineering: Transactions of the Chinese Institute of Engineers, Series E. 2013, No. 2
2. Geo Map Visualization for Frequent Purchaser in Online Shopping Database Using an Algorithm LP-Growth for Mining Closed Frequent Itemsets [J]. M. Sinthuja, N. Puviarasan, P. Aruna. Procedia Computer Science. 2018, No. 1
3. A Survey of latest Algorithms for Frequent Itemset Mining in Data Stream [J]. U. Chandrasekhar, Sandeep Kumar K., Yakkala Uma Mahesh. International Journal of Advanced Computer Research. 2013, No. 9
4. Speeding up frequent itemset mining process on XML data using graphic processor [C]. Rathi Sheetal, Dhote C.A., Bangera Vivek. 2014 5th International Conference - Confluence The Next Generation Information Technology Summit. 2014
5. New algorithms for frequent sequential pattern and itemset data mining in certain and uncertain databases [D]. Peterson, Erich Allen. 2012
6. Genetic Programming and Frequent Itemset Mining to Identify Feature Selection Patterns of iEEG and fMRI Epilepsy Data [O]. Otis Smart, Lauren Burrell
7. Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms [O]. Heaton, Jeff. 2017
