Journal: Computer Networks

DSM-PLW: Single-pass mining of path traversal patterns over streaming Web click-sequences



Abstract

Mining Web click streams is an important data mining problem with broad applications. However, it is also a difficult problem since the streaming data possess some interesting characteristics, such as unknown or unbounded length, possibly a very fast arrival rate, inability to backtrack over previously arrived click-sequences, and a lack of system control over the order in which the data arrive. In this paper, we propose a projection-based, single-pass algorithm, called DSM-PLW (Data Stream Mining for Path traversal patterns in a Landmark Window), for online incremental mining of path traversal patterns over a continuous stream of maximal forward references generated at a rapid rate. According to the algorithm, each maximal forward reference of the stream is projected into a set of reference-suffix maximal forward references, and these reference-suffix maximal forward references are inserted into a new in-memory summary data structure, called SP-forest (Summary Path traversal pattern forest), which is an extended prefix tree-based data structure for storing essential information about frequent reference sequences of the stream so far. The set of all maximal reference sequences is determined from the SP-forest by a depth-first-search mechanism, called MRS-mining (Maximal Reference Sequence mining). Theoretical analysis and experimental studies show that the proposed algorithm has gently growing memory requirements and makes only one pass over the streaming data.
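The projection-and-insert step described in the abstract can be illustrated with a much simplified sketch. This is not the authors' SP-forest or MRS-mining procedure, whose details (such as pruning of infrequent entries to bound memory) are not given here; it is a plain prefix tree with counts. The names `Node`, `insert`, `maximal_sequences` and the `min_count` threshold are assumptions made for illustration, and a "reference sequence" is taken to mean a contiguous run of pages inside a maximal forward reference.

```python
from collections import defaultdict


class Node:
    """One node of a simplified prefix tree: a count plus child pages."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(Node)


def suffixes(reference):
    """Project one maximal forward reference onto all of its suffixes."""
    return [reference[i:] for i in range(len(reference))]


def insert(forest, reference):
    """Insert every suffix of `reference`, incrementing counts along each path.

    After insertion, the count on the node reached by a path from the root
    equals how often that path occurred as a contiguous run in the stream.
    """
    for suffix in suffixes(reference):
        node = forest
        for page in suffix:
            node = node.children[page]
            node.count += 1


def is_contiguous_sub(short, long):
    """Return True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_sequences(forest, min_count):
    """Depth-first search for frequent paths contained in no other frequent path."""
    leaves = []

    def dfs(node, path):
        extended = False
        for page, child in node.children.items():
            if child.count >= min_count:
                extended = True
                dfs(child, path + (page,))
        if path and not extended:
            # No frequent extension to the right: candidate maximal sequence.
            leaves.append(path)

    dfs(forest, ())
    # Drop candidates that are contiguous runs of a longer candidate.
    return sorted(
        p for p in leaves
        if not any(p != q and is_contiguous_sub(p, q) for q in leaves)
    )


# Hypothetical stream of maximal forward references (pages named a to d).
forest = Node()
for ref in [("a", "b", "c", "d"), ("a", "b", "c"), ("b", "c", "d"), ("a", "b")]:
    insert(forest, ref)

print(maximal_sequences(forest, min_count=2))
```

Each reference is read once and then discarded, which mirrors the single-pass constraint. Unlike the summary structure described in the abstract, this sketch never prunes, so its memory grows with the number of distinct paths seen.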
