Journal: Computer Networks

A framework for mining evolving trends in Web data streams using dynamic learning and retrospective validation

Abstract

The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, without any unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic, single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models, which may furthermore bring some interesting surprises when it comes to the behavior of some well-known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns in the harsh single-pass scenario. We propose a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments.
Our experiments confirm that the MinPC similarity generally performs better than the cosine similarity, and that this advantage can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unfolds. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, and under different trend sequencing scenarios.
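
The abstract does not give the formula for MinPC, so the following Python sketch rests on an assumption suggested by the name: that the measure is the minimum of precision and coverage between a profile and a session, both treated as sets of items. The function names and the toy data are hypothetical. For binary (set-valued) data, cosine similarity equals the geometric mean of precision and coverage, so a high value in one can partly mask a low value in the other, whereas a minimum cannot.

```python
from math import sqrt

def precision(profile: set, session: set) -> float:
    """Fraction of the profile's items that appear in the session."""
    return len(profile & session) / len(profile) if profile else 0.0

def coverage(profile: set, session: set) -> float:
    """Fraction of the session's items that the profile accounts for."""
    return len(profile & session) / len(session) if session else 0.0

def min_pc(profile: set, session: set) -> float:
    """Assumed form of MinPC: the smaller of precision and coverage."""
    return min(precision(profile, session), coverage(profile, session))

def cosine(profile: set, session: set) -> float:
    """Cosine similarity between two binary vectors, written with sets."""
    if not profile or not session:
        return 0.0
    return len(profile & session) / sqrt(len(profile) * len(session))

# Hypothetical data: a broad nine-page profile and a one-page session.
profile = {f"/page{i}" for i in range(9)}
session = {"/page0"}

print(f"precision = {precision(profile, session):.3f}")
print(f"coverage  = {coverage(profile, session):.3f}")
print(f"cosine    = {cosine(profile, session):.3f}")
print(f"min_pc    = {min_pc(profile, session):.3f}")
```

In this toy case coverage is perfect but precision is low, so the cosine score sits between the two while the assumed MinPC score drops to the precision value. This only illustrates the coupling the abstract describes and is not the authors' implementation.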
