Conference: International Teletraffic Congress

CLUE: Clustering for Mining Web URLs

Abstract

The Internet has witnessed the proliferation of applications and services that rely on HTTP as their application protocol. Users play games, read emails, watch videos, chat, and access web pages using their PC, which in turn downloads tens or hundreds of URLs to fetch all the objects needed to display the requested content. As a result, billions of URLs are observed in the network. When monitoring traffic, it is therefore increasingly important to have methodologies and tools that allow one to dig into this data and extract useful information. In this paper, we present CLUE, Clustering for URL Exploration, a methodology that leverages clustering algorithms, i.e., unsupervised techniques developed in the data mining field, to extract knowledge from passive observation of URLs carried by the network. This is a challenging problem given the unstructured format of URLs, which, being strings, call for specialized approaches. Inspired by text-mining algorithms, we introduce the concept of URL-distance and use it to compose clusters of URLs using the well-known DBSCAN algorithm. Experiments on actual datasets show encouraging results. Well-separated and consistent clusters emerge and allow us to identify, e.g., malicious traffic, advertising services, and third-party tracking systems. In a nutshell, our clustering algorithm offers the means to gain insight into the data carried by the network, with applications in the security and privacy protection fields.
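To make the idea of density-based clustering over a string distance concrete, here is a minimal sketch. The abstract does not define the URL-distance, so this sketch assumes a stand-in: Levenshtein edit distance normalized by the longer string's length. The names `url_distance` and `dbscan`, the sample URLs, and the values of `eps` and `min_pts` are all illustrative assumptions, not the authors' implementation or parameters.

```python
from typing import List, Optional

NOISE = -1  # label for URLs that belong to no cluster


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def url_distance(u: str, v: str) -> float:
    """ASSUMED stand-in for the paper's URL-distance: edit distance in [0, 1]."""
    longest = max(len(u), len(v))
    return edit_distance(u, v) / longest if longest else 0.0


def dbscan(urls: List[str], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN over pairwise URL distances (quadratic, for small inputs)."""
    n = len(urls)
    # Each point's eps-neighborhood; a point counts as its own neighbor.
    neighbors = [[j for j in range(n) if url_distance(urls[i], urls[j]) <= eps]
                 for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return [NOISE if lab is None else lab for lab in labels]


# Hypothetical sample URLs, chosen only to illustrate the behavior.
urls = [
    "tracker.example.com/pixel?id=1001",
    "tracker.example.com/pixel?id=1002",
    "tracker.example.com/pixel?id=2093",
    "cdn.example.org/static/img/logo.png",
    "cdn.example.org/static/img/icon.png",
    "cdn.example.org/static/img/hero.png",
    "mail.example.net/inbox",
]
labels = dbscan(urls, eps=0.3, min_pts=2)
for label, url in zip(labels, urls):
    print(label, url)
```

With these illustrative settings, the three tracker-style URLs fall into one cluster, the three CDN-style URLs into another, and the lone mail URL is labeled noise. A real deployment would need a distance tuned to URL structure and an index or sampling strategy, since the pairwise computation here does not scale to billions of URLs.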