Link contexts in classifier-guided topical crawlers

Gautam Pant; Padmini Srinivasan


Abstract

The context of a hyperlink, or link context, is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a support vector machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits both the words in the immediate vicinity of a hyperlink and the words of the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage. We analyze our results along various dimensions such as link context quality, topic difficulty, length of crawl, training data, and topic domain. The study was done using multiple crawls over 100 topics covering millions of pages, allowing us to derive statistically strong results.
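
The two cues compared in the abstract, the words near a hyperlink and the words of the whole parent page, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `LinkContextParser`, `link_contexts`, and `window` are hypothetical, the fixed-size word window is only one possible definition of a link context, and the support vector machine that would score these features is not shown.

```python
from html.parser import HTMLParser
import re


class LinkContextParser(HTMLParser):
    """Collect page tokens and record where each hyperlink's anchor text sits."""

    def __init__(self):
        super().__init__()
        self.tokens = []    # visible word tokens, in document order
        self.links = []     # (href, start_index, end_index) for each <a> element
        self._open = None   # (href, start_index) of the <a> currently open
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.tokens))

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
        elif tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.tokens)))
            self._open = None

    def handle_data(self, data):
        if not self._skip:
            self.tokens.extend(re.findall(r"[a-z0-9]+", data.lower()))


def link_contexts(html, window=5):
    """Return, per link, the words within `window` tokens of the anchor text
    (assumed definition of "immediate vicinity") and the full parent-page words."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.links:
        lo = max(0, start - window)
        hi = min(len(parser.tokens), end + window)
        contexts[href] = {
            "window": parser.tokens[lo:hi],  # anchor text plus nearby words
            "page": list(parser.tokens),     # every word on the parent page
        }
    return contexts
```

A crawler in the spirit of the abstract could turn both word lists into feature vectors and pass them to a trained classifier to rank unvisited URLs; how the paper combines the two cues is not specified in this record.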


Bibliographic details

Source: IEEE Transactions on Knowledge and Data Engineering, 2006, Issue 1, pp. 107-122 (16 pages)
Authors: Gautam Pant; Padmini Srinivasan
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords: Internet; data mining; hypermedia; information retrieval; search engines; support vector machines; Web information retrieval; Web mining; Web search; classifier-guided topical Web crawler; hyperlink context; information categorization; information navigation; support;


