Crawling the Hidden Web

Abstract

Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content "hidden" behind search forms, in large searchable electronic databases. In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web. We introduce a generic operational model of a hidden Web crawler and describe how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. We also present results from experiments conducted to test and validate our techniques.

Bibliographic details

Source: Twenty-Seventh International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy | 2001 | pp. 129-138 | 10 pages
Conference location: Roma (IT)
Authors: Sriram Raghavan; Hector Garcia-Molina
Author affiliation: Computer Science Department, Stanford University, Stanford, CA 94305, USA
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology

Similar literature

1. Focused crawling for the hidden web [J]. F. Can. Computing Reviews. 2017, No. 1.
2. Focused crawling for the hidden web [J]. Liakos Panagiotis, Ntoulas Alexandros, Labrinidis Alexandros. World Wide Web. 2016, No. 4.
3. HIGWGET-A Model for Crawling Secure Hidden WebPages [J]. K.F. Bharati, P. Premchand, A Govardhan. International Journal of Data Mining & Knowledge Management Process. 2013, No. 2.
4. Dark Web-Onion Hidden Service Discovery and Crawling for Profiling Morphing, Unstructured Crime and Vulnerabilities Prediction [C]. Romil Rawat, Anand Singh Rajawat, Vinod Mahor. International Conference on Electrical and Electronic Engineering. 2021.
5. Crawling and searching the hidden Web [D]. Ntoulas, Alexandros. 2006.
6. An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling [O]. R. Suganya Devi, D. Manjula, R. K. Siddharth. 2015.
7. Crawling the Hidden Web: An Approach to Dynamic Web Indexing [O]. Moumie Soulemane, Mohammad Rafiuzzaman, Hasan Mahmud. 2012.
