3rd International Universal Communications Symposium 2009

Development of a Large-scale Web Crawler and Search Engine Infrastructure

Abstract

This paper reports the ongoing development of a large-scale Web crawler and search engine infrastructure at the National Institute of Information and Communications Technology. The infrastructure has the following characteristics: (1) It collects one billion Japanese Web pages and keeps them up to date. (2) It selects 100 million of the collected pages and converts them into a standard data format to store the results of morphological analysis, dependency parsing, and synonym augmentation. (3) The selected set of pages is searchable and accessible to users. (4) The system achieves scalability by using a large-scale cluster machine for distributed data processing.

Bibliographic record

  • Source
  • Conference location: Tokyo (JP)
  • Author affiliations

    National Institute of Information and Communications Technology, 3-5 Hikaridai Seika-cho, Soraku-gun, Kyoto 619-0289, Japan;

    National Institute of Information and Communications Technology, 3-5 Hikaridai Seika-cho, Soraku-gun, Kyoto 619-0289, Japan;

    National Institute of Information and Communications Technology, 3-5 Hikaridai Seika-cho, Soraku-gun, Kyoto 619-0289, Japan;

    Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Kyoto 606-8501, Japan;

    National Institute of Information and Communications Technology, 3-5 Hikaridai Seika-cho, Soraku-gun, Kyoto 619-0289, Japan; Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan;

    National Institute of Information and Communications Technology, 3-5 Hikaridai Seika-cho, Soraku-gun, Kyoto 619-0289, Japan G;

  • Conference organizer
  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Communications
  • Keywords

    web information analysis; search engine; crawler;
