
Index compression and redundancy elimination in large textual collections.



Abstract

Large search engines process thousands of queries per second against collections of billions of web pages. They typically build inverted indexes over these collections to speed up query processing. The rapidly growing size of inverted indexes has been one of the most important challenges for search engines over the past decade, and search engines rely on highly optimized compression schemes to reduce index size and improve query throughput. Many index compression techniques have been studied in the literature.

Although search engines need to download millions of new web pages every day, a considerable proportion of these pages share a lot of content. This results in a large amount of data redundancy in both the web pages and the inverted indexes, and redundancy in the inverted indexes can significantly slow down query processing. While many index compression methods attempt to reduce redundancy within individual pages and postings lists, they could be improved significantly by taking better advantage of similarities between web pages. In addition, most previous work has focused on compressing the docID and frequency information stored in the index. However, it is also very important to compress position information, since it takes up much more space than docIDs or frequencies.

In this thesis, we focus on inverted index compression and query processing techniques. We study compression techniques for docIDs and frequencies combined with optimized document reordering techniques that exploit the similarities between web pages. We also study the compression of position data in inverted indexes. In addition, we study file synchronization techniques that reduce redundant data transfer over networks; search engines can use such techniques to save a large amount of network bandwidth. Our experimental results show that our techniques can significantly improve search engine performance.
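The abstract does not spell out a specific coding scheme, but a standard baseline for compressing the docID component of an inverted index is to store the gaps between consecutive sorted docIDs and encode those gaps with variable-byte coding; document reordering pays off precisely because it clusters similar pages together, which makes the gaps small. The Python sketch below illustrates only this general idea; the function names and the sample postings list are hypothetical and are not taken from the thesis.

```python
def encode_varbyte(values):
    """Variable-byte encode a list of non-negative integers (stop bit on the last byte)."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)      # lower 7 bits, more bytes follow
            v >>= 7
        out.append(v | 0x80)          # final byte: high bit marks end of value
    return bytes(out)

def decode_varbyte(data):
    """Decode a variable-byte stream back into a list of integers."""
    values, current, shift = [], 0, 0
    for b in data:
        if b & 0x80:                  # stop bit set: this byte finishes the value
            values.append(current | ((b & 0x7F) << shift))
            current, shift = 0, 0
        else:
            current |= b << shift
            shift += 7
    return values

def compress_postings(doc_ids):
    """Gap-encode a sorted docID list, then variable-byte compress the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return encode_varbyte(gaps)

def decompress_postings(data):
    """Invert compress_postings: decode the gaps and rebuild absolute docIDs."""
    doc_ids, total = [], 0
    for gap in decode_varbyte(data):
        total += gap
        doc_ids.append(total)
    return doc_ids

# Example: clustered docIDs (the effect of document reordering) produce small
# gaps, so the compressed postings list is short.
postings = [3, 5, 6, 9, 200, 201, 205]
blob = compress_postings(postings)
assert decompress_postings(blob) == postings
print(len(blob), "bytes for", len(postings), "docIDs")
```

Production engines typically replace variable-byte coding with more aggressive bit-oriented or word-aligned codes, but the gap-encoding step, and the benefit that document reordering brings to it, stay the same.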

Bibliographic record

  • Author: Yan, Hao.
  • Author affiliation: Polytechnic Institute of New York University.
  • Degree-granting institution: Polytechnic Institute of New York University.
  • Subject: Computer Science.
  • Degree: Ph.D.
  • Year: 2010
  • Pagination: 131 p.
  • Total pages: 131
  • Original format: PDF
  • Language: eng
  • CLC classification:
  • Keywords:
