
Index compression and redundancy elimination in large textual collections.



Abstract

Large search engines process thousands of queries per second against collections of billions of web pages. They typically build inverted indexes over these collections to speed up query processing. The rapidly growing size of inverted indexes has been one of the most important challenges for search engines over the past decade, and search engines rely on highly optimized compression schemes to reduce index size and improve query throughput. Many index compression techniques have been studied in the literature.

Although search engines need to download millions of new web pages every day, a considerable proportion of these pages share a lot of content. This results in a large amount of data redundancy in both the web pages and the inverted indexes, and redundancy in the inverted indexes can significantly slow down query processing. While many index compression methods attempt to reduce redundancy within individual pages and postings lists, they could be improved significantly by taking better advantage of similarities between web pages. In addition, most previous work has focused on compressing the docID and frequency information stored in the index. However, it is also very important to compress position information, since it takes up much more space than docIDs or frequencies.

In this thesis, we focus on inverted index compression and query processing techniques. We study compression techniques for docIDs and frequencies combined with optimized document reordering techniques that exploit the similarities between web pages. We also study the compression of position data in inverted indexes. In addition, we study file synchronization techniques that reduce redundant data transfer over networks; search engines can use such techniques to save a large amount of network bandwidth. Our experimental results show that our techniques can significantly improve search engine performance.
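The abstract does not spell out a specific coding scheme, but a standard baseline for compressing the docID component of an inverted index is to store the gaps between consecutive sorted docIDs and encode those gaps with variable-byte coding; document reordering pays off precisely because it clusters similar pages together, which makes the gaps small. The Python sketch below illustrates only this general idea; the function names and the sample postings list are hypothetical and are not taken from the thesis.

```python
def encode_varbyte(values):
    """Variable-byte encode a list of non-negative integers (stop bit on the last byte)."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)      # lower 7 bits, more bytes follow
            v >>= 7
        out.append(v | 0x80)          # final byte: high bit marks end of value
    return bytes(out)

def decode_varbyte(data):
    """Decode a variable-byte stream back into a list of integers."""
    values, current, shift = [], 0, 0
    for b in data:
        if b & 0x80:                  # stop bit set: this byte finishes the value
            values.append(current | ((b & 0x7F) << shift))
            current, shift = 0, 0
        else:
            current |= b << shift
            shift += 7
    return values

def compress_postings(doc_ids):
    """Gap-encode a sorted docID list, then variable-byte compress the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return encode_varbyte(gaps)

def decompress_postings(data):
    """Invert compress_postings: decode the gaps and rebuild absolute docIDs."""
    doc_ids, total = [], 0
    for gap in decode_varbyte(data):
        total += gap
        doc_ids.append(total)
    return doc_ids

# Example: clustered docIDs (the effect of document reordering) produce small
# gaps, so the compressed postings list is short.
postings = [3, 5, 6, 9, 200, 201, 205]
blob = compress_postings(postings)
assert decompress_postings(blob) == postings
print(len(blob), "bytes for", len(postings), "docIDs")
```

Production engines typically replace variable-byte coding with more aggressive bit-oriented or word-aligned codes, but the gap-encoding step, and the benefit that document reordering brings to it, stay the same.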

Bibliographic record

  • Author: Yan, Hao.
  • Author affiliation: Polytechnic Institute of New York University.
  • Degree-granting institution: Polytechnic Institute of New York University.
  • Subject: Computer Science.
  • Degree: Ph.D.
  • Year: 2010
  • Pagination: 131 p.
  • Total pages: 131
  • Original format: PDF
  • Language: eng
  • CLC classification:
  • Keywords:
