Information Processing & Management > Using structural contexts to compress semistructured text collections

Using structural contexts to compress semistructured text collections

Abstract

We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., each different XML tag). The intuition behind SCM is that the distribution of all the texts that belong to a given structure type should be similar, and different from that of other structure types. We focus mainly on semistatic models and test our idea with a word-based Huffman method. This is the standard for compressing large natural language text databases, because random access, partial decompression, and direct search of the compressed collection are possible. Our variant, dubbed SCMHuff, retains those features and improves on Huffman's compression ratios. Since storing separate models may not pay off when the distributions of different structure types are not different enough, we also present a heuristic that merges models so as to minimize the total size of the compressed database. This gives an additional improvement over the plain technique. A comparison against existing prototypes shows that, among the methods that permit random access to the collection, SCMHuff achieves the best compression ratios, 2-4% better than the closest alternative. From a pure compression perspective, we combine SCM with PPM modeling: a separate PPM model compresses the text inside each different structure type. The result, SCMPPM, permits neither random access nor direct search of the compressed text, but it yields 2-5% better compression ratios than other techniques for texts longer than 5 MB.
