Information Processing & Management > Using structural contexts to compress semistructured text collections

Using structural contexts to compress semistructured text collections

Abstract

We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., each different XML tag). The intuition behind SCM is that the distribution of all the texts that belong to a given structure type should be similar, and different from that of other structure types. We focus mainly on semistatic models and test our idea with a word-based Huffman method. This is the standard for compressing large natural language text databases, because random access, partial decompression, and direct search of the compressed collection are possible. Our variant, dubbed SCMHuff, retains those features and improves on Huffman's compression ratios. Since storing separate models may not pay off when the distributions of different structure types are not different enough, we also present a heuristic that merges models so as to minimize the total size of the compressed database. This gives an additional improvement over the plain technique. A comparison against existing prototypes shows that, among the methods that permit random access to the collection, SCMHuff achieves the best compression ratios, 2-4% better than the closest alternative. From a pure compression perspective, we combine SCM with PPM modeling: a separate PPM model compresses the text inside each different structure type. The result, SCMPPM, permits neither random access nor direct search of the compressed text, but it yields 2-5% better compression ratios than other techniques for texts longer than 5 MB.
