首页> 外文会议>String Processing and Information Retrieval >Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts
【24h】

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

机译:按原样处理文本文件:压缩文本,多字节字符文本和半结构化文本的模式匹配

获取原文

摘要

Techniques in processing text files "as is" are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the "as-is" principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file from such languages is a mixture of single-byte characters and multi-byte characters. Naive solution would be (1) to convert a given text into a fixed length encoded one and then apply any string matching routine to it; or (2) to directly search the text file byte after byte for (the encoding of) a pattern in which an extra work is needed for synchronization to avoid false detection. Both the solutions, however, sacrifice the searching speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code such as the Huffman code and variants of Unicode. We also generalize the technique so as to handle structured texts such as XML documents. Using this technique, we can avoid false detection of keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of searching speed.
机译:提出了按原样处理文本文件的技术,其中给定的文本文件无需修改即可处理。压缩模式匹配问题最早由Amir和Benson(1992)定义,是“按原样”原理的一个很好的例子。另一个示例是多字节字符文本上的字符串匹配,这是东方语言(如日语,韩语,中文和台湾语)常见的重要问题。来自此类语言的文本文件是单字节字符和多字节字符的混合。幼稚的解决方案是(1)将给定的文本转换为固定长度的编码文本,然后对其应用任何字符串匹配例程;或(2)直接在字节后面搜索文本文件,以寻找一种模式(该模式的编码),在该模式中,需要进行额外的同步工作以避免错误检测。但是,两种解决方案都牺牲了搜索速度。我们的算法在这样的多字节字符文本文件上以与普通ASCII文本文件相同的速度运行,没有错误检测。该技术适用于任何前缀代码,例如霍夫曼代码和Unicode变体。我们还对该技术进行了概括,以便处理诸如XML文档之类的结构化文本。使用这种技术,即使关键词是标签名称或属性描述的子字符串,我们也可以避免对关键词的错误检测,而不会牺牲搜索速度。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号