首页> 外文会议>String Processing and Information Retrieval >Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

【24h】

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

机译：按原样处理文本文件：压缩文本，多字节字符文本和半结构化文本的模式匹配

获取原文

页面导航

摘要
著录项
相似文献
相关主题

摘要

Techniques in processing text files "as is" are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the "as-is" principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file from such languages is a mixture of single-byte characters and multi-byte characters. Naive solution would be (1) to convert a given text into a fixed length encoded one and then apply any string matching routine to it; or (2) to directly search the text file byte after byte for (the encoding of) a pattern in which an extra work is needed for synchronization to avoid false detection. Both the solutions, however, sacrifice the searching speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code such as the Huffman code and variants of Unicode. We also generalize the technique so as to handle structured texts such as XML documents. Using this technique, we can avoid false detection of keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of searching speed.

机译：提出了按原样处理文本文件的技术，其中给定的文本文件无需修改即可处理。压缩模式匹配问题最早由Amir和Benson（1992）定义，是“按原样”原理的一个很好的例子。另一个示例是多字节字符文本上的字符串匹配，这是东方语言（如日语，韩语，中文和台湾语）常见的重要问题。来自此类语言的文本文件是单字节字符和多字节字符的混合。幼稚的解决方案是（1）将给定的文本转换为固定长度的编码文本，然后对其应用任何字符串匹配例程;或（2）直接在字节后面搜索文本文件，以寻找一种模式（该模式的编码），在该模式中，需要进行额外的同步工作以避免错误检测。但是，两种解决方案都牺牲了搜索速度。我们的算法在这样的多字节字符文本文件上以与普通ASCII文本文件相同的速度运行，没有错误检测。该技术适用于任何前缀代码，例如霍夫曼代码和Unicode变体。我们还对该技术进行了概括，以便处理诸如XML文档之类的结构化文本。使用这种技术，即使关键词是标签名称或属性描述的子字符串，我们也可以避免对关键词的错误检测，而不会牺牲搜索速度。

著录项

来源
《String Processing and Information Retrieval》|2002年|p.170-186|共17页
会议地点
作者
Masayuki Takeda; Satoru Miyamoto; Takuya Kida; Ayumi Shinohara; Shuichi Fukamachi; Takeshi Shinohara; Setsuo Arikawa;
展开▼
作者单位

展开▼
会议组织
原文格式 PDF
正文语种
中图分类自动化技术、计算机技术;
关键词

相似文献

外文文献
中文文献
专利

1. Text Character Extraction Implementation from Captured Handwritten Image to Text Conversionusing Template Matching Technique [J] . Seema Barate1, Chaitrali Kamthe1, Shweta Phadtare1, MATEC Web of Conferences . 2016,第2016期

机译：使用模板匹配技术从捕获的手写图像中提取文本字符以实现文本转换
2. Do fourth graders integrate text and picture in processing and learning from an illustrated science text? Evidence from eye-movement patterns [J] . Lucia Mason, Maria Caterina Tornatora, Patrik Pluchino Computers & education . 2013,第1期

机译：四年级学生是否在处理和学习插图科学课本中整合了文字和图片？眼动模式的证据
3. TEXT-MINING AND PATTERN-MATCHING BASED PREDICTION MODELS FOR DETECTING VULNERABLE FILES IN WEB APPLICATIONS [J] . Gupta Mukesh Kumar, Govil Mahesh Chandra, Singh Girdhari Journal of web engineering . 2018,第1a2期

机译：基于文本挖掘和模式匹配的Web应用程序中易受攻击文件的预测模型
4. Processing Text Files as Is: pattern Matching over Compressed Texts, Multi-byte Character Text, and Semi-structured Texts [C] . Masayuki Takeda, Satoru Miyamoto, Takuya Kida, International symposium on string processing and information retrieval . 2002

机译：处理文本文件原样：模式匹配压缩文本，多字节字符文本和半结构化文本
5. Compressed pattern matching for text and images. [D] . Tao, Tao. 2005

机译：文本和图像的压缩模式匹配。
6. Reading Dickens’s characters: Employing psycholinguistic methods to investigate the cognitive reality of patterns in texts [O] . Michaela Mahlberg, Kathy Conklin, Marie-Josée Bisson -1

机译：阅读狄更斯的性格：运用心理语言学方法研究文本模式的认知现实
7. Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts [O] . Masayuki Takeda, Satoru Miyamoto, Takuya Kida, 2002

机译：按原样处理文本文件：压缩文本，多字节字符文本和半结构化文本的模式匹配
8. A Survey of Standardization Efforts of Coded Character Sets for Text Processing [R] . Knoerdel, J. E. 1981

机译：文本处理编码字符集标准化工作概述

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

摘要

著录项

相似文献

相关主题

期刊订阅