首页> 外文会议>Discovery science >Scalable Detection of Frequent Substrings by Grammar-Based Compression
【24h】

Scalable Detection of Frequent Substrings by Grammar-Based Compression

机译:通过基于语法的压缩可伸缩地检测频繁的子字符串

获取原文
获取原文并翻译 | 示例

摘要

A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar (CFG) deriving the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a CFG as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor by Sakamoto et al. (2009) is especially suitable for detecting maximal common substrings as well as long frequent substrings. This is made possible thanks to the characteristics of edit-sensitive parsing (ESP) by Cormode and Muthukr-ishnan (2007), which was introduced to approximate a variant of edit distance. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove a lower bound for the length of extracted frequent patterns. We also examine the performance of our algorithm by experiments in DNA sequences and other compressible real world texts. Compared to the practical algorithm developed by Uno (2008), our algorithm is faster with large and repetitive strings.
机译:提出了通过压缩的可伸缩模式发现。字符串可以由确定性派生该字符串的上下文无关文法(CFG)表示。在这种基于语法的压缩框架中,该算法的目的是输出尽可能小的CFG。除此之外,优化问题是可以解决的。在这种近似算法中,Sakamoto等人的压缩机。 (2009)特别适用于检测最大的公共子字符串以及长的频繁子字符串。这要归功于Cormode和Muthukr-ishnan(2007)提出的编辑敏感解析(ESP)的特性,该特性被引入来近似编辑距离的变体。基于ESP,我们设计了一种线性时间算法,以近似地查找字符串中的所有频繁模式,并证明提取的频繁模式长度的下限。我们还通过在DNA序列和其他可压缩的真实世界文本中进行实验来检验算法的性能。与Uno(2008)开发的实用算法相比,我们的算法在处理大型且重复的字符串时速度更快。

著录项

  • 来源
    《Discovery science》|2011年|p.236-246|共11页
  • 会议地点 Espoo(FI);Espoo(FI)
  • 作者单位

    Kyushu Institute of Technology, 680-4 Kawazu, Iizuka-shi, Fukuoka, 820-8502;

    Kyushu University, 744 Motooka, Nishi-ku, Fukuoka-shi, Fukuoka 819-0395;

    Gakushuin University, 1-5-1 Mejiro Toshima Tokyo, 171-8588;

    Kyushu Institute of Technology, 680-4 Kawazu, Iizuka-shi, Fukuoka, 820-8502,PRESTO JST, 4-1-8 Honcho Kawaguchi, Saitama 332-0012, Japan;

  • 会议组织
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类 人工智能理论;
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号