Venue: International Conference on Computational Linguistics

A best-first anagram hashing filter for approximate string matching with generalized edit distance


Abstract

This paper presents an efficient method for approximate string matching against a lexicon. We define a filter that, for each source word, selects a small set of target lexical entries. The best match is then chosen from this set using generalized edit distance, in which each edit operation can be assigned an arbitrary weight. The filter combines a specialized hash function with best-first search. Our work extends and improves upon a previously proposed hash-based filter that was developed for matching with uniform-weight edit distance. We evaluate an approximate matching system built on the new best-first filter through several experiments on a historical corpus, using a set of weighted rules taken from the literature. We report running times and discuss how performance varies with different stopping criteria and target lexica. The results show that the filter is suitable for large rule sets and million-word corpora, and they encourage further development.
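The pipeline described in the abstract (hash-based filter, best-first search, then weighted edit distance) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the additive key `anagram_key` with its fifth-power term, the `(lhs, rhs, weight)` rule format, the stopping criteria `max_cost` and `max_candidates`, and the toy lexicon and rules. The idea it shows is that an order-independent key turns each rewrite rule into a numeric offset, so a priority queue can explore the cheapest rule combinations first and look up the resulting keys in a hashed lexicon.

```python
import heapq
from collections import Counter, defaultdict

def anagram_key(s):
    """Order-independent key: anagrams share a value (the power 5 is an assumption)."""
    return sum(ord(c) ** 5 for c in s)

def build_index(lexicon):
    """Map each key value to the set of lexicon words that hash to it."""
    index = defaultdict(set)
    for word in lexicon:
        index[anagram_key(word)].add(word)
    return index

def filter_candidates(source, index, rules, max_cost=1.0, max_candidates=10):
    """Best-first search over key values reachable by adding rule offsets."""
    have = Counter(source)
    # Keep positive-weight rules whose left-hand side can be drawn from the source.
    moves = [(w, anagram_key(rhs) - anagram_key(lhs))
             for lhs, rhs, w in rules
             if w > 0 and not (Counter(lhs) - have)]
    heap = [(0.0, anagram_key(source))]
    seen, found = set(), []
    while heap and len(found) < max_candidates:
        cost, key = heapq.heappop(heap)  # cheapest unexplored key first
        if key in seen:
            continue
        seen.add(key)
        found.extend(sorted(index.get(key, ())))
        for w, delta in moves:
            if cost + w <= max_cost:  # hypothetical stopping criterion
                heapq.heappush(heap, (cost + w, key + delta))
    return found[:max_candidates]

def weighted_distance(source, target, rules, default=1.0):
    """Edit distance where listed rewrites cost their weight, other edits cost `default`."""
    n, m = len(source), len(target)
    d = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                step = 0.0 if source[i - 1] == target[j - 1] else default
                d[i][j] = min(d[i][j], d[i - 1][j - 1] + step)
            if i:
                d[i][j] = min(d[i][j], d[i - 1][j] + default)  # deletion
            if j:
                d[i][j] = min(d[i][j], d[i][j - 1] + default)  # insertion
            for lhs, rhs, w in rules:
                a, b = i - len(lhs), j - len(rhs)
                if a >= 0 and b >= 0 and source[a:i] == lhs and target[b:j] == rhs:
                    d[i][j] = min(d[i][j], d[a][b] + w)
    return d[n][m]

def best_match(source, index, rules):
    """Filter first, then rank the few surviving candidates by weighted distance."""
    cands = filter_candidates(source, index, rules)
    return min(cands, key=lambda t: weighted_distance(source, t, rules), default=None)

# Toy data, invented for this sketch.
lexicon = ["unto", "into", "onto", "under"]
rules = [("v", "u", 0.1), ("y", "i", 0.2)]
index = build_index(lexicon)
print(best_match("vnto", index, rules))
```

Because the key ignores character order, the filter can return words that are not close under edit distance; the weighted distance in the second stage is what discards them. The sketch also only checks rule applicability against the original source word, which is a simplification.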
