2017 ACM/IEEE Joint Conference on Digital Libraries

Retrieving and Combining Repeated Passages to Improve OCR

Abstract

We present a novel approach to improving the output of optical character recognition (OCR) systems by first detecting duplicate passages in their output and then performing consensus decoding combined with a language model. This approach is orthogonal to, and may be combined with, previously proposed methods for combining the output of different OCR systems on the same image, or the output of the same OCR system on differently processed images of the same text. It may also be combined with methods for estimating the parameters of a noisy channel model of OCR errors. Additionally, the current method generalizes previous proposals for a simple majority-vote combination of known duplicated texts. On a corpus of historical newspapers, an annotated set of clusters has a baseline word error rate (WER) of 33%. A majority-vote procedure reaches 23% WER on passages where one or more duplicates were found, and consensus decoding combined with a language model achieves 18% WER. In a separate experiment, newspapers were aligned to very widely reprinted texts such as State of the Union speeches, producing clusters with up to 58 witnesses. Beyond 20 witnesses, majority vote outperforms language model rescoring, though the gap between them is much smaller in this experiment.
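
To make the simple majority-vote baseline and the WER metric mentioned above concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the witnesses have already been aligned character by character to equal length, with a hypothetical gap symbol `~` marking positions where a witness has no character, and it omits both passage detection and the language-model consensus decoding. The function names `majority_vote` and `word_error_rate` are illustrative only.

```python
from collections import Counter

GAP = "~"  # hypothetical gap symbol; assumed not to occur in the OCR text


def majority_vote(aligned_witnesses):
    """Pick the most frequent character in each column of pre-aligned witnesses."""
    if not aligned_witnesses:
        return ""
    length = len(aligned_witnesses[0])
    if any(len(w) != length for w in aligned_witnesses):
        raise ValueError("witnesses must be aligned to equal length")
    chosen = []
    for column in zip(*aligned_witnesses):
        # On a tie, Counter returns the character seen first (earliest witness).
        best, _count = Counter(column).most_common(1)[0]
        if best != GAP:  # a winning gap means "emit nothing here"
            chosen.append(best)
    return "".join(chosen)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


witnesses = [
    "the qnick brown fox",
    "tbe quick brown fox",
    "the quick hrown fox",
]
combined = majority_vote(witnesses)
print(combined)
print(word_error_rate("the quick brown fox", witnesses[0]))
```

In this toy input each witness has one OCR error in a different position, so the column-wise vote recovers the clean text. When most witnesses share the same error, a plain vote cannot correct it, which is one motivation for adding a language model.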
