Conference: Document Recognition and Retrieval XVII

Semi-supervised Learning For Detecting Text-lines in Noisy Document Images

Abstract

Document layout analysis is a key step in document image understanding, with wide applications in document digitization and reformatting. Identifying the correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines in noisy document images. Our framework consists of three steps. The first step is an initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separators, and vertical border noise; it efficiently removes vertical border noise from multi-column pages. The third step is an online classifier that is trained on the high-confidence line detection results from Step Two and filters out noise from the low-confidence lines. The classifier effectively removes speckle noise embedded inside the content zones.

We compare the performance of our algorithm with state-of-the-art work in the field on the UW-III database. As baselines, we choose the results reported by the Image Understanding Pattern Recognition Research (IUPR) and Scansoft Omnipage SDK 15.5. We evaluate performance at both the page-frame level and the text-line level. The results show that our system has a much lower false-alarm rate while maintaining a similar content detection rate. In addition, we show that our online training model generalizes better than algorithms that depend on offline training.
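The third step of the framework (training on high-confidence lines, then filtering the low-confidence ones) can be illustrated with a small sketch. This is not the paper's classifier: the abstract does not specify the features or the model, so the `Candidate` fields, the two features (`height`, `density`), the thresholds, and the simple z-score test below are all assumptions chosen only to show the per-page self-training idea.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Candidate:
    """A candidate text-line from the layout step (hypothetical structure)."""
    height: float       # bounding-box height in pixels
    density: float      # fraction of black pixels inside the box
    confidence: float   # score from the grouping step, assumed to lie in [0, 1]


def features(c):
    # Assumed feature vector; the paper's actual features are not given here.
    return (c.height, c.density)


def filter_low_confidence(candidates, conf_threshold=0.8, max_z=3.0):
    """Keep high-confidence lines plus low-confidence ones that resemble them."""
    trusted = [c for c in candidates if c.confidence >= conf_threshold]
    uncertain = [c for c in candidates if c.confidence < conf_threshold]
    if len(trusted) < 2:
        # Too little page-specific evidence to fit a model; keep everything.
        return list(candidates)

    # Fit per-feature mean and spread on this page's trusted lines only.
    columns = list(zip(*(features(c) for c in trusted)))
    means = [mean(col) for col in columns]
    # Floor the spread so near-identical trusted lines do not reject everything.
    stds = [max(pstdev(col), 1e-6 + 0.05 * abs(m))
            for col, m in zip(columns, means)]

    def looks_like_text(c):
        # Accept a candidate only if every feature is within max_z spreads.
        return all(abs(x - m) / s <= max_z
                   for x, m, s in zip(features(c), means, stds))

    return trusted + [c for c in uncertain if looks_like_text(c)]
```

Because the model is fitted on each page's own high-confidence lines, it needs no offline training set. Under these assumptions, a low-confidence candidate whose height and ink density match the page's trusted lines is kept, while a small, dense speckle is rejected.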
