Optical Character Recognition of Printed Persian/Arabic Documents.

Abstract

Texts are an important representation of language. Given the volume of text being generated and the historical value of some documents, it is imperative to use computers to read texts and make them editable and searchable. This task, however, is not trivial. Recreating human perception capabilities in artificial systems such as document readers is one of the major goals of pattern recognition research. After decades of research and improvements in computing capabilities, machine intelligence still hardly matches humans' ability to read typed or handwritten text. Although classical applications of Optical Character Recognition (OCR), such as reading machine-printed addresses in a mail-sorting machine, are considered solved, more complex scripts and handwritten texts push the limits of existing technology. Moreover, many existing OCR systems are language dependent, so improvements in OCR technology have been uneven across languages. For Persian in particular, research has been limited. Despite the need to process many Persian historical documents and to use OCR in a variety of applications, few Persian OCR systems achieve a good recognition rate.

Consequently, automatically reading typed Persian documents with close-to-human performance is still an open problem, and it is the main focus of this dissertation.

In this dissertation, after a literature survey of existing technology, we propose new techniques for two important preprocessing steps in any OCR system: skew detection and page segmentation. Then, rather than following the usual practice of character segmentation, we propose segmenting Persian documents into sub-words. Sub-word segmentation is chosen to avoid the challenges of segmenting highly cursive Persian text into isolated characters. For feature extraction, we propose a hybrid scheme that combines three commonly used methods, and we finally use a nonparametric classification method.

A large number of papers and patents advertise recognition rates near 100%. Such claims give the impression that the automation problem has been solved. Although OCR is widely used, its accuracy today is still far from a child's reading skills. Failures in some real applications show that performance problems still exist on composite and degraded documents and that there is still room for progress.
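The abstract does not say how sub-words are extracted, so the following is only an illustrative sketch, not the dissertation's method. It assumes a binarized, deskewed text line and treats each 8-connected blob of ink as a candidate sub-word, then orders the candidates right to left. The function names `connected_components` and `order_right_to_left` and the toy image are hypothetical. In real Persian text, dots and diacritics would appear as their own components and would still need to be merged with their base strokes.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected ink regions in a binary image.

    image: list of equal-length rows, 1 = ink, 0 = background.
    Returns inclusive bounding boxes (top, left, bottom, right),
    one per component, in the order each component is first met
    during a row-major scan.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def order_right_to_left(boxes):
    """Sort boxes by right edge, descending (Persian reading order)."""
    return sorted(boxes, key=lambda b: b[3], reverse=True)


# Toy text line with three separate strokes (hypothetical data).
line = [
    [1, 1, 0, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 0, 1, 1],
]
subwords = order_right_to_left(connected_components(line))
print(subwords)
```

Working at the level of connected strokes avoids having to decide where one cursive letter ends and the next begins, which is the difficulty the abstract cites as the reason for preferring sub-words over isolated characters.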

Bibliographic record

  • Author: Shafii, Mahnaz
  • Author affiliation: University of Windsor (Canada)
  • Degree-granting institution: University of Windsor (Canada)
  • Subjects: Electrical engineering; Computer engineering; Computer science
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 120 p.
  • Total pages: 120
  • Original format: PDF
  • Language of text: English (eng)