Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

An Unsupervised Model of Orthographic Variation for Historical Document Transcription


Abstract

Historical documents frequently exhibit extensive orthographic variation, including archaic spellings and obsolete shorthand. OCR tools typically seek to produce so-called diplomatic transcriptions that preserve these variants, but many end tasks require transcriptions with normalized orthography. In this paper, we present a novel joint transcription model that learns, unsupervised, a probabilistic mapping between modern orthography and that used in the document. Our system thus produces dual diplomatic and normalized transcriptions simultaneously, and achieves a 35% relative error reduction over a state-of-the-art OCR model on diplomatic transcription, and a 46% reduction on normalized transcription.
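
To make the idea of a probabilistic mapping between modern and document orthography concrete, the following is a minimal sketch and not the authors' model: it assumes a hand-written, one-to-one character substitution table (the abstract's "obsolete shorthand" would need many-to-one mappings, which are omitted here) and scores candidate modern words against an observed diplomatic spelling. The names `SUBSTITUTION`, `glyph_prob`, `log_prob`, `normalize`, and `relative_error_reduction` are hypothetical, and every probability is an illustrative placeholder rather than a learned value.

```python
import math

# Hypothetical table of P(document glyph | modern glyph).
# Values are illustrative placeholders, not estimates from any corpus.
SUBSTITUTION = {
    "s": {"s": 0.6, "ſ": 0.4},  # long s
    "u": {"u": 0.7, "v": 0.3},
    "v": {"v": 0.8, "u": 0.2},
}


def glyph_prob(modern, diplomatic):
    """P(diplomatic | modern); glyphs without a table entry map to themselves."""
    table = SUBSTITUTION.get(modern, {modern: 1.0})
    return table.get(diplomatic, 0.0)


def log_prob(modern_word, diplomatic_word):
    """Log-probability of a diplomatic spelling given a modern word.

    Only one-to-one character substitutions are modeled, so words of
    different lengths get probability zero (log-probability -inf).
    """
    if len(modern_word) != len(diplomatic_word):
        return float("-inf")
    total = 0.0
    for m, d in zip(modern_word, diplomatic_word):
        p = glyph_prob(m, d)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def normalize(diplomatic_word, vocabulary):
    """Return the modern word that best explains the observed spelling.

    If no candidate has non-zero probability, the first vocabulary
    entry is returned, because max() keeps the first of tied values.
    """
    return max(vocabulary, key=lambda w: log_prob(w, diplomatic_word))


def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline_error - new_error) / baseline_error
```

For example, `normalize("vſe", ["use", "vie", "ruse"])` selects `"use"`, since it is the only candidate with non-zero probability under the table. The helper `relative_error_reduction` shows how a figure such as the reported 35% is computed from two error rates; the abstract does not state the absolute rates, so any inputs passed to it here are made up.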
