
Relating Articles Textually and Visually



Abstract

Historical documents have undergone large-scale digitization in recent years, placing massive image collections online. Optical character recognition (OCR) often performs poorly on such material, which makes searching within these resources problematic and textual analysis of the documents difficult. We present two approaches to overcome this obstacle, one textual and one visual. We show that, for tasks like finding newspaper articles related by topic, poor-quality OCR text suffices. An ordinary vector-space model is used to represent articles. Further improvement is obtained by adding words with similar distributional representations. As an alternative to OCR-based methods, one can perform image-based search using word spotting. Synthetic images are generated for every word in a lexicon, and word spotting is used to compile vectors of their occurrences. Retrieval is by standard nearest-neighbor search. The results of this visual approach are comparable to those obtained using noisy OCR. We report on experiments applying both methods, separately and together, to historical Hebrew newspapers, which pose the added problem of rich morphology.
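The textual pipeline described in the abstract (articles as vectors, retrieval by nearest-neighbor search) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state the term weighting or the similarity measure, so TF-IDF and cosine similarity are stand-ins, and the function names and toy English documents (one with a simulated OCR error) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed weighting, not taken from the paper).
        vectors.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(query, vectors, k=1):
    """Return the k (index, similarity) pairs closest to vectors[query]."""
    scored = [(i, cosine(vectors[query], v))
              for i, v in enumerate(vectors) if i != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy corpus; "harb0r" imitates a noisy OCR reading of "harbor".
docs = [
    "the harbor strike ended after talks".split(),
    "harb0r strike talks ended in the city".split(),
    "wheat prices rose in the market".split(),
]
vectors = tfidf_vectors(docs)
ranked = nearest(0, vectors, k=2)
```

In this toy corpus the second article still shares enough correctly recognized terms with the first to rank above the unrelated third one, despite the misread word, which is the intuition behind the claim that poor-quality OCR text can suffice for topic-level relatedness. The same nearest-neighbor step would apply unchanged to vectors of word-spotting occurrence counts.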
