Page Classification for Meta-data Extraction from Digital Collections


Abstract

Automatic extraction of meta-data from collections of scanned documents (books and journals) makes these digital collections more accessible. Classifying the layout of each page into a set of pre-defined classes can improve that extraction. In this paper we describe a method for classifying document images on the basis of their physical layout, which is described by a hierarchical representation: the Modified X-Y tree. The Modified X-Y tree describes a document through a recursive segmentation that alternates horizontal and vertical cuts along either spaces or lines. Each internal node of the tree represents a separator (a space or a line), whereas leaves represent regions in the page or separating lines. The Modified X-Y tree is built from a symbolic description of the document rather than directly from the image. The tree is then encoded into a fixed-size representation based on the occurrences of tree-patterns in the tree representing the page. Finally, this feature vector is fed to an artificial neural network that is trained to classify document images. The system is applied to the classification of documents belonging to Digital Libraries; examples of classes considered for a journal are "title page", "index", and "regular page". The system is tested on a data-set of more than 600 pages belonging to a 19th-century journal.
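
The pipeline in the abstract (recursive cuts, tree, fixed-size vector) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the symbolic page description is a non-empty list of bounding boxes, cuts only along white space (separating lines are ignored), and uses parent-child label pairs as a simplified stand-in for the paper's tree-patterns. The names `Node`, `split_by_gaps`, `build_tree`, `encode`, and the labels `HS`, `VS`, `R` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1): one region of the symbolic page description


@dataclass
class Node:
    # Assumed labels: "HS"/"VS" = cut along a horizontal/vertical space, "R" = region leaf.
    label: str
    children: List["Node"] = field(default_factory=list)


def split_by_gaps(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group a non-empty list of boxes into bands separated by empty space.

    axis=1 projects onto y (horizontal cuts); axis=0 projects onto x (vertical cuts).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups, current, end = [], [ordered[0]], ordered[0][axis + 2]
    for b in ordered[1:]:
        if b[axis] > end:  # white space between the current band and this box
            groups.append(current)
            current, end = [b], b[axis + 2]
        else:
            current.append(b)
            end = max(end, b[axis + 2])
    groups.append(current)
    return groups


def build_tree(boxes: List[Box], axis: int = 1, tried_other: bool = False) -> Node:
    """Recursively segment the page, alternating horizontal and vertical cuts."""
    if len(boxes) == 1:
        return Node("R")
    groups = split_by_gaps(boxes, axis)
    if len(groups) == 1:
        if tried_other:
            return Node("R")  # no cut possible in either direction: keep as one region
        return build_tree(boxes, 1 - axis, True)
    label = "HS" if axis == 1 else "VS"
    return Node(label, [build_tree(g, 1 - axis) for g in groups])


# Fixed pattern vocabulary, so every page maps to a vector of the same length.
PATTERNS = [("HS", "VS"), ("HS", "R"), ("VS", "HS"), ("VS", "R")]


def encode(tree: Node) -> List[int]:
    """Count occurrences of each parent-child label pair in the tree."""
    counts: Counter = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        for child in node.children:
            counts[(node.label, child.label)] += 1
            stack.append(child)
    return [counts[p] for p in PATTERNS]


# A title block spanning the page, with two text columns below it.
page = [(0, 0, 100, 10), (0, 20, 45, 100), (55, 20, 100, 100)]
print(encode(build_tree(page)))  # [1, 1, 0, 2]
```

A vector like this would then be the input to a classifier such as the neural network mentioned in the abstract; that training step is not sketched here because the abstract gives no details about it.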


Bibliographic details

Source: International Conference on Database and Expert Systems Applications, 2006, 10 pages
Original format: PDF
Chinese Library Classification: TP311.13

Similar literature

1. Visual Classification with Information Visualization (Infoviz) for Digital Library Collections [J] . Judith Gelernter Knowledge Organization . 2007, No. 3

2. Hagley Museum and Library, Digital Collections, http://digital.hagley.org/. Created by the Hagley Museum and Library. Managed by Kevin Martin, curator of digital collections. Reviewed May 14–16, 2009 [J] . Eric John Abrahamson The Journal of American History . 2010, No. 4

3. Organizing a personal image collection with statistical model-based ICL clustering on spatio-temporal camera phone meta-data [J] . A. Pigeau, M. Gelgon Journal of visual communication & image representation . 2004, No. 3

4. Page Classification for Meta-data Extraction from Digital Collections [C] . International Conference on Database and Expert Systems Applications . 2006

5. Leveraging georeferenced meta-data for the management of large video collections. [D] . Arslan Ay, Sakire. 2010

6. Collection and extraction of water level information from a digital river camera image dataset [O] . Sanita Vetra-Carvalho, Sarah L. Dance, David C. Mason, 2020

7. Page classification for meta-data extraction from digital collections [O] . Francesca Cesarini, Marco Lastri, Simone Marinai, 2001

8. The Planetary Data System - A Case Study in the Development and Management of Meta-Data for a Scientific Digital Library [R] . Hughes, J. 1998

