Automatic classification of older electronic texts into the Universal Decimal Classification-UDC

Matjaž Kragelj; Mirjana Kljajić Borštnar


Abstract

Purpose - The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.

Design/methodology/approach - The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.

Findings - Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.

Research limitations/implications - The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.

Practical implications - The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.

Social implications - The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.

Originality/value - These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.
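The finding that a model can assign the UDC correctly "at some level" can be made concrete with a small sketch. The abstract does not say how agreement was measured, so everything below is an assumption for illustration: the function names are hypothetical, only simple UDC main numbers are handled (no auxiliaries such as colons or parentheses), each digit is treated as one hierarchy level, and the sample pairs are made up rather than taken from the study.

```python
def udc_digits(notation: str) -> str:
    """Keep only the digits of a simple UDC main number, e.g. '821.163.6' -> '8211636'."""
    return "".join(ch for ch in notation if ch.isdigit())


def matching_level(predicted: str, assigned: str) -> int:
    """Count the leading digits on which two simple UDC numbers agree."""
    level = 0
    for p, a in zip(udc_digits(predicted), udc_digits(assigned)):
        if p != a:
            break
        level += 1
    return level


def share_correct_at_level(pairs, level: int) -> float:
    """Fraction of (predicted, assigned) pairs agreeing on at least `level` digits."""
    hits = sum(1 for pred, true in pairs if matching_level(pred, true) >= level)
    return hits / len(pairs)


# Made-up illustration data (model output, librarian-assigned UDC); not study results.
pairs = [
    ("821.163.6", "821.163.6"),  # agrees on all seven digits
    ("821.163.4", "821.163.6"),  # agrees on the first six digits
    ("94", "93"),                # agrees on the main class only
    ("51", "62"),                # no agreement
]

for depth in (1, 2, 7):
    print(depth, share_correct_at_level(pairs, depth))
```

Reporting agreement per depth, rather than as one exact-match score, would reflect that a recommendation matching only the broader class can still be useful to a librarian.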


Bibliographic details

Source
Journal of Documentation, 2021, Issue 3, pp. 755-776 (22 pages)
Authors
Matjaž Kragelj; Mirjana Kljajić Borštnar
Author affiliations

Information Technology Office, National and University Library, Ljubljana, Slovenia;

Department of Information Systems, Faculty of Organisational Sciences, University of Maribor, Kranj, Slovenia;

Indexed in: Science Citation Index (SCI), USA
Original format: PDF
Language of text: English
Keywords
Digital library; Artificial intelligence; Machine learning; Text classification; Older texts; Universal Decimal Classification;


