Journal: Journal of Documentation

Automatic classification of older electronic texts into the Universal Decimal Classification-UDC

Abstract

Purpose - The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.

Design/methodology/approach - The general research approach is inherent to design science research, in which the problem of UDC assignment of the old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model, which was used for classification of old texts on a corpus of 200,000 items. Human experts evaluated the performance of the model.

Findings - Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.

Research limitations/implications - The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.

Practical implications - The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.

Social implications - The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. These contribute to making knowledge more widely available and useable.

Originality/value - These findings contribute to the field of automated classification of bibliographical information with the usage of full texts, especially in cases in which the texts are old, unstructured and in which archaic language and vocabulary are used.