Journal of Data and Information Science

Automatic Classification of Swedish Metadata Using Dewey Decimal Classification: A Comparison of Approaches



Abstract

Purpose: With more and more digital collections of various information resources becoming available, the challenge of assigning subject index terms and classes from quality knowledge organization systems is also increasing. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.

Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels).

Findings: Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine but came close, with the benefit of a smaller representation size. The impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available on which to train; these figures hold only for the top 3 hierarchical levels (803 instead of 14,413 classes).

Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC because of the lack of training data for all classes skews the results so that they work in experimental conditions but barely for end users in operational retrieval systems.

Practical implications: In conclusion, for operative information retrieval systems, applying purely automatic DDC does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using a string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way for the future.

Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.
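The data reduction described under Design/methodology/approach can be illustrated with a minimal sketch. It assumes that the "top three hierarchical levels" correspond to the three digits before the decimal point of a DDC number, and that records arrive as (text, DDC number) pairs. The function names, the toy records, and the threshold of 2 are hypothetical and chosen only for illustration; the abstract itself cites 1,000 examples per class as the requirement.

```python
from collections import Counter


def top_three_levels(ddc_number: str) -> str:
    """Truncate a DDC number such as '948.502' to its three-digit class '948'.

    Assumption: the three digits before the decimal point encode the
    main class, division, and section (the top three levels).
    """
    digits = ddc_number.split(".")[0].strip()
    return digits[:3].zfill(3)


def filter_by_support(records, min_examples):
    """Keep only records whose truncated class has at least min_examples records."""
    truncated = [(text, top_three_levels(ddc)) for text, ddc in records]
    counts = Counter(label for _, label in truncated)
    return [(text, label) for text, label in truncated if counts[label] >= min_examples]


# Hypothetical toy records, not taken from the study's data set.
records = [
    ("Svensk historia", "948.5"),
    ("Sveriges historia under 1900-talet", "948.502"),
    ("Programmering i Python", "005.133"),
]

# A threshold of 2 is used here only so the toy data shows the effect.
kept = filter_by_support(records, min_examples=2)
print(kept)
```

With these toy records, only the two items that fall under class 948 survive the filter, which mirrors how sparse classes drop out of the training sample once a minimum number of examples per class is enforced.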
