Journal: The environmentalist

Active learning in automated text classification: a case study exploring bias in predicted model performance metrics


Abstract

Machine learning has emerged as a cost-effective innovation to support systematic literature reviews in human health risk assessments and other contexts. Supervised machine learning approaches rely on a training dataset, a relatively small set of documents with human-annotated labels indicating their topic, to build models that automatically classify a larger set of unclassified documents. "Active" machine learning has been proposed as an approach that limits the cost of creating a training dataset by interactively and sequentially focusing training on only the most informative documents. We simulate active learning using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. Subject matter experts had previously annotated the dataset for relevance to two topics relating to toxicology and risk assessment. We examine the performance of alternative sampling approaches for sequentially expanding the training dataset, specifically uncertainty-based sampling and probability-based sampling. We find that while such active learning methods can potentially reduce training dataset size compared to random sampling, predictions of model performance in active learning are likely to suffer from statistical bias that negates the method's potential benefits. We discuss approaches to compensating for the bias that results from skewed sampling, and the extent to which such compensation is possible. We propose a useful role for active learning in contexts in which the accuracy of model performance metrics is not critical and/or where it is beneficial to rapidly create a class-balanced training dataset.
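The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure. It uses a synthetic pool of documents with invented scores and labels, reads "probability-based sampling" as drawing documents with probability proportional to a model score, and uses the fraction of relevant documents as a stand-in for a performance metric. The function name `simulate`, its parameters, and the Beta(2, 5) score distribution are all assumptions made for illustration. The compensation shown is the standard Hansen-Hurwitz inverse-probability weighting for sampling with replacement, which may differ from the approaches the paper discusses.

```python
import random


def simulate(n_docs=7000, n_sample=1000, seed=0):
    """Compare naive and inverse-probability-weighted estimates of the
    fraction of relevant documents under score-proportional sampling.

    Everything here is synthetic: scores, labels, and sizes are assumptions.
    """
    rng = random.Random(seed)

    # Hypothetical pool: each document gets a model score in (0, 1), and its
    # true label is relevant (1) with probability equal to that score.
    scores = [rng.betavariate(2, 5) for _ in range(n_docs)]
    labels = [1 if rng.random() < s else 0 for s in scores]
    true_prevalence = sum(labels) / n_docs

    # Skewed sampling with replacement: high-scoring documents are more
    # likely to be selected for annotation.
    total = sum(scores)
    probs = [s / total for s in scores]
    picked = rng.choices(range(n_docs), weights=probs, k=n_sample)

    # Naive estimate: treat the skewed sample as if it were a random sample.
    naive = sum(labels[i] for i in picked) / n_sample

    # Hansen-Hurwitz estimate of the pool mean: each draw is weighted by
    # 1 / (N * p_i), which removes the selection bias in expectation.
    weighted = sum(labels[i] / (n_docs * probs[i]) for i in picked) / n_sample

    return true_prevalence, naive, weighted


if __name__ == "__main__":
    true_prevalence, naive, weighted = simulate()
    print(f"true fraction relevant: {true_prevalence:.3f}")
    print(f"naive estimate:         {naive:.3f}")
    print(f"weighted estimate:      {weighted:.3f}")
```

In this setup the naive estimate overstates the fraction of relevant documents because relevant documents are oversampled, while the weighted estimate is unbiased in expectation. The weighting only works because the selection probabilities are known and nonzero. Deterministic uncertainty-based selection does not provide such probabilities, which is one reason compensation can only go so far.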
