Conference: International Conference on Emerging Trends in Electrical, Electronic and Communications Engineering

A Qualitative Assessment of Machine Learning Support for Detecting Data Completeness and Accuracy Issues to Improve Data Analytics in Big Data for the Healthcare Industry


Abstract

Tackling data quality issues in Big Data can be challenging. For data cleansing activities, manual methods are inefficient because of the potentially very large amount of data. This paper aims to qualitatively assess the possibilities for using machine learning to detect data incompleteness and inaccuracy, since a previous research study by the authors found these two data quality dimensions to be the most significant. A review of the existing literature concludes that no single machine learning algorithm is best suited to deal with both incompleteness and inaccuracy of data. Various algorithms are selected from existing studies and applied to a representative big healthcare dataset. The experiments also show that implementing machine learning algorithms for Big Data quality activities encounters several challenges. These challenges relate to the amount of data that particular machine learning algorithms can scale to, and to data type restrictions imposed by some machine learning algorithms. The study concludes that 1) data imputation works better with linear regression models, and 2) clustering models are more efficient at detecting outliers, but fully automated systems may not be realistic in this context. Therefore, a certain level of human judgement is still needed.
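The two conclusions can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the specific algorithms, dataset, or parameters used, so the function names (`impute_with_regression`, `flag_small_clusters`), the single-predictor regression, the one-dimensional k-means, the `max_share` threshold, and the synthetic values below are all assumptions made for illustration. Flagged values are treated as candidates for human review rather than being corrected automatically, in line with the abstract's point about human judgement.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def impute_with_regression(rows, x_key, y_key):
    """Fill missing y_key values (None) using a line fitted on complete rows."""
    complete = [r for r in rows if r[x_key] is not None and r[y_key] is not None]
    a, b = fit_line([r[x_key] for r in complete], [r[y_key] for r in complete])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[y_key] is None and r[x_key] is not None:
            r[y_key] = a + b * r[x_key]
            r["imputed"] = True  # keep a marker so imputed values stay auditable
        filled.append(r)
    return filled


def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means (k >= 2) with centroids initialised evenly over the range."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def flag_small_clusters(values, k=2, max_share=0.2):
    """Flag values in clusters holding at most max_share of all points."""
    labels, _ = kmeans_1d(values, k)
    counts = {c: labels.count(c) for c in set(labels)}
    return [counts[l] / len(values) <= max_share for l in labels]


# Synthetic, made-up records: "dose" is missing in one row.
records = [
    {"weight": 1.0, "dose": 2.0},
    {"weight": 2.0, "dose": 4.0},
    {"weight": 3.0, "dose": None},
    {"weight": 4.0, "dose": 8.0},
]
filled = impute_with_regression(records, "weight", "dose")

# Synthetic readings with one implausible value.
readings = [70, 72, 68, 71, 69, 300]
flags = flag_small_clusters(readings)
suspects = [v for v, f in zip(readings, flags) if f]
```

In this sketch the small-cluster rule only surfaces suspicious values; deciding whether a flagged reading is an error or a genuine extreme is left to a reviewer.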
