Journal: JMIR Medical Informatics

Effective Training Data Extraction Method to Improve Influenza Outbreak Prediction from Online News Articles: Deep Learning Model Study



Abstract

Background: Each year, influenza affects 3 to 5 million people and causes 290,000 to 650,000 fatalities worldwide. To reduce the fatalities caused by influenza, several countries have established influenza surveillance systems to collect early warning data. However, proper and timely warnings are hindered by a 1- to 2-week delay between the actual disease outbreaks and the publication of surveillance data. To address the issue, novel methods for influenza surveillance and prediction using real-time internet data (such as search queries, microblogging, and news) have been proposed. Some of the currently popular approaches extract online data and use machine learning to predict influenza occurrences in a classification mode. However, many of these methods extract training data subjectively, and it is difficult to capture the latent characteristics of the data correctly. There is a critical need to devise new approaches that focus on extracting training data by reflecting the latent characteristics of the data.

Objective: In this paper, we propose an effective method to extract training data in a manner that reflects the hidden features and improves the performance by filtering and selecting only the keywords related to influenza before the prediction.

Methods: Although word embedding provides a distributed representation of words by encoding the hidden relationships between various tokens, we enhanced the word embeddings by selecting keywords related to the influenza outbreak and sorting the extracted keywords using the Pearson correlation coefficient in order to solely keep the tokens with high correlation with the actual influenza outbreak. The keyword extraction process was followed by a predictive model based on long short-term memory that predicts the influenza outbreak. To assess the performance of the proposed predictive model, we used and compared a variety of word embedding techniques.

Results: Word embedding without our proposed sorting process showed 0.8705 prediction accuracy when 50.2 keywords were selected on average. Conversely, word embedding using our proposed sorting process showed 0.8868 prediction accuracy and an improvement in prediction accuracy of 12.6%, although smaller amounts of training data were selected, with only 20.6 keywords on average.

Conclusions: The sorting stage empowers the embedding process, which improves the feature extraction process because it acts as a knowledge base for the prediction component. The model outperformed other current approaches that use flat extraction before prediction.
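
The keyword sorting step described in the Methods can be illustrated with a minimal Python sketch. It assumes each candidate keyword has a weekly frequency series aligned with a weekly influenza case series, scores each keyword by its Pearson correlation with the case series, and keeps only the highly correlated ones in descending order. The function names, the `min_r` cutoff of 0.5, and the toy data are hypothetical; the abstract does not state the threshold or the exact selection rule the authors used.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0.0:
        # A constant series has no defined correlation; treat it as 0.
        return 0.0
    return float((xc * yc).sum() / denom)


def rank_keywords(keyword_counts, outbreak_series, min_r=0.5):
    """Score keywords against the outbreak series and keep those with r >= min_r.

    min_r is an assumed cutoff for illustration, not a value from the paper.
    Returns (keyword, r) pairs sorted from highest to lowest correlation.
    """
    scored = [
        (kw, pearson_r(counts, outbreak_series))
        for kw, counts in keyword_counts.items()
    ]
    kept = [(kw, r) for kw, r in scored if r >= min_r]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy data invented for illustration: six weeks of case counts and
# per-week keyword frequencies in news articles.
outbreak = [10, 20, 40, 80, 60, 30]
keyword_counts = {
    "fever": [1, 2, 4, 8, 6, 3],     # rises and falls with the outbreak
    "vaccine": [2, 3, 5, 7, 6, 4],   # similar trend, weaker match
    "football": [5, 5, 5, 5, 5, 5],  # constant, carries no signal
    "holiday": [8, 6, 4, 1, 3, 6],   # moves against the outbreak
}

ranked = rank_keywords(keyword_counts, outbreak)
for keyword, r in ranked:
    print(f"{keyword}: r = {r:.3f}")
```

In this sketch the surviving keywords would then be the only tokens passed on to the embedding and long short-term memory stages, which is one way to read the reported drop from 50.2 to 20.6 selected keywords on average.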
