Journal: BMC Medical Informatics and Decision Making

A user-friendly tool to transform large scale administrative data into wide table format using a MapReduce program with a Pig Latin based script



Abstract

Background: Secondary use of large-scale administrative data is increasingly popular in health services and clinical research, where user-friendly tools for data management are in great demand. MapReduce technology such as Hadoop is promising for this purpose, but its use has been limited by the lack of user-friendly functions for transforming large-scale data into the wide table format, in which each subject is represented by one row, that health services and clinical research require. Since the original specification of Pig provides very few functions for column field management, we have developed a novel system called GroupFilterFormat to handle the definition of field and data content based on a Pig Latin script. We have also developed, as an open-source project, several user-defined functions to transform the table format using GroupFilterFormat and to handle processing that depends on date conditions.

Results: Having prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data for 950 million events, we ran processing speed and scaling benchmarks in the Elastic Compute Cloud environment provided by Amazon Inc. In the speed benchmark test, the response time was significantly reduced and a linear relationship was observed between the quantity of data and processing time, for both a small and a very large dataset. The scaling benchmark test showed clear scalability: doubling the number of nodes resulted in a 47% decrease in processing time.

Conclusions: Our newly developed system is widely accessible as an open resource. It is simple and easy to use for researchers who are accustomed to the declarative command syntax of commercial statistical software and Structured Query Language. Although the system needs further refinement to allow more flexibility in scripts and to improve efficiency in data processing, it shows promise in facilitating the application of MapReduce technology to efficient processing of large-scale administrative data in health services and clinical research.
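
To illustrate what "wide table format" means here, the following is a minimal single-machine sketch in Python. It is not GroupFilterFormat and does not use Pig or Hadoop. The records, the field names (`patient_id`, `date`, `code`), and the helper `to_wide` are all hypothetical, chosen only to show a long activity log being grouped into one row per subject.

```python
from collections import defaultdict

# Hypothetical long-format activity log: one row per (patient, event).
events = [
    {"patient_id": "P001", "date": "2010-04-01", "code": "xray"},
    {"patient_id": "P001", "date": "2010-04-03", "code": "surgery"},
    {"patient_id": "P002", "date": "2010-04-02", "code": "xray"},
    {"patient_id": "P001", "date": "2010-04-05", "code": "xray"},
]


def to_wide(rows, key, column, columns):
    """Group rows by `key` and count how often each value of `column` occurs.

    Returns one dict per distinct key, with one count field per entry
    in `columns` (illustrative helper, not part of any library).
    """
    grouped = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        counts = grouped[row[key]]  # creates a zero-filled row on first sight
        if row[column] in counts:
            counts[row[column]] += 1
    return [{key: k, **counts} for k, counts in sorted(grouped.items())]


wide = to_wide(events, "patient_id", "code", ["xray", "surgery"])
for record in wide:
    print(record)
```

Each output record corresponds to one subject, with one column per event type. In a MapReduce setting the grouping step would be distributed across nodes by key rather than held in a single in-memory dictionary.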

