Journal: BMC Genomics

High dimensional biological data retrieval optimization with NoSQL technology

Abstract

High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, querying relational databases for hundreds of different patient gene expression records is slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, promise better performance. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited to high-dimensional data storage and querying, optimized for database scalability and performance. We designed a key-value pair data model to support faster queries over large-scale microarray data and implemented it using HBase, an implementation of Google's BigTable storage system. We compared its performance experimentally against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set on Multiple Myeloma taken from NCBI GEO. The new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase in query performance compared to MongoDB. The evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that used in this evaluation. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
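
The abstract does not spell out the key layout, so the following is only a toy sketch of how a key-value model over sorted keys can serve per-patient expression queries. Everything in it is an assumption made for illustration: the composite key `study|patient|probe`, the `SortedKeyValueStore` class, and the sample values are hypothetical, and the store is an in-memory stand-in rather than HBase or the authors' schema.

```python
from bisect import bisect_left


class SortedKeyValueStore:
    """Toy in-memory table that keeps its keys sorted (hypothetical, not HBase)."""

    def __init__(self):
        self._keys = []    # sorted list of row keys
        self._values = {}  # row key -> expression value

    def put(self, key, value):
        # Insert new keys at their sorted position; overwrite existing ones.
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        # Keys sharing a prefix are contiguous in sorted order, so one
        # binary search plus a forward walk returns all of them.
        idx = bisect_left(self._keys, prefix)
        while idx < len(self._keys) and self._keys[idx].startswith(prefix):
            key = self._keys[idx]
            yield key, self._values[key]
            idx += 1


def row_key(study, patient, probe):
    # Assumed composite key: study first, then patient, then probe.
    return f"{study}|{patient}|{probe}"


def patient_profile(store, study, patient):
    # The trailing delimiter keeps "P001" from also matching "P0010".
    prefix = f"{study}|{patient}|"
    return {key.rsplit("|", 1)[1]: value
            for key, value in store.scan_prefix(prefix)}


store = SortedKeyValueStore()
store.put(row_key("STUDY1", "P001", "probe_a"), 7.1)
store.put(row_key("STUDY1", "P001", "probe_b"), 3.4)
store.put(row_key("STUDY1", "P0010", "probe_a"), 9.9)
store.put(row_key("STUDY1", "P002", "probe_a"), 5.0)

print(patient_profile(store, "STUDY1", "P001"))
```

Under this assumed layout, all records for one patient sit next to each other in key order, so retrieving a full expression profile is a single contiguous scan rather than a join across relational tables. Whether the paper's HBase schema orders its keys this way is not stated in the abstract.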