High dimensional biological data retrieval optimization with NoSQL technology

Shicai Wang; Ioannis Pandis; Chao Wu; Sijin He; David Johnson; Ibrahim Emam; Florian Guitton; Yike Guo


Abstract

High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries against relational databases for hundreds of different patient gene expression records are slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, promise better performance. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited to high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase compared to the implementation on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that used in the performance evaluation described in this paper.
We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
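
The abstract does not spell out the key design, so the following is only a minimal sketch of why a key-value layout can suit this kind of query. It assumes a hypothetical composite row key of study, patient and probe, and uses a toy in-memory sorted store (the names `SortedKVStore`, `row_key` and all identifiers and values are invented for illustration; this is not HBase and not the authors' schema). Because keys are kept in sorted order, all expression values for one patient sit in one contiguous key range and can be read with a single prefix scan rather than a join.

```python
from bisect import bisect_left, insort


class SortedKVStore:
    """Toy in-memory stand-in for a sorted key-value table (not HBase itself)."""

    def __init__(self):
        self._keys = []    # row keys kept in lexicographic order
        self._values = {}  # row key -> stored value

    def put(self, key, value):
        # Insert new keys in sorted position; overwrite existing ones in place.
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


def row_key(study, patient, probe):
    # Hypothetical composite key (an assumption, not taken from the paper).
    return f"{study}|{patient}|{probe}"


store = SortedKVStore()
# Made-up identifiers and expression values, for illustration only.
store.put(row_key("STUDY1", "P001", "probe_a"), 7.1)
store.put(row_key("STUDY1", "P001", "probe_b"), 5.4)
store.put(row_key("STUDY1", "P002", "probe_a"), 6.8)

# All values for one patient come back from one contiguous key range.
patient_rows = list(store.scan_prefix("STUDY1|P001|"))
print(patient_rows)
```

The trailing separator in the scan prefix keeps a patient such as `P001` from also matching a longer identifier that merely begins with the same characters.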

Bibliographic record

Source: BMC Genomics, 2014, issue 8
Original format: PDF
Chinese Library Classification: Medical genetics
