Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures

Abstract

In biology, researchers compare genes or gene products using semantic similarity measures (SSMs). Continuous data growth and diversity in data characteristics constitute what is called big data, which current biological SSMs cannot handle. These measures therefore need a way to manage the size of big data. We used parallel and distributed processing: we split the data into multiple partitions and applied the SSMs to each partition, which helped manage big data scalability and computational problems. Our solution involves three steps: splitting the gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] were enhanced with threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in the SSMs reduced the time needed to calculate semantic similarity between gene pairs and improved the performance of all three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, together with the split GO and data clustering algorithms that split input data based on similarity, reduced the average time more than dividing the input data equally did. The time reduction grew as the number of splits increased. With 2, 3, and 4 slaves, respectively, the time reduction was 24.1%, 39.2%, and 66.6% for Threaded SSDD and 33.0%, 78.2%, and 93.1% for Threaded SORA; for Threaded Resnik it was 92.04% with four slaves.
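The partition-then-score idea described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the ontology, the term probabilities, and the names `PARENTS`, `PROB`, `split`, `score_partition`, and `parallel_scores` are all hypothetical. The split shown is a plain equal (round-robin) division, which is the baseline the abstract compares against, not the similarity-based split GO and clustering step. Resnik similarity is taken in its usual form, the largest information content, -log p(c), over the common ancestors c of two terms.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Toy ontology (hypothetical, not real GO terms): term -> set of parent terms.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "a1": {"a"},
    "a2": {"a"},
    "b1": {"b"},
}

# Assumed annotation probabilities p(term); the root always has p = 1.
PROB = {"root": 1.0, "a": 0.5, "b": 0.5, "a1": 0.25, "a2": 0.25, "b1": 0.5}


def ancestors(term):
    """Return the term itself plus every ancestor reachable through PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(t1, t2):
    """Resnik similarity: maximum information content of a common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(PROB[c]) for c in common)


def split(pairs, n):
    """Divide the pairs into n roughly equal partitions (round-robin)."""
    return [pairs[i::n] for i in range(n)]


def score_partition(partition):
    """Score every term pair in one partition."""
    return [(a, b, resnik(a, b)) for a, b in partition]


def parallel_scores(pairs, n_workers=2):
    """Score each partition on its own thread and merge the results."""
    partitions = split(pairs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = pool.map(score_partition, partitions)
    return {(a, b): s for chunk in chunks for a, b, s in chunk}


if __name__ == "__main__":
    pairs = [("a1", "a2"), ("a1", "b1"), ("a2", "b1"), ("a", "a1")]
    for pair, score in parallel_scores(pairs, n_workers=2).items():
        print(pair, round(score, 4))
```

In standard CPython, threads do not speed up CPU-bound pure-Python code because of the global interpreter lock, so the sketch shows only the structure of the approach; the timings reported in the abstract come from the authors' own system.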


Bibliographic record

Journal: BioMed Research International
Authors: Ameera M. Almasoud; Hend S. Al-Khalifa; Abdulmalik S. Al-Salman
Year (volume), issue: 2006(2019)
Year: 2006
Page: 6750296
Total pages: 20
Original format: PDF
CLC classification: Biology

Similar literature

1. Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures [J]. Sajid Nagi, Dhruba K. Bhattacharyya. Network Modeling Analysis in Health Informatics and Bioinformatics. 2014, Issue 1Suppla
2. Semantic tracking and recommendation using fourfold similarity measure from large scale data using hadoop distributed framework in cloud [J]. Priyadarshini R., Latha Tamilselvan, Rajendran N. International journal of intelligent unmanned systems. 2019, Issue 4
3. A GO-driven semantic similarity measure for quantifying the biological relatedness of gene products [J]. Spiridon C. Denaxas, Christos Tjortjis. Intelligent decision technologies. 2009, Issue 4
4. Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data [C]. Charu C. Aggarwal. International Conference on Data Mining. 2006
5. Shared and distributed memory parallel algorithms to solve big data problems in biological, social network and spatial domain applications [D]. Sharma, Rahil. 2016
6. The use of semantic similarity measures for optimally integrating heterogeneous Gene Ontology data from large scale annotation pipelines [O]. Gaston K. Mazandu, Nicola J. Mulder. 2014
7. Shared and distributed memory parallel algorithms to solve big data problems in biological, social network and spatial domain applications [O]. Rahil Sharma
