Source: BioMed Research International (U.S. National Institutes of Health literature collection)

Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures

Abstract

In biology, researchers compare genes or gene products using semantic similarity measures (SSMs). Continuous data growth and diverse data characteristics constitute what is called big data, which current biological SSMs cannot handle; these measures therefore need a way to manage data at that scale. We applied parallel and distributed processing: the data are split into multiple partitions and an SSM is applied to each partition, which helps address the scalability and computational problems of big data. Our solution involves three steps: splitting the gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined for the first two steps and their performance was assessed. For the third step, three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] were enhanced with threaded parallel processing. Our results demonstrate that introducing threads reduced the time needed to calculate semantic similarity between gene pairs and improved the performance of all three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, together with the split GO and data clustering algorithms that split input data by similarity, reduced the average time more than dividing the input data equally did. The time reduction grew as the number of splits increased: 24.1%, 39.2%, and 66.6% for Threaded SSDD and 33.0%, 78.2%, and 93.1% for Threaded SORA with 2, 3, and 4 slaves, respectively, and 92.04% for Threaded Resnik with four slaves.
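The partition-then-score idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy ontology (`PARENTS`), the annotation counts (`COUNTS`), and the helper names (`split`, `score_partition`, `score_all`) are all hypothetical. The sketch uses the standard definition of Resnik similarity (the information content of the most informative common ancestor) and a plain round-robin split, which corresponds to the equal-division baseline rather than the similarity-based split GO and clustering steps described in the abstract.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy ontology: term -> set of direct parents (a small DAG).
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"a", "b"},
}

# Hypothetical annotation counts per term, already propagated to ancestors.
COUNTS = {"root": 10, "a": 6, "b": 5, "c": 2, "d": 3}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), with p(t) estimated from annotation counts."""
    return -math.log(COUNTS[term] / COUNTS["root"])


def resnik(pair):
    """Resnik similarity: IC of the most informative common ancestor."""
    t1, t2 = pair
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


def split(items, n):
    """Split items into n near-equal partitions (round-robin baseline)."""
    return [items[i::n] for i in range(n)]


def score_partition(partition):
    """Score every term pair in one partition."""
    return [(pair, resnik(pair)) for pair in partition]


def score_all(pairs, n_partitions=2):
    """Score each partition in its own worker thread and merge the results."""
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        chunks = list(pool.map(score_partition, split(pairs, n_partitions)))
    return {pair: score for chunk in chunks for pair, score in chunk}


if __name__ == "__main__":
    for pair, score in score_all([("c", "d"), ("c", "b"), ("d", "d")]).items():
        print(pair, round(score, 4))
```

In CPython, threads give little speedup for pure-Python CPU-bound work because of the global interpreter lock, so this sketch shows only the structure of the approach. The timing gains reported in the abstract come from the authors' own threaded and distributed setup.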