Journal: Information Technology and Libraries

Efficiently Processing and Storing Library Linked Data using Apache Spark and Parquet


Abstract

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after extracting it from traditional storage systems. However, because of the large volume of this data, processing and storing it has become difficult for traditional data-management tools. This challenge demands a scalable, distributed system that can manage data in parallel. This article proposes a distributed solution for efficiently processing and storing the large volume of library linked data held in traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in compressed form. The experimental evaluation showed that storage requirements were significantly lower than with Jena TDB, Sesame, RDF/XML, and N-Triples file formats. SPARQL queries are processed using Spark SQL to query the compressed data. The experimental evaluation also showed good query response times, which decrease significantly as the number of worker nodes increases.
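
The abstract does not spell out the proposed column-oriented schema. As a rough illustration of the general idea only, the sketch below assumes a vertical-partitioning layout in which each RDF predicate gets its own two-column (subject, object) table, so a lookup on one predicate never touches the others. It uses plain Python rather than Spark or Parquet. The function names, the simplified N-Triples parser, and the example.org sample data are all hypothetical and are not taken from the article.

```python
import re
from collections import defaultdict

# Matches one simple N-Triples line: <subject> <predicate> object .
# Simplified on purpose (assumption): no blank nodes, no escaped quotes.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def parse_ntriples(text):
    """Yield (subject, predicate, object) tuples from N-Triples text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unsupported N-Triples line: {line!r}")
        yield match.group(1), match.group(2), match.group(3)


def partition_by_predicate(triples):
    """Group triples into one (subject, object) table per predicate.

    Hypothetical layout: in a Spark/Parquet setting each table could be
    written as its own column-oriented file, but that is an assumption.
    """
    tables = defaultdict(list)
    for subject, predicate, obj in triples:
        tables[predicate].append((subject, obj))
    return dict(tables)


def select_objects(tables, predicate, subject):
    """Return all objects for a subject, scanning only one predicate table."""
    return [o for s, o in tables.get(predicate, []) if s == subject]


# Made-up sample records for illustration only.
SAMPLE = """
<http://example.org/book/1> <http://purl.org/dc/terms/title> "Linked Data" .
<http://example.org/book/1> <http://purl.org/dc/terms/creator> <http://example.org/person/7> .
<http://example.org/book/2> <http://purl.org/dc/terms/title> "Semantic Web" .
"""

tables = partition_by_predicate(parse_ntriples(SAMPLE))
titles = select_objects(
    tables, "http://purl.org/dc/terms/title", "http://example.org/book/1"
)
print(titles)
```

A query such as "the title of book 1" reads only the title table here. That is the kind of saving a column-oriented layout aims for, although the article's actual schema and its Spark SQL translation of SPARQL may differ.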
