S3QLRDF: distributed SPARQL query processing using Apache Spark—a comparative performance study

The proliferation of semantic data in the form of Resource Description Framework (RDF) triples demands efficient, scalable, and distributed storage along with a highly available and fault-tolerant parallel processing strategy. Three open issues of distributed RDF data management systems are not addressed well together in existing work. The first is querying efficiency; the second is that solutions are optimized for certain types of query patterns and do not necessarily work well for all types; and the third concerns reducing pre-processing cost. More precisely, the rapid growth of RDF data raises the need for an efficient partitioning strategy over distributed data management systems that improves SPARQL (SPARQL Protocol and RDF Query Language) query performance, regardless of the query's pattern shape, with minimized pre-processing time. In distributed RDF systems, both the data and the query processing are highly distributed. At the same time, SPARQL workloads are dynamic and structurally diverse and can have different degrees of complexity. A complex SPARQL query over a large RDF graph in a distributed system requires combining many distributed pieces of data through join operations. Therefore, designing an efficient data-partitioning schema and join strategy that minimize data transfer is the fundamental challenge in distributed RDF data management systems. In this context, we propose a new relational partitioning schema for RDF data called Property Table Partitioning (PTP), which further partitions the existing Property Table into multiple tables based on distinct properties (each comprising all subjects with non-null values for that property) in order to minimize the input size and the number of join operations of a query. This paper proposes a distributed RDF data management system called S3QLRDF, which is built on top of Spark and utilizes SQL to execute SPARQL queries over the PTP schema.
The experimental analysis of preprocessing costs and query performance, using synthetic and real datasets, shows that S3QLRDF outperforms state-of-the-art distributed RDF management systems.


Introduction
The Semantic Web refers to a Web of Data that associates the semantics of information and services on the web to provide machine-readable and processable data. RDF 1 is a data model proposed by the W3C to represent metadata about Web resources, helping search engines precisely locate and extract information on the Semantic Web. Recently, RDF has gained popularity for its flexible data model, which is used for publishing data on the Web through a number of applications and use cases in many areas such as social networks, commercial search engines, public knowledge bases, and databases. A growing number of organizations, institutions, and companies are adopting Semantic Web technologies to represent data in a semantically structured way and are thereby contributing to the Web of Data. The top search engine providers Google, Bing, Yahoo!, and Yandex have agreed to create a protocol (Schema.org 2 ) for a structured data vocabulary in order to define entities, actions, and relationships across the Internet, which helps search engines interpret the meaning of web pages more effectively and serve relevant results for users' search queries. To improve the accuracy of recommendations, recommender companies are increasingly using semantics and semantic tagging. DBpedia [1], YAGO [2], Bio2RDF [3], Google's Knowledge Vault [4], Probase [5], PubChemRDF [6], and the Universal Protein Resource (UniProtKB) [7] consist of billions of facts represented as RDF data contained in the Linked Open Data (LOD) [8] cloud. At present, existing Semantic Web databases include not only general Semantic Web datasets (e.g., DBpedia, YAGO) but also domain-specific knowledge bases for music (MusicBrainz 3 and The Music Ontology 4 ), biomedicine (SIDER, 5 Diseasome, 6 DrugBank, 7 and the Medicine Ontology 8 ), and geography (LinkedGeoData 9 ).
The use of Semantic Web technologies for domain-specific applications has gained momentum over recent years due to the infusion of machine learning models and deep neural networks. Therefore, we can expect the Semantic Web to grow steadily at web scale and produce a large amount of RDF data. This steady growth of RDF data necessitates an efficient RDF management solution for storing and querying these very large RDF graphs. Over the last decade, many RDF data management systems have been designed to provide scalable, highly available, and fault-tolerant RDF stores with efficient SPARQL 10 query processing for distributed environments (e.g., Partout [9], DREAM [10]). In the last few years, many distributed RDF management systems have been built on Big Data technologies like Hadoop (e.g., Rya [11], H2RDF+ [12], SHARD [13], CliqueSquare [14], PigSPARQL [15], Sempala [16], S2RDF [17], SPARQLGX [18], PRoST [19]). These RDF data processing systems rely on cluster computing engines based on MapReduce [20] as an execution layer or on in-memory frameworks such as Spark 11 and Impala [21]. In most cases, these systems are optimized for specific query patterns. Some of them trade longer data preprocessing time for better querying performance. Therefore, it is necessary to implement a distributed RDF management system that achieves efficient query performance on a wide range of query patterns with minimized preprocessing overhead, and that is the goal of this work.
To achieve this goal, we previously proposed an RDF data partitioning strategy that combines two existing approaches, the Property Table [22] and Vertical Partitioning (VP) [23], into the SPT + VP [24] storage layout. The combined SPT + VP RDF management solution outperforms state-of-the-art systems for all types of query patterns except a few complex-shaped queries. To overcome this query performance issue on complex query patterns, we further partition the Property Table into multiple tables based on distinct properties, yielding a new storage schema called Property Table Partitioning (PTP). For storing and querying RDF data, we use HDFS 12 (Hadoop Distributed File System) and the in-memory cluster computing framework Spark, one of the most important and popular Hadoop ecosystem components.
This paper extends the publication [25], where an initial version of the S3QLRDF system was presented, with the following novel contributions:
- An experimental evaluation of the Big Data file formats Parquet and ORC using the S3QLRDF system with the PTP schema, which demonstrates the impact of different file formats for storing RDF data.
- An empirical comparison of open-source Spark-based state-of-the-art systems (S3QLRDF, S2RDF, SPARQLGX, and PRoST) on the real datasets YAGO and DBLP, which confirms the effectiveness and applicability of our approach.
Given this outlook, the rest of this paper is organized as follows: Sect. 2 reviews related work. Section 3 introduces the background and preliminary definitions. A detailed overview of the S3QLRDF (SPARQL on Spark SQL for RDF) system is given in Sect. 4. An experimental evaluation of S3QLRDF against other state-of-the-art Hadoop-based SPARQL query engines is presented in Sect. 5. Section 6 illustrates the impact of different Big Data columnar file formats on the S3QLRDF system for RDF data management. A comparative performance evaluation of the S3QLRDF system against other state-of-the-art Spark-based SPARQL processors on real datasets is presented in Sect. 7. Section 8 concludes the paper.

Related Work
Over the past decade, many RDF data management systems have been built on distributed storage systems to provide efficient, scalable, highly available, and fault-tolerant services. These systems use various indexing and partitioning strategies on RDF elements to develop RDF storage layouts. In this section, we discuss the state-of-the-art distributed RDF management systems that are relevant to this work.
Rya [11] is implemented on top of the key-value store Accumulo 13 ; it stores RDF triples in the Row ID part of the Accumulo tables and indexes the triples across three separate tables (spo, pos, and osp) by maintaining a different ordering of subject, predicate, and object for each table. These three permutations (spo, pos, and osp) of triple components are sufficient to answer all possible triple patterns. When solving a SPARQL query, Rya executes the first subquery using a range scan on the relevant index, and for the subsequent subqueries it uses index lookups.
CliqueSquare [14] uses the built-in data replication mechanism of HDFS to partition the RDF dataset by hashing each triple on its subject, predicate, and object values, creating three replicas by default. The first replica holds the partitions of triples based on their subject, predicate, and object values. The second replica stores all subject, predicate, and object partitions of the same value within the same node. For the third replica, CliqueSquare groups all the subject partitions within a node by the value of the predicate in their triples; it also groups all object partitions based on their predicate values. CliqueSquare uses a clique-based algorithm to select the partitions so as to reduce data exchange in the shuffle phases as much as possible and minimize the number of MapReduce stages.
S2RDF [17] is built on top of Spark and uses a relational partitioning technique called Extended Vertical Partitioning (ExtVP), an extension of the VP [27] approach, to store RDF data on HDFS in the Parquet columnar storage format. The goal of ExtVP is to minimize the input size of a query by using a semi-join-based preprocessing approach to compute the possible join relations between partitions of VP tables. More specifically, it pre-computes semi-join reductions for subject-subject (SS), object-subject (OS), and subject-object (SO) correlations between triple patterns to avoid dangling triples that do not contribute to any join. For SPARQL query execution, the triples are joined via shared variables. S2RDF executes SPARQL queries by translating them into SQL queries, which are then evaluated using Spark SQL.
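The semi-join reduction idea can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not S2RDF's implementation; the VP tables `author` and `title` and their contents are hypothetical.

```python
# Simplified illustration of a semi-join reduction as used conceptually by
# ExtVP: keep only the tuples of one VP table that have a join partner in
# another, pruning dangling triples before the actual join is executed.
# Toy VP tables author(s, o) and title(s, o); the data is hypothetical.
author = [("Article_1", "John_Wayne"), ("Article_2", "Jane_Doe")]
title = [("Article_1", "Title One")]

def semijoin_ss(r, s):
    """r semi-join s on a subject-subject (SS) correlation: keep r-tuples
    whose subject also appears as a subject in s."""
    s_subjects = {t[0] for t in s}
    return [t for t in r if t[0] in s_subjects]

# ExtVP would materialize this reduced table once, at preprocessing time.
author_ss_title = semijoin_ss(author, title)
print(author_ss_title)  # only Article_1 survives; Article_2 is dangling
```

At query time, joining the reduced table instead of the full VP table reads fewer tuples and produces no dangling intermediate results.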
SPARQLGX [18], also built on top of Spark, uses the Vertical Partitioning approach proposed in [27] to store all RDF triples with the same property in one file on HDFS. To translate a SPARQL query, it parses the triple patterns one by one and maps them to Spark's RDD API. To deal with a group of triple patterns, it translates the first triple pattern and then searches for a common variable on which to join with the next one. In that way, the result of each sub-query is joined with the next one having a common variable with it; a cross-product is computed if no common variable is found between two triple patterns. The system uses its own statistics to optimize the computation toward fewer intermediate results.
PRoST [19] is a Spark-based distributed system for RDF storage and SPARQL querying that stores data twice, using VP and the Property Table, to leverage the strengths and minimize the shortcomings of both schemes. Star joins are evaluated on the Property Table, whereas other patterns and joins are addressed with the Vertical Partitioning tables. PRoST translates SPARQL queries into a Join Tree format where every node represents either a VP table or the Property Table. The triple patterns with the same subject in a basic graph pattern are grouped to form a single node, for which the Property Table is used; all the other groups, with a single triple pattern each, are translated to nodes that use the VP tables. Table 1 summarizes the distributed RDF systems discussed above.

Background
In this section, we briefly introduce background information on the RDF data and SPARQL query model followed by the Big Data technologies used.

RDF
RDF is a schema-free data model recommended by the W3C to describe information about any resource on the Web. An RDF dataset consists of a collection of triples (subject, predicate, object), abbreviated as (s, p, o). In an RDF triple (aka RDF statement), the subject denotes an entity or a class of resources; the predicate denotes the attribute or aspect and relationship (aka property) between entities or classes; and the object denotes an entity, class, or literal value. The RDF dataset represents triples as a directed graph with annotations called an RDF graph. Nodes of an RDF graph represent either subjects or objects, and edges represent properties. Each node can be an Internationalized Resource Identifier (IRI), a literal, or a blank node. Fig. 1 depicts an RDF graph with 16 edges of a simple publication network of an RDF dataset consisting of 16 triples, where ellipse nodes represent resources, directed edges represent properties, and rectangular nodes represent literal values. An RDF graph can also have blank nodes that represent resources without an URI or literal assignment. Let I, B, and L be infinite, pairwise disjoint sets of IRIs, blank nodes, and literals respectively. All valid RDF terms form the union I ∪ B ∪ L, denoted by T.
RDF Triple A ternary tuple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple, where s, p, and o denote subject, predicate, and object respectively.
RDF Graph An RDF graph G = {t1, …, tn} is a finite set of RDF triples ti, where 1 ≤ i ≤ n.
RDF Dataset An RDF dataset is a collection of RDF graphs D = {G0, (i1, G1), …, (in, Gn)} with i1, …, in ∈ I. Each pair (ik, Gk) is a named graph identified by an IRI, and the default graph G0 does not have a name.
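The definitions above can be illustrated with a minimal in-memory sketch in plain Python, representing a graph as a finite set of triples. The abbreviated IRIs below are hypothetical, loosely modeled on the running publication-network example.

```python
# A toy RDF graph as a set of (subject, predicate, object) triples.
# IRIs are abbreviated to short strings; a real store would use full IRIs.
G = {
    ("Article_1", "author", "John_Wayne"),
    ("Article_1", "title", "Title One"),
    ("John_Wayne", "name", "John Wayne"),
}

def subjects(g):
    """Nodes appearing in the subject position of some triple."""
    return {s for (s, p, o) in g}

def properties(g):
    """Distinct predicates, i.e., the edge labels of the graph."""
    return {p for (s, p, o) in g}

print(len(G))                    # a finite set of triples t1, ..., tn
print(sorted(properties(G)))
```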

SPARQL
SPARQL is the standard query language recommended by the W3C for RDF data. A basic SPARQL query consists of a SELECT clause, followed by the query variables that appear in the result set, and a WHERE clause, followed by graph patterns that are matched against the RDF graph the query is executed on. A SPARQL query can be one of four types: SELECT, ASK, DESCRIBE, or CONSTRUCT. The graph pattern that defines the query semantics can be one of the following types: Basic Graph Pattern (BGP), Basic Graph Pattern with Filter Constraints (FGP), Optional Graph Pattern (OGP), Union Graph Pattern (UGP) or Alternative Graph Pattern (AGP), and Group Graph Pattern (GGP). BGPs, FGPs, and OGPs consist of one or multiple triple patterns, while a GGP or UGP (aka AGP) consists of one or multiple BGPs, FGPs, or OGPs. Each part of a triple pattern (subject, predicate, and object) can be either a bound variable (a variable with a specified value) or an unbound variable. Basically, the result of a SPARQL query is obtained by replacing the variables of the query graph patterns with elements of the RDF graph. A SPARQL query can have solution modifiers: ORDER BY (sort by defined order), DISTINCT (remove all duplicates), REDUCED (remove some duplicates), OFFSET (skip the first specified number of solutions), and LIMIT (upper bound on the number of solutions). SPARQL BGPs fall into one of the four following categories:
Linear Shaped Pattern consists of a set of triple patterns that are linked together as subject-object joins via different unique join variables at the subject or object positions, i.e., the join variable is on the subject position in one triple pattern and on the object position in the other. Star Shaped Pattern consists of a set of triple patterns that are linked together via a single join variable at the subject or object position. Snowflake Shaped Pattern consists of several star shapes linked via different join variables at the subject or object positions of the triple patterns. Complex Structure is a composition of the above-mentioned query patterns.
Figure 2a shows a SPARQL query that returns the titles of articles written by John Wayne. The corresponding graph pattern of the SPARQL query is shown in Fig. 2b. The result is the set of ordered bindings of (?article, ?t) that render the query graph isomorphic to subgraphs in the data. Assuming the data are stored in a table D(s, p, o), the query can be answered by first decomposing it into three subqueries: q1 ≡ σ(p = name ∧ o = 'John Wayne')(D), q2 ≡ σ(p = author)(D), and q3 ≡ σ(p = title)(D). The subqueries are answered independently by scanning table D; then, their intermediate results are joined on the subject and object attributes: q1 ⋈(q1.s = q2.o) q2 ⋈(q2.s = q3.s) q3. By applying the query to the data in Fig. 1, we get (?article, ?t) ∈ {(Article_2, Title Two), (Article_3, Title Three)}.
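This decomposition can be executed directly in plain Python. The triple table D below is a small hypothetical fragment, chosen only to be consistent with the running example (it is not the exact data of Fig. 1).

```python
# Evaluating the example query by decomposition over a triple table D(s, p, o).
# The data is a hypothetical fragment consistent with the running example.
D = [
    ("Person_1", "name", "John Wayne"),
    ("Article_2", "author", "Person_1"),
    ("Article_3", "author", "Person_1"),
    ("Article_1", "author", "Person_2"),
    ("Article_1", "title", "Title One"),
    ("Article_2", "title", "Title Two"),
    ("Article_3", "title", "Title Three"),
]

# q1 = sigma(p = name AND o = 'John Wayne')(D)
q1 = [t for t in D if t[1] == "name" and t[2] == "John Wayne"]
# q2 = sigma(p = author)(D) and q3 = sigma(p = title)(D)
q2 = [t for t in D if t[1] == "author"]
q3 = [t for t in D if t[1] == "title"]

# Join q1 with q2 on q1.s = q2.o, then with q3 on q2.s = q3.s.
persons = {t[0] for t in q1}
articles = {t[0] for t in q2 if t[2] in persons}
result = sorted((t[0], t[2]) for t in q3 if t[0] in articles)
print(result)  # [('Article_2', 'Title Two'), ('Article_3', 'Title Three')]
```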

Hadoop & Spark
Hadoop is an open-source framework for distributed storage and processing of large datasets based on HDFS and the MapReduce paradigm. HDFS is a popular distributed file system due to its replication capability, which provides data redundancy, whereas MapReduce can be I/O intensive and is not suitable for interactive queries. To overcome this issue, a number of distributed computation engines based on an in-memory processing strategy have been introduced (e.g., Spark).
Spark is an in-memory cluster-computing framework that, unlike MapReduce, utilizes in-memory caching and an advanced directed acyclic graph (DAG) execution engine to create efficient query plans for data transformations. Spark runs programs up to 100 times faster in in-memory processing mode and 10 times faster in disk processing mode than Hadoop MapReduce. Spark has an SQL-like module called Spark SQL 14 that is used for structured data processing and allows running SQL-like queries on Spark data. Spark SQL includes a cost-based optimizer and code generation to make queries faster.

Big Data file formats
In this section, we discuss two state-of-the-art Big Data file formats, Parquet and ORC, which are relevant to this work. Parquet 15 is a column-oriented data storage format of the Apache Hadoop ecosystem. It stores data in a column-oriented way, where the values of each column are organized consecutively on disk, which enables better compression.
Parquet stores data organized into horizontal partitions called row groups. Within each row group, the data values are organized into column chunks, where each column chunk corresponds to a column in the dataset. A column chunk consists of multiple pages, each containing values for a particular column. Parquet stores metadata at all levels of the hierarchy (i.e., file, column chunk, and page). A sample Parquet file layout is shown in Fig. 3. This data format supports additional optimizations, including encodings (bit packing, run-length, and dictionary encoding) as well as compression algorithms like Snappy, 16 GZip, 17 LZO, 18 and so on. Parquet supports both flat and nested data. Parquet has a filter pushdown option that prunes extraneous data to reduce the number of data scans and reads when a query contains a filter expression; pruning data reduces the I/O, CPU, and network overhead and thereby optimizes query performance. Another advantage is that NULL values are not stored explicitly in Parquet; therefore, sparse columns cause little to no storage overhead.
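The two properties most relevant here, columnar storage and implicit NULLs, can be shown with a toy in-memory model. This is a conceptual sketch, not the actual Parquet binary format: each column is stored separately as a sparse mapping, and NULL entries are simply absent.

```python
# Toy model of the columnar idea behind Parquet (not the real binary format):
# each column is stored separately, and NULL entries are not stored at all,
# so sparse columns cost almost nothing.
rows = [
    {"subject": "Article_1", "title": "Title One", "note": None},
    {"subject": "Article_2", "title": "Title Two", "note": None},
    {"subject": "Article_3", "title": None,        "note": "draft"},
]

def to_columns(records):
    cols = {}
    for i, rec in enumerate(records):
        for name, value in rec.items():
            if value is not None:          # NULLs are simply not stored
                cols.setdefault(name, {})[i] = value
    return cols

cols = to_columns(rows)
# A filter on 'note' touches only that column: 1 stored value, not 3 rows.
hits = [i for i, v in cols["note"].items() if v == "draft"]
print(len(cols["note"]), hits)  # 1 [2]
```

A real Parquet reader goes further: with filter pushdown, row groups whose metadata cannot match the predicate are skipped entirely.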
ORC 19 (Optimized Row Columnar) is a columnar file format that provides a highly efficient way to store and access relational HDFS data. It stores collections of rows in one file, and within each collection the row data is stored in a columnar format, which allows parallel processing of row collections across a cluster. The columnar layout within each file is optimized for compression, and skipping of data/columns reduces read and decompression load. Its file structure consists of three parts: Stripe, Footer, and Postscript. ORC breaks the source file into sets of rows called stripes; the default stripe size is 250 MB, and this large stripe size enables efficient reads of columns from HDFS. The file footer contains the list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum. The Postscript contains the compression parameters and the size of the compressed footer. Each stripe in an ORC File has three parts: index data, row data, and a stripe footer. The index data include min and max values for each column and the row positions within each column; row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. The row data are composed of multiple streams per column, and they are used in table scans. The stripe footer contains a directory of stream locations. Within each stripe, the columns are stored as separate sections of the file, and an internal index is used to track each column's section of the data. This organization allows readers to efficiently omit the columns that are not required: only the column values required by each query are scanned and transferred on query execution. The ORC File supports sparse indexes, consisting of data statistics and position pointers. The data statistics are used in query optimization, and they can also be used to answer simple aggregation queries.
The ORC reader uses these statistics to avoid unnecessary data reads from HDFS. The position pointers are used to locate the index groups and stripes. The ORC File uses a two-level compression scheme. Each column can apply one of four types of encoding schemes based on its data type: (1) a sequence of bytes, (2) a run-length encoded sequence of bytes, (3) a run-length and delta encoded sequence of integers, and (4) a bit vector. Users can further ask the writer of an ORC File to compress streams of data with a general-purpose codec among ZLIB, 20 Snappy, and LZO. Metadata about the ORC data, such as the schema and compression format, are serialized into the file and made available to readers, and the reader translates the ORC File schema into appropriate data flow types when possible. Figure 4 illustrates the layout of the ORC File structure [26].
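The role of stripe-level min/max statistics can be demonstrated with a short sketch. This is a toy illustration of how an ORC-style reader skips stripes, not the real ORC implementation; the stripe size of 250 rows is an arbitrary toy value.

```python
# Toy illustration of stripe-level min/max statistics as used by ORC-style
# readers: a predicate is checked against each stripe's statistics before
# any row in the stripe is read.
data = list(range(1000))                  # one sorted integer column
stripe_rows = 250                         # rows per stripe (toy value)

stripes = []
for start in range(0, len(data), stripe_rows):
    chunk = data[start:start + stripe_rows]
    stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})

def scan_eq(stripes, target):
    """Read only stripes whose [min, max] range can contain the target."""
    stripes_read = 0
    hits = []
    for s in stripes:
        if s["min"] <= target <= s["max"]:
            stripes_read += 1
            hits += [v for v in s["rows"] if v == target]
    return hits, stripes_read

hits, stripes_read = scan_eq(stripes, 617)
print(hits, stripes_read, "of", len(stripes))  # [617] 1 of 4
```

Three of the four stripes are eliminated by their statistics alone, which is the source of the reduced read and decompression load described above.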

S3QLRDF architecture
In this section, we present the overall architecture of the S3QLRDF 21 system. It consists of three main components: the Data Loader (RDF data ingestion and partitioning using the PTP schema), the Query Translator (Spark SQL query generation from the SPARQL query), and the Query Evaluator (the Spark SQL query is evaluated directly by the Spark SQL engine) (Fig. 5). Data Loader S3QLRDF comes with a novel RDF data partitioning strategy called the PTP schema. We model and store an RDF graph following the concept of a Spark DataFrame, a distributed collection of an immutable set of records organized into named columns. RDF data is first loaded into HDFS in N-Triples format, and then Spark reads and partitions the data using the PTP schema, a modified and enhanced version of the well-known PT schema introduced by Wilkinson et al. [22]. We introduced the Modified Property Table in [24], a modified version of the traditional PT where multi-valued properties are stored in a single cell using a nested data structure (e.g., Array). We briefly present the Modified Property Table schema followed by our proposed PTP schema, an extension of the Modified Property Table approach. We use RDF in N-Triples format for the data storage layout. Initially, we create a TT (Triple Table) with three columns where each row comprises an RDF statement, i.e., a triple (subject, property, object). Then we create the PT (Property Table) with the following schema:

PT(subject, property1, …, propertyn)
where n is the total number of distinct properties present in a particular RDF dataset. Each RDF subject is stored in the subject column, and its object values reside in the corresponding property columns. Table 2 shows the Modified Property Table obtained from the Triple Table for the RDF graph shown in Fig. 1.
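The construction of the Modified Property Table from the Triple Table can be sketched as a pivot in plain Python. The triples below are a hypothetical fragment, not the paper's exact Table 2; the nested-list handling mirrors the paper's idea of storing multi-valued properties in a single cell.

```python
# Sketch of building the Modified Property Table from the Triple Table:
# one row per subject, one column per distinct property, and multi-valued
# properties collected into a list (the nested-structure idea).
triples = [
    ("Article_1", "title", "Title One"),
    ("Article_1", "author", "Person_1"),
    ("Article_1", "author", "Person_2"),   # multi-valued property
    ("Person_1", "name", "John Wayne"),
]

def to_property_table(tt):
    pt = {}
    for s, p, o in tt:
        row = pt.setdefault(s, {})
        if p in row:                        # second value: promote to a list
            if not isinstance(row[p], list):
                row[p] = [row[p]]
            row[p].append(o)
        else:
            row[p] = o
    return pt

pt = to_property_table(triples)
print(pt["Article_1"]["author"])   # ['Person_1', 'Person_2']
print(pt["Person_1"].get("title")) # None -> a NULL cell in the wide table
```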
Next, we partition the Modified Property Table into multiple tables based on distinct properties present in the RDF dataset to devise our proposed PTP schema. Each of the PTP tables contains only those subjects that have a value for the particular property on which that partition is based, and we use the name of that particular property as the partitioned table name. Table 3 shows the proposed RDF data layout that is obtained from partitioning the whole Modified Property Table (Table 2).
An RDF dataset can have many properties, and most subjects use only a small subset of them; therefore, these tables will be sparse, containing NULL values. We decided to use the general-purpose Parquet columnar storage format to materialize the PTP tables in HDFS because Parquet does not store NULL values explicitly, so sparse columns cause little to no storage overhead. We also keep a statistics file that stores the actual size (number of tuples) of each PTP table along with the names of the multi-valued attributes, so that these statistics can be used for query generation.
The goal of the PTP approach is to reduce the number of tuples to scan and the amount of I/O required for a query. Since each PTP table is a fragment of the Property Table, it is possible to minimize unnecessary I/O and comparisons during join execution and thereby reduce in-memory consumption. Spark is an in-memory system, and memory is typically much more limited than HDFS disk space, so saving this resource is important for scalability. Another advantage of the PTP approach is that star patterns can be answered entirely without the need for a join.
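The PTP partitioning step itself can be sketched as a horizontal split of the Property Table. This is a simplified in-memory sketch with hypothetical data: table p holds exactly the rows whose p column is non-NULL, each partition keeps the full schema, and tuple counts are recorded as the statistics.

```python
# Sketch of the PTP idea: horizontally partition the Property Table so that
# table 'p' holds exactly the subjects with a non-NULL value for property p,
# keeping the full schema, and record per-table tuple counts as statistics.
pt = {
    "Article_1": {"title": "Title One", "author": ["Person_1"]},
    "Person_1": {"name": "John Wayne"},
}
properties = {"title", "author", "name"}

ptp = {
    p: {s: row for s, row in pt.items() if row.get(p) is not None}
    for p in properties
}
stats = {p: len(rows) for p, rows in ptp.items()}

print(sorted(ptp["title"]))  # ['Article_1'] -- only subjects with a title
print(stats["name"])         # 1
```

Because each partition is named after its property, a triple pattern with a known property can be routed to a single small table instead of the full Property Table.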
Query Translator The query translator generates the equivalent Spark SQL expression from a SPARQL query based on the PTP schema, using the statistics file generated during the PTP table creation process. Every SPARQL query defines a graph pattern to be matched against an RDF graph. A triple pattern is the basic building block of a SPARQL query, and a Basic Graph Pattern (BGP) is simply the concatenation of a set of triple patterns using AND (.). Since a BGP represents the core of a SPARQL query, we mainly focus on the BGP fragment. A triple group (tg) consists of a set of triple patterns having the same subject in a BGP, so a BGP (bgp) can have more than one distinct triple group.
Consider the following BGP: bgp = { ?x type ?p . ?x name "John Wayne" . ?y type "Article" . ?y author ?x . ?y title ?t . ?z website_of ?x }. Grouping by subject yields three triple groups: tg1 = { ?x type ?p . ?x name "John Wayne" }, tg2 = { ?y type "Article" . ?y author ?x . ?y title ?t }, and tg3 = { ?z website_of ?x }. The number of bound values per triple group is (tg1 → 1, tg2 → 1, tg3 → 0). The basic concept is that each triple group can be answered by a subquery without a join, where the variables occurring in the triple group define the columns to be selected and the fixed values are used as conditions in the WHERE clause. Variables are mapped by subject and property based on their position in the triple pattern: a subject variable is mapped to the subject column, and an object variable is mapped to its corresponding property column (a multi-valued property is labeled with a special extension). It is worth mentioning here that Spark uses the LATERAL VIEW EXPLODE function to flatten a complex column (multi-valued property). This variable mapping is used to name the output columns such that an outer query can easily refer to them. The table for a triple group is selected from the properties belonging to that triple group.
We also add a NOT NULL test on the property (a multi-valued property with a special extension) in the WHERE clause if the corresponding object is a variable in the triple pattern. This is not necessary for variables in the subject position, as the subject column does not contain NULL values. Using the statistics file, the triple groups are then ranked: tg1 has a smaller number of tuples than tg2, so tg1 is given the highest rank and is executed first. Of the remaining two triple groups, tg2 and tg3, tg3 has a lower number of tuples than tg2, but the number of bound values of tg2 is higher than that of tg3. Since we give higher priority to the number of bound values than to the number of tuples of the selected table, tg2 is ranked higher than tg3. Finally, the triple groups are arranged such that there is at least one common variable between a triple group and any of its higher-ranked triple group(s), to avoid cross joins when processing them in that order. So the final ordering (ranking) among the three triple groups is tg1 → tg2 → tg3.
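The ordering heuristic just described can be sketched in plain Python. The bound-value and tuple counts below are hypothetical stand-ins for the running example's statistics, and the greedy connectivity pass is one simple way to realize the "shared variable with a higher-ranked group" constraint.

```python
# Sketch of the triple-group ordering heuristic: rank by number of bound
# values (descending), break ties by the tuple count of the selected table
# (ascending), then reorder so every group shares a variable with an
# earlier one, avoiding cross joins. Counts are hypothetical.
groups = {
    "tg1": {"bound": 1, "tuples": 100, "vars": {"?x", "?p"}},
    "tg2": {"bound": 1, "tuples": 5000, "vars": {"?y", "?x", "?t"}},
    "tg3": {"bound": 0, "tuples": 50, "vars": {"?z", "?x"}},
}

ranked = sorted(groups, key=lambda g: (-groups[g]["bound"], groups[g]["tuples"]))

# Greedily emit groups, requiring a shared variable with what was emitted.
order, seen = [], set()
pending = list(ranked)
while pending:
    for g in pending:
        if not order or groups[g]["vars"] & seen:
            order.append(g)
            seen |= groups[g]["vars"]
            pending.remove(g)
            break
    else:                          # no connected group left: accept a cross join
        order.append(pending.pop(0))

print(order)  # ['tg1', 'tg2', 'tg3']
```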
The overall SPARQL translation process can be described as follows. The subquery sq1 for tg1 is: SELECT subject, type FROM name WHERE type IS NOT NULL AND name = 'John Wayne'. The author is a multi-valued property, which is identified from the statistics file; thus, the author column is flattened by the LATERAL VIEW EXPLODE function, and we rename that column with the extension _lev. The input SPARQL query can therefore be translated to an equivalent Spark SQL query by mapping its operators to the equivalent Spark SQL keywords. A FILTER expression in SPARQL is mapped to equivalent conditions in Spark SQL by adapting the SPARQL syntax to the syntax of SQL; these conditions are then added to the WHERE clause of the corresponding (sub)query in the Spark SQL statement. The OPTIONAL pattern is mapped to a LEFT OUTER JOIN, and UNION, LIMIT, ORDER BY, and DISTINCT are mapped directly using their equivalent clauses in the SQL dialect of Spark. Finally, the SPARQL query is fed to the Spark engine as an equivalent Spark SQL query.
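The operator correspondence and the construction of a subquery like sq1 can be sketched as follows. This is a hedged illustration of the mapping described above, not S3QLRDF's actual translator; the helper `translate_tg1` and its parameters are hypothetical.

```python
# The SPARQL-to-Spark-SQL operator mapping described above, as a lookup
# table (a sketch of the correspondence, not S3QLRDF's translator).
SPARQL_TO_SQL = {
    "OPTIONAL": "LEFT OUTER JOIN",
    "UNION": "UNION",
    "FILTER": "WHERE",      # merged into the (sub)query's conditions
    "LIMIT": "LIMIT",
    "ORDER BY": "ORDER BY",
    "DISTINCT": "DISTINCT",
}

def translate_tg1(table, bound_prop, bound_value, var_props):
    """Build a subquery like sq1: select the subject plus the properties
    bound to variables, adding NOT NULL tests for variable objects."""
    conds = [f"{p} IS NOT NULL" for p in var_props]
    conds.append(f"{bound_prop} = '{bound_value}'")
    cols = ", ".join(["subject"] + var_props)
    return f"SELECT {cols} FROM {table} WHERE " + " AND ".join(conds)

sq1 = translate_tg1("name", "name", "John Wayne", ["type"])
print(sq1)
# SELECT subject, type FROM name WHERE type IS NOT NULL AND name = 'John Wayne'
```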
Query Executor In this process, the Spark SQL query created by the query translator is directly evaluated into the Spark SQL engine through the DataFrame interface.

Evaluation
In this section, we present a comparative performance evaluation of our RDF management system S3QLRDF against other state-of-the-art Hadoop-based RDF querying approaches, namely CliqueSquare, S2RDF, SPARQLGX, and Rya, as they are the most similar to our system. The experimental setup and a discussion of the results are presented.

Benchmark queries
For the performance evaluation of our RDF management solutions, we utilize two synthetic datasets and one real dataset, as shown in Table 4. The synthetic datasets are LUBM with the number of universities set to 1000, 5000, and 10,000, and WatDiv with scale factors of 1000, 5000, and 10,000.
LUBM was proposed in 2005 with a data generator and was originally designed to test the inference capabilities of Semantic Web repositories. We use LUBM to measure the soundness (correctness, by analyzing how many of the returned answers are correct) and completeness (how many of the correct answers are returned) of conjunctive SPARQL queries. It provides 14 predefined test queries, but many of these queries have simple structures and are quite similar to each other. Therefore, we selected Q1, Q2, Q4, Q8, Q12, and Q14 from the LUBM test query set based on their structure and selectivity. Q1 has a star-shaped pattern with high selectivity, and it carries a large input; Q2 has a complex pattern with large intermediate results; Q4 is a simple, highly selective star query with a small result set; Q8 is the most complex snowflake query of the LUBM benchmark; Q12 is a simple selective query, which, similar to Q1, Q4, and Q8, has a constant number of solutions regardless of the dataset size; and Q14 is the most unselective query, with a large result set. Q2 and Q14 have numbers of solutions that grow proportionally to the dataset size. The University of Waterloo introduced WatDiv in 2014. WatDiv has a data generator as well as a query generator, and it was designed to cover both structural and data-driven features of four different types of query shapes, namely linear, star, snowflake, and complex SPARQL queries. We want to assess the impact of different query patterns, which WatDiv is designed for. The WatDiv basic query set contains queries of varying shape and selectivity to model different scenarios; the queries are grouped into linear (L), star (S), snowflake-shaped (F), and complex (C) subsets. The real-life dataset is YAGO2, a semantic knowledge base derived from Wikipedia, WordNet, and GeoNames. Since YAGO2 does not provide benchmark queries, we have created a set of representative test queries (Y1-Y5) with different structures and complexities, comparable to the LUBM and WatDiv query sets.
Regarding the LUBM queries, we modified some of the original queries because executing them without the inferred triples returns an empty result set. All YAGO2 and modified LUBM queries are listed in Appendices 1 and 2, respectively.
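To make the role of query shape concrete, the following plain-Python sketch shows how a star-shaped basic graph pattern could be translated into a single-table SQL query over a PTP table. The `ptp_<property>` naming and the translation logic are illustrative assumptions for exposition, not the actual S3QLRDF implementation.

```python
# Hypothetical translation of a star-shaped BGP into a single-table SQL
# query over a PTP table. Because each PTP table contains every subject
# with a non-null value for its property, together with all other
# property columns, a star query needs no join at all.
def star_bgp_to_sql(triple_patterns):
    """triple_patterns: list of (subject_var, property, object) sharing one
    subject variable (star shape). Objects starting with '?' are variables;
    anything else is treated as a constant."""
    # Anchor on any property of the star; its PTP table carries all columns.
    anchor_prop = triple_patterns[0][1]
    select_cols, where = [], []
    for _, prop, obj in triple_patterns:
        if obj.startswith("?"):
            select_cols.append(f"{prop} AS {obj[1:]}")
            where.append(f"{prop} IS NOT NULL")   # variable must be bound
        else:
            where.append(f"{prop} = '{obj}'")
    return (f"SELECT subject, {', '.join(select_cols)} "
            f"FROM ptp_{anchor_prop} WHERE {' AND '.join(where)}")

sql = star_bgp_to_sql([("?x", "advisor", "?y"),
                       ("?x", "type", "GraduateStudent")])
```

The resulting statement selects from a single hypothetical table, `ptp_advisor`, which is why star queries such as LUBM Q1 and Q4 are cheap to evaluate under this kind of schema.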

Cluster configuration
To conduct the comparative analysis of distributed RDF data management solutions, we constructed a seven-node cluster (1 master and 6 workers) on the Google Cloud Platform. Each node has 32 vCPUs (Intel Xeon CPU @ 2.30 GHz), 120 GB of memory, and 1 TB of hard disk space, running Ubuntu 16.04.3 LTS. Hadoop 2.7.7 and Spark 2.4.4 are configured on all nodes, where each Spark worker is given 100 GB of memory and 30 cores. In addition, Parquet filter pushdown is enabled and broadcast joins in Spark SQL are disabled.
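For reproducibility, the Spark settings above can be captured as configuration properties. The keys below are standard Spark options; the values mirror our cluster setup:

```python
# Spark configuration mirroring the cluster setup described above.
spark_conf = {
    "spark.executor.memory": "100g",
    "spark.executor.cores": "30",
    # enable Parquet predicate (filter) pushdown
    "spark.sql.parquet.filterPushdown": "true",
    # a threshold of -1 disables broadcast (map-side) joins in Spark SQL
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}
```

These properties can be supplied via spark-defaults.conf, `--conf` flags to spark-submit, or a SparkSession builder.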

Empirical comparison
We present an empirical comparison of our prototype S3QLRDF system with four other state-of-the-art open-source distributed RDF management systems: CliqueSquare, S2RDF, SPARQLGX, and Rya. All of these systems are Hadoop-based and take advantage of scalable, fault-tolerant distributed processing based on Google's distributed file system and MapReduce parallel model. The store sizes and data loading times are listed in Table 5.
During the data loading phase, we parse the data to replace all URIs with their corresponding namespace prefixes and remove data type information from RDF objects to convert them into primitive types. We do not consider the data import into HDFS as part of the preprocessing phase. We conduct a performance evaluation of S3QLRDF against the competitor systems based on three metrics: preprocessing (loading) times, store sizes, and query execution times. All measurements are averaged over four runs.
S3QLRDF has two data loading options: (1) drop all columns whose entries are all empty (NULL), and (2) keep all columns even if all entries are empty (NULL), which we call light-load. Light-load requires much less time than the first option to store RDF data in the PTP schema. We found that the first loading option does not noticeably reduce storage space consumption or query execution times compared to light-load in our cluster configuration. Therefore, we report results with the light-load preprocessing option for S3QLRDF. S3QLRDF has a two-step data loading process: the first step creates the Property Table, and the second creates the PTP tables. We do not include the Property Table in the query runtime results because it does not participate in query evaluation. Since Spark SQL provides the cacheTable functionality to cache tables in memory, we report query execution times both with and without caching the PTP tables, along with the arithmetic mean runtimes (AM). S2RDF has two preprocessing modes, VP and ExtVP, so we include both in our results. We indicate "TimeOut" whenever query processing does not complete within 8 hours and "Fail" whenever a query is not supported by the system or the system crashes before the timeout. Fig. 6 shows the storage space distribution of the LUBM (avg. of 1000, 5000, and 10,000 universities), WatDiv (avg. of SF 1000, 5000, and 10,000), and YAGO2 datasets.
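A minimal plain-Python sketch of the two-step load (Property Table, then one PTP table per property) may help illustrate the light-load option. This toy code is for exposition only and does not reflect S3QLRDF's actual Spark implementation:

```python
# Toy illustration of the two-step load: triples -> Property Table ->
# one PTP table per distinct property. Under light-load, every column is
# kept in each PTP table, even if all of its entries are NULL.
def build_property_table(triples):
    props = sorted({p for _, p, _ in triples})
    rows = {}
    for s, p, o in triples:
        rows.setdefault(s, {q: None for q in props})[p] = o
    return props, rows

def build_ptp(props, rows):
    # PTP table for property p: all subjects with a non-null value for p,
    # carrying the full row (all columns retained under light-load).
    return {p: {s: r for s, r in rows.items() if r[p] is not None}
            for p in props}

triples = [("s1", "type", "Student"), ("s1", "advisor", "prof1"),
           ("s2", "type", "Course")]
props, pt = build_property_table(triples)
ptp = build_ptp(props, pt)
```

Here `ptp["advisor"]` contains only subject `s1` but still carries its `type` column, which is what lets a star query be answered from a single PTP table.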
From Table 5, we can see that S2RDF-VP and SPARQLGX have low space overhead; on the other hand, CliqueSquare and S2RDF-ExtVP need more storage space due to their underlying data storage layouts.
From Fig. 7, we notice that CliqueSquare, S2RDF-ExtVP, and Rya need more time to load data compared with S2RDF-VP and SPARQLGX because of their preprocessing methods. The lack of an in-memory data processing framework in CliqueSquare and Rya causes high overhead. S2RDF-ExtVP incurs significantly higher overhead than S2RDF-VP because of its additional pre-computation phases. Although YAGO2 is the smallest dataset, S2RDF-ExtVP needs more preprocessing time with YAGO2 due to its large number of predicates. We observe that the data loading time of S2RDF-ExtVP depends not only on the size of the dataset but also on the number of predicates. S3QLRDF has a moderate overhead in terms of data loading time and storage space compared with the other systems.
The performance comparison for LUBM 10000 is illustrated in Fig. 8 on a log scale, while absolute runtimes are given in Table 6. We can observe that S3QLRDF outperforms all other systems by up to an order of magnitude on average (arithmetic mean). Q1 and Q4 are the most selective queries, returning only a few results, and can be answered by S3QLRDF in 5200 ms or less. These queries define a star-shaped pattern, which can be answered very efficiently with a single PTP table of S3QLRDF. For the most unselective query, Q14, S3QLRDF also outperforms all other systems. Q2, Q8, and Q12 define complex patterns, where Q8 and Q12 produce results of constant size as the dataset grows, while the intermediate result set of Q2 increases with the input size. For these queries as well, the runtimes of S3QLRDF are significantly faster than those of all other systems, remaining below 9000 ms. If we use the cacheTable functionality of Spark SQL to cache the PTP tables in memory, which we call S3QLRDF-CT, we achieve an order of magnitude faster response times, even though caching the tables incurs a small overhead. We also report the number of queries executed per hour (Query/hr) in Table 6.
Fig. 9 compares the different systems on the largest WatDiv dataset (SF 10,000); the corresponding AM runtimes are listed in Table 7. For WatDiv, S3QLRDF and S3QLRDF-CT show competitive runtime performance across all query categories as the dataset size increases. In Table 7, we report Query/hr under all query categories for all competitors. Again, S3QLRDF and S3QLRDF-CT outperform all of their competitors by an order of magnitude in terms of Query/hr.
Fig. 10 illustrates the execution times for the YAGO2 queries of all compared systems, while absolute runtimes and Query/hr are given in Table 8. CliqueSquare fails to execute the YAGO2 queries; therefore, we did not include it in the YAGO2 query evaluation.
We can observe that S3QLRDF and S3QLRDF-CT outperform SPARQLGX and Rya by an order of magnitude in runtime on all queries. S2RDF has faster query response times for Y1 and Y2 than S3QLRDF because of the materialized join-reduction tables of ExtVP and because S3QLRDF incurs a small overhead while flattening complex columns. Since a number of complex columns must be flattened in Y1 and Y2, S3QLRDF has slower response times than S2RDF for these queries; however, in terms of average runtime and Query/hr, S3QLRDF outperforms all of its competitors, including S2RDF.
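For reference, the Query/hr throughput metric can be derived from the arithmetic-mean runtime, assuming queries are executed back to back:

```python
# Query/hr from the arithmetic-mean (AM) runtime in milliseconds,
# assuming queries are executed sequentially with no gaps.
def queries_per_hour(am_runtime_ms):
    return 3_600_000 / am_runtime_ms

# e.g., an AM runtime of 9,000 ms corresponds to 400 queries per hour
```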
In this section, we conducted a comparative performance evaluation of our system on a Hadoop cluster against the state-of-the-art systems CliqueSquare, S2RDF, SPARQLGX, and Rya, using queries of different shapes and complexities over three datasets of up to 1.4 billion triples. Our proposed S3QLRDF system outperforms state-of-the-art distributed SPARQL query processors by an order of magnitude on average for all query shapes.

Benchmarking S3QLRDF under Columnar File Formats
Columnar file formats have well-known advantages: they improve storage efficiency through effective data compression and help achieve significant performance gains by moving only the relevant portions of data into memory during query processing. Columnar storage formats have been available for storing data in HDFS for over a decade. Currently, Parquet and ORC are two of the most popular formats for HDFS.

Relational data management using Parquet and ORC
Relational data management, including analysis, is one of the most popular data processing paradigms. Modern cloud-based relational data processing systems typically do not manage their own storage; they leverage a variety of external file formats to store and access data. Over the last decade, external file formats such as Parquet and ORC have been developed to store large volumes of relational data in the cloud, and high-performance networking and storage devices are used pervasively to process this data in Big Data frameworks like Spark and Hadoop. The performance of a file format, in terms of storage efficiency and data access rate, plays an important role in data management. Parquet and ORC are columnar storage formats in the Hadoop ecosystem. They store data using varied encodings, column-wise compression, data-type-aware compression, and predicate pushdown. Typically, better compression ratios and the ability to skip blocks of data mean reading fewer bytes from HDFS, resulting in better query performance. We use the Parquet and ORC file formats as the storage backend for our S3QLRDF system to measure RDF data storage efficiency, loading, and query execution performance.
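As a concrete illustration, a PTP table could be materialized in either columnar format via Spark SQL DDL. The helper below builds such statements; the table and column names are hypothetical, and typing every column as STRING is a simplification:

```python
# Build a Spark SQL CREATE TABLE statement for a hypothetical PTP table.
# In Spark SQL, the `USING parquet` / `USING orc` clause selects the
# columnar storage backend for the table.
def create_ptp_ddl(prop, columns, fmt):
    assert fmt in ("parquet", "orc")
    cols = ", ".join(f"{c} STRING" for c in columns)
    return f"CREATE TABLE ptp_{prop} ({cols}) USING {fmt}"

ddl = create_ptp_ddl("advisor", ["subject", "advisor", "type"], "orc")
```

The same schema can thus be benchmarked under both formats by changing only the `fmt` argument.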

Empirical comparison
We present an empirical comparison between the Parquet and ORC file formats using the S3QLRDF system with the PTP schema. We performed our evaluation on a small cluster of 6 machines (1 master and 5 workers) using AWS EC2 instances. Each machine is equipped with 64 GB of memory, 1 TB of disk space, and an 8-core Intel Xeon Platinum 8175M CPU @ 2.50 GHz. The cluster runs Hadoop 2.7.7, Hive 2.3.6, and Spark 2.4.4 on Ubuntu 16.04 LTS. The resource manager, YARN, uses 240 GB of memory and 40 virtual cores. In our cluster configuration, a Spark partition size equals the default HDFS block size (128 MB). We kept the default settings for both the Parquet and ORC file formats, with filter pushdown enabled.
The experiments are conducted on a synthetic dataset, WatDiv, with around 109 million triples and 86 predicates, and a real-world dataset, a dump of YAGO (Yago2s 2.5.3), with a total of 245 million triples and 104 predicates. Creating the PT (Property Table) is a prerequisite for creating the PTP tables; therefore, we report the total time to create the PT and the PTP tables as the data loading time. Dataset loading times and HDFS sizes for the PTP schema with the Parquet and ORC file formats are reported in Table 9. From Table 9, we can see that ORC outperforms Parquet in terms of storage space and data loading time. The two formats physically organize data in different ways, which is why their total sizes differ. Figures 11 and 12 present resource usage (CPU and RAM) and the total number of bytes read from and written to HDFS during the data loading process. CPU and RAM usage are slightly lower with ORC than with Parquet. Similarly, S3QLRDF reads and writes less data when working with ORC than with Parquet.
WatDiv comes with a set of 20 predefined query templates, called the Basic Testing Use Case, that can be grouped into four categories according to their shape: complex (C), snowflake (F), star (S), and linear (L). Each query from the basic query set is evaluated four times to obtain the average run time, and the run times are then aggregated by query shape. YAGO does not provide benchmark queries, so we created four representative test queries (C, F, S, and L) based on the categories of the WatDiv basic query set, where C, F, S, and L represent complex, snowflake, star, and linear-shaped queries, respectively. We submitted one query at a time as a single Spark application in a cold-start scenario, when memory was free. The run times reported for each query are the average of four executions. Since Spark SQL has the cacheTable functionality to cache tables in memory before execution, we report average query execution times both with caching (CT) and without caching (W/O-CT) the PTP tables. We also report query run times including caching times (T-CT) to investigate how caching the tables affects overall query runtimes. The performance comparison between the Parquet and ORC storage formats under the PTP schema, in terms of query execution times for WatDiv and YAGO, is shown in Tables 10 and 11, respectively. The first observation is that ORC with CT had the best query performance for all WatDiv query types. For YAGO, ORC with CT shows the best performance except for the C and S query types, for which it is not significantly worse. The CT figures exclude the time to cache the PTP tables in memory; when caching times are included (T-CT), ORC performs slightly worse for the majority of query types. We also observe that Parquet without the cacheTable method (W/O-CT) shows reasonably good performance for all query types. For the future experiments in Sect. 7, we will therefore use Parquet without the cacheTable method to measure query runtimes.
From the above discussion, we can conclude that caching tables in memory adds some overhead to the total query runtimes; therefore, the cacheTable method is recommended only for batch execution of queries. We demonstrate query performance with the cacheTable method for batch execution of queries in Sect. 5.3.
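The trade-off can be sketched with a back-of-the-envelope cost model: caching adds a one-time cost but lowers every subsequent run, so it pays off only over batches. The caching cost, per-query time, and speedup factor below are illustrative numbers, not measurements from our experiments:

```python
# Simple cost model for cacheTable: a one-time caching cost (cache_ms)
# plus each query running `speedup` times faster against cached data.
def batch_runtime(n_queries, per_query_ms, cache_ms=0.0, speedup=1.0):
    return cache_ms + n_queries * per_query_ms / speedup

# One query: the caching cost dominates, so caching loses.
single_no_cache = batch_runtime(1, 1000)                            # 1000 ms
single_cached = batch_runtime(1, 1000, cache_ms=3000, speedup=10)   # 3100 ms

# Twenty queries in a batch: the one-time cost is amortized, caching wins.
batch_no_cache = batch_runtime(20, 1000)                            # 20000 ms
batch_cached = batch_runtime(20, 1000, cache_ms=3000, speedup=10)   # 5000 ms
```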

Empirical evaluation of spark-based RDF management systems
Over the last few years, several systems have been designed to exploit the Spark framework for building scalable RDF processing engines, such as S3QLRDF, S2RDF, SPARQLGX, and PRoST. These systems load data as triples and apply a simple partitioning technique, such as vertical partitioning or property table partitioning, to the raw triples for further processing. In such systems, the RDD API or Spark SQL is used to answer SPARQL queries.

Benchmarked SPARQL evaluators
In this section, we present a brief overview of the Spark-based RDF management systems S3QLRDF, S2RDF, SPARQLGX, and PRoST. Table 12 shows the RDF data partitioning techniques used in these state-of-the-art Spark-based systems.
The Spark-based systems listed in Table 12 use one or a combination of relational partitioning techniques: S3QLRDF uses the PTP schema to devise its RDF storage layout, S2RDF makes use of both the VP and ExtVP approaches, SPARQLGX uses only the VP approach, and PRoST combines VP with the Wide Property Table (WPT) [19] for its storage layout. Table 13 lists the RDF query processing methods used in the Spark-based systems, based on their Spark data abstraction.
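To contrast the layouts in Table 12, the following plain-Python sketch builds a VP layout and gives a rough join-count model for a star-shaped query with k triple patterns. This simplified model is our own illustration, not code from any of the benchmarked systems:

```python
# Vertical partitioning (VP): one (subject, object) table per predicate.
def vertical_partition(triples):
    tables = {}
    for s, p, o in triples:
        tables.setdefault(p, []).append((s, o))
    return tables

# Rough join counts for a k-pattern star query: VP needs one table per
# pattern, hence k-1 joins on the shared subject, whereas WPT and PTP keep
# all predicates of a subject in one row, so no join is required.
def star_join_count(k, layout):
    return {"VP": k - 1, "WPT": 0, "PTP": 0}[layout]

tables = vertical_partition([("s1", "type", "Student"),
                             ("s1", "advisor", "prof1")])
```

This is one intuition for why systems built purely on VP, such as SPARQLGX, tend to pay more for star-heavy workloads than property-table-based layouts.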
For the performance evaluation of the Spark-based RDF management solutions, we utilize two real datasets, YAGO (Yago2s 2.5.3) and DBLP, as shown in Table 14. YAGO is a semantic knowledge base derived from Wikipedia, WordNet, and GeoNames, while the DBLP Computer Science Bibliography provides bibliographic information on computer science journals and proceedings. Neither YAGO nor DBLP provides benchmark queries. Thus, we created four representative test queries (C, F, S, and L) for each dataset, of varying shapes (complex, snowflake, star, and linear, respectively), to model different scenarios. These query patterns directly affect overall query performance. All YAGO and DBLP queries are listed in Appendices 3 and 4, respectively. We keep the same cluster configuration mentioned in Sect. 6.2.

Experimental results
We present an empirical comparison of four open-source Spark-based state-of-the-art systems, S3QLRDF, S2RDF, SPARQLGX, and PRoST, on the real datasets YAGO and DBLP. The store sizes and data loading times are listed in Table 15.
From Table 15, we can see that SPARQLGX has low space overhead; on the other hand, S2RDF needs more storage space due to its underlying data layout. SPARQLGX also has low preprocessing overhead compared to the other systems. S2RDF needs more preprocessing time with YAGO due to its large number of predicates. We observe that the data loading time of S2RDF depends not only on the size of the dataset but also on the number of predicates, which entails extensive precomputation and high loading times; therefore, this system is not suitable for datasets with a large number of properties. S3QLRDF has a moderate data loading overhead compared with the other systems. Figures 13 and 14 present resource usage (CPU and RAM) and the total number of bytes read from and written to HDFS during the data loading phase. SPARQLGX has the highest CPU utilization while reading and writing the least data for both the YAGO and DBLP datasets. On the other hand, S2RDF has the highest RAM usage of all systems. From the above discussion, we can conclude that S2RDF is the costliest system for the cluster because of its high data loading times and RAM usage.
We conduct a query performance evaluation of the Spark-based RDF management systems based on query execution times and cluster resource utilization. We report query run times including caching times for those systems that use the cacheTable functionality to cache tables in memory. Not all systems can execute a set of queries within the same Spark application to take advantage of in-memory data left by a previously executed query; thus, we submitted one query at a time as a single Spark application to make the comparison fair across all systems. All measurements are averaged over four runs.
Since YAGO does not provide benchmark queries, we use the YAGO test queries C, F, S, and L listed in Appendix 3 to benchmark the performance of the different Spark-based systems. Figure 15 illustrates the performance comparison for YAGO. S3QLRDF shows the best performance, except for queries C and S, for which it is not significantly worse. S3QLRDF incurs a small overhead while flattening complex columns, and since a number of complex columns must be flattened in C and S, S3QLRDF has slower response times for these queries than S2RDF, which has the fastest response times for C and S among all systems thanks to the materialized join reduction of its ExtVP tables. S2RDF thus trades query performance against disk space and loading time. SPARQLGX has the poorest runtimes of all systems across all queries. From Fig. 16, we can see that the number of bytes read during query evaluation is lowest in S3QLRDF for all queries except C. Fig. 17 also shows that SPARQLGX, which is inexpensive in terms of data loading time, becomes costly in cluster resource utilization (CPU and RAM) for evaluating most of the queries, except query F. Like YAGO, DBLP does not have benchmark queries; therefore, we use the DBLP test queries C, F, S, and L listed in Appendix 4. Figure 18 illustrates the execution times for the DBLP queries of all compared systems.
We can observe that S3QLRDF outperforms its competitors on runtime in most of the queries, except F, where PRoST shows the best performance. As with YAGO, SPARQLGX again shows the poorest query performance of all systems. We can also observe from Fig. 19 that S3QLRDF reads relatively few bytes to answer queries C and F; on the other hand, PRoST reads fewer bytes during the evaluation of queries S and L. The average cluster CPU usage is high for S2RDF and SPARQLGX, while the average RAM usage is almost identical across all systems (Fig. 20).
In this section, we conducted an empirical evaluation of four state-of-the-art Spark-based RDF management solutions based on common criteria: preprocessing (loading) times, store sizes, query execution times, and cluster resource utilization. All of these systems use different data partitioning techniques to devise their relational storage schemas for an RDF triplestore on top of Hadoop. The aim of using Spark with Hadoop is to provide efficient RDF management systems that improve query performance by exploiting data parallelization. Moreover, data partitioning plays a vital role in efficient query processing and has a large impact on query performance. In this paper, we focus on two key elements of distributed systems for efficient SPARQL query processing: data parallelization and data partitioning. We propose a novel RDF data partitioning schema called Property Table Partitioning, and we use the in-memory framework Spark to exploit data parallelization while utilizing the inherent scalability of Hadoop for the distributed RDF management system. We also demonstrate how columnar storage formats, such as Parquet and ORC, affect the overall performance of distributed RDF storage and SPARQL querying. We presented S3QLRDF, a distributed RDF management solution based on the Property Table Partitioning schema, built on top of Spark.
Based on our extensive evaluation of S3QLRDF against other open-source state-of-the-art systems on real and synthetic RDF datasets, we conclude that S3QLRDF improves SPARQL query response times and outperforms state-of-the-art distributed RDF management systems by up to an order of magnitude. For future work, we consider further improvements to S3QLRDF's query performance, especially for queries that involve flattening a number of complex columns; we aim to generate better query plans for complex properties with less expensive retrieval.