2. Literature Survey
When data grows large, query processing with acceptable performance becomes a recurring concern. Scale independence addresses this by bounding the amount of data needed to answer a query, regardless of the overall size of the dataset and the updates applied to it. Several approaches to achieving scale independence in big data, independent of size and range, have been suggested. Most database processing methods choose views that speed up query execution on average; however, such average-case performance is still limited by the scale of the underlying database.
The method proposed in [1] describes a scale-independent view selection and maintenance scheme that employs novel static analysis techniques to ensure that newly generated views never become scaling bottlenecks. The static analysis enforces invariants on both the size of each view and the amount of work required for its incremental maintenance. A drawback, however, is that the cost of updating the collection of incrementally maintained materialized views (IMVs) grows with the size of the application. The paper also discusses how to find hotspots and mitigate performance loss using schema analysis and query rewriting, combining load balancing with concurrent execution to limit hotspots.
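To make the size and maintenance invariants concrete, the following minimal Python sketch maintains a per-key materialized view under a declared bound and rejects updates that would violate it. The class and bound are hypothetical illustrations, not the mechanism of [1].

```python
# Minimal sketch of a size-bounded, incrementally maintained view.
# BoundedView and max_rows_per_key are illustrative names, not from [1].

class BoundedViewError(Exception):
    """Raised when an update would break the declared view-size invariant."""

class BoundedView:
    def __init__(self, max_rows_per_key):
        self.max_rows_per_key = max_rows_per_key  # declared bound on view size per key
        self.rows = {}                            # key -> list of materialized rows

    def insert(self, key, row):
        # Incremental maintenance touches only the affected key, so the work
        # per update is bounded by max_rows_per_key, not by the database size.
        bucket = self.rows.setdefault(key, [])
        if len(bucket) >= self.max_rows_per_key:
            raise BoundedViewError(
                f"view for key {key!r} would exceed bound {self.max_rows_per_key}")
        bucket.append(row)

    def lookup(self, key):
        # Answering a query reads at most max_rows_per_key rows,
        # independent of the total amount of data stored.
        return list(self.rows.get(key, []))

if __name__ == "__main__":
    followers = BoundedView(max_rows_per_key=2)
    followers.insert("alice", "bob")
    followers.insert("alice", "carol")
    print(followers.lookup("alice"))
    try:
        followers.insert("alice", "dave")
    except BoundedViewError as err:
        print("rejected:", err)
```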
In [4], the accessible portions of the schema are defined, and answering queries under both access and integrity constraints is studied in depth. The work emphasizes a query-independent property called super-extractability: a part of the schema is super-extractable if a superset of its values can be extracted using the available access patterns. In this setting, a query is logically access-rewritable if it can be answered by running it over the accessible data alone. A method is proposed for deciding whether a part of the schema can be extracted using the provided access methods while exploiting the constraints.
The Performance Insightful Query Language (PIQL), proposed in [2], is a declarative language that keeps queries independent of dataset size by computing an upper bound on the number of key/value store operations required for each query. PIQL extends SQL (Structured Query Language) with additional bounding information supplied by the programmer. It provides pagination for handling otherwise unbounded requests and, using relationship cardinality constraints declared in the database schema, also bounds intermediate results. It further offers a service-level-objective adherence prediction model that uses the query plan and the computed operation bounds to estimate whether a scale-independent query can meet its service level goals.
A simple SQL extension lets the developer declare relationship cardinalities and the corresponding size limits. For such queries, the database programmer prefers a slower bounded plan over an unbounded one to prevent performance degradation; if the compiler cannot produce a bounded plan for a query, it attempts to bound the computation in whatever way it can. PIQL, as presented in [3], places strict limits on the number of I/O operations needed to answer any query. It was created with large-scale data-intensive applications in mind, is built on a subset of SQL, and yields reliable performance predictions along with support for modifying entities and relationships in web applications.
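As an illustration of the bounding idea only (not PIQL's actual planner), the sketch below computes a worst-case number of key/value operations for a lookup that traverses declared relationships and rejects plans whose bound exceeds a budget. The schema, cardinality limits, and budget are hypothetical.

```python
# Illustrative static bounding of key/value operations from declared
# relationship cardinalities; the limits and budget below are made up.

CARDINALITY_LIMITS = {
    ("User", "follows"): 100,   # a user follows at most 100 users
    ("User", "posts"):   50,    # a user has at most 50 posts
}

MAX_OPS_PER_QUERY = 10_000      # budget on key/value store operations


def bound_ops(path):
    """Worst-case number of key/value gets for a lookup that starts at one
    entity and traverses the given relationships in order."""
    ops, fanout = 1, 1          # one get for the root entity
    for rel in path:
        fanout *= CARDINALITY_LIMITS[rel]
        ops += fanout           # one get per entity reachable at this step
    return ops


def check_plan(path):
    ops = bound_ops(path)
    if ops > MAX_OPS_PER_QUERY:
        raise ValueError(f"plan rejected: worst case {ops} ops exceeds budget")
    return ops


if __name__ == "__main__":
    # "Who does alice follow?" is bounded by 1 + 100 operations.
    print(check_plan([("User", "follows")]))
    # "Posts of everyone alice follows" is bounded by 1 + 100 + 100*50 operations.
    print(check_plan([("User", "follows"), ("User", "posts")]))
```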
The methodologies in [5, 6] explore how to answer queries under limited access patterns. In particular, [5] discusses rewriting queries with restricted access patterns in the presence of integrity constraints; an algorithm over views is provided that finds an exact executable plan when one exists, or otherwise a minimal containing plan. In [6], the complete answer to a query is computed using the binding patterns of the relations, where a relation's binding pattern restricts the attributes through which the relation can be accessed.
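The small sketch below (hypothetical relation and data) illustrates what a binding pattern restricts: a relation that can only be accessed with its first attribute bound, so a query must supply that value instead of scanning the relation.

```python
# Sketch of a relation with binding pattern "bf": the first attribute must be
# bound (an input) and the second is free (an output). Data is illustrative.

EMPLOYEE = {             # emp_id -> department; accessible only by emp_id
    "e1": "sales",
    "e2": "engineering",
}

def access_employee(emp_id):
    """Legal access: emp_id is bound, the department is returned."""
    return EMPLOYEE.get(emp_id)

def departments_of(emp_ids):
    """Answer 'department of each given employee'. A query such as
    'list all departments' has no executable plan under this pattern,
    because it would require accessing EMPLOYEE with emp_id unbound."""
    return {e: access_employee(e) for e in emp_ids}

if __name__ == "__main__":
    print(departments_of(["e1", "e2"]))
```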
The technique in [7] proposes a Scalable Consistency Adjustable Data Storage (SCADS) architecture that allows users to specify application-specific requirements, uses utility computing to provide cost-effective scale-up and scale-down, and applies machine-learning algorithms to handle performance problems and predict the resource requirements of new queries before they are executed.
SCADS combines three techniques that bear on data scale independence: a performance-safe query language, a declarative consistency/performance tradeoff, and scaling up and down using machine learning. The performance-safe query language is scale-aware, enabling effective web programming by ensuring scalability and predictability. The declarative consistency/performance tradeoff lets developers state the consistency requirements of an application in terms of the performance SLAs (Service Level Agreements) it must meet. Scaling up and down with machine learning uses machine-learning models to add and remove capacity effectively in order to meet those SLAs.
MapReduce jobs are generally arbitrary computations written in a general-purpose programming language, and MapReduce-based systems typically do not handle iterative or complex queries that require access to a large number of collections; query optimization over MapReduce is therefore not appropriate for all types of queries. In paper [8], the Hadoop framework is extended and MapReduce is used to parallelize queries across nodes. The resulting system, HadoopDB, retains fault tolerance and can operate in a variety of environments, and for improved efficiency the majority of query processing is performed inside the database engine.
HadoopDB has a data storage layer and a data processing layer; the storage layer is a block-structured file system managed by a central NameNode. The Data Loader component globally partitions data according to a specified partition key. A database connector serves as the link between the task trackers and the independent database instances on the nodes: each MapReduce job supplies the connector with a SQL query together with connection parameters such as the JDBC driver, query fetch size, and other query tuning parameters. The catalog component keeps track of database metadata and stores it in HDFS (Hadoop Distributed File System) as an XML (Extensible Markup Language) file.
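A minimal sketch of the global partitioning step such a data loader performs: rows are assigned to node-local partitions by hashing a declared partition key. The hashing scheme and data below are illustrative, not HadoopDB's actual loader.

```python
# Illustrative hash partitioning of rows by a partition key, in the spirit
# of a data loader that spreads data across database nodes.
from collections import defaultdict
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def partition_id(key, num_nodes=NUM_NODES):
    # Stable hash so the same key always lands on the same node.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition_rows(rows, key_column):
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_id(row[key_column])].append(row)
    return partitions

if __name__ == "__main__":
    rows = [{"customer_id": i, "amount": 10 * i} for i in range(10)]
    for node, chunk in sorted(partition_rows(rows, "customer_id").items()):
        print(f"node {node}: {len(chunk)} rows")
```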
HaLoop [9] is a modified Hadoop MapReduce framework that not only extends MapReduce with programming support for iterative queries, but also improves efficiency by making the task scheduler loop-aware and by adding caching mechanisms. It offers a parallel, distributed system for large-scale iterative data processing applications.
The programming model and design of Twister [10] extend the MapReduce runtime to support iterative queries and to compute them efficiently. HaLoop itself is built on top of the Hadoop platform, and its optimizations include a loop-aware scheduler, caching of loop-invariant data, and caching for fast fixpoint verification. In Twister, data is read from the local disks of the worker nodes and intermediate data is held in the workers' distributed memory; the outputs of the map tasks are delivered to the corresponding reduce tasks through a broker network, where they are buffered before the reduce computation executes, on the assumption that the intermediate data fits in distributed memory.
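To show why caching loop-invariant data matters for iterative workloads, the toy sketch below runs a PageRank-style iteration in which the static link structure is loaded once and reused across iterations, while only the small rank state changes. The graph and damping factor are made-up examples, not code from HaLoop or Twister.

```python
# Toy iterative computation in the spirit of loop-aware MapReduce systems:
# the loop-invariant input (the link graph) is cached once, and only the
# small mutable state (the ranks) changes between iterations.

LINKS = {                  # loop-invariant data
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
DAMPING = 0.85

def load_links_once(cache={}):
    # Stands in for reading from a worker's local disk / in-memory cache:
    # the expensive load happens only on the first call.
    if not cache:
        cache.update(LINKS)
    return cache

def pagerank(iterations=10):
    links = load_links_once()
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        contrib = {page: 0.0 for page in links}
        for page, outs in links.items():        # "map" over the cached invariant data
            for out in outs:
                contrib[out] += ranks[page] / len(outs)
        ranks = {p: (1 - DAMPING) / len(links) + DAMPING * c  # "reduce"
                 for p, c in contrib.items()}
    return ranks

if __name__ == "__main__":
    for page, rank in sorted(pagerank().items()):
        print(page, round(rank, 3))
```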
The approach in [11] proposes an optimization framework for SQL-like MapReduce queries. It focuses on the MapReduce Query Language, which is expressive enough to capture much of the computation in declarative form and is therefore easier to optimize. The paper describes how the algebraic forms extracted from queries are mapped onto the optimization framework.
Many algebraic optimizations are also addressed, such as fusing cascades of MapReduce jobs into a single job and deriving a combine function from the structure of a MapReduce job's reduce function. For deductive and relational database systems, [12] proposes incremental algorithms for maintaining views computed from base relations. A counting algorithm tracks the number of alternative derivations of each derived tuple in the view and works under both set and duplicate semantics during query evaluation. The paper also explores an algorithm for nonrecursive views with negation and aggregation; it produces an optimal answer in the sense that it computes exactly the view tuples to be inserted or deleted. Another algorithm, the rederive algorithm, handles incremental maintenance of recursive views.
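A minimal sketch of the counting idea for a non-recursive view: each view tuple carries the number of alternative derivations, so a deletion removes a tuple only when its count drops to zero. The view (a simple join) and the data are illustrative, not taken from [12].

```python
# Counting-based incremental maintenance of V(x, z) = R(x, y) JOIN S(y, z).
# Each view tuple stores how many derivations currently support it.
from collections import Counter

R = {("a", 1), ("b", 1)}
S = {(1, "x"), (1, "y")}

def derivations(r, s):
    """All derivations of view tuples obtainable from the given R and S facts."""
    return Counter((x, z) for (x, y1) in r for (y2, z) in s if y1 == y2)

view = derivations(R, S)          # initial materialization with counts

def insert_S(t):
    S.add(t)
    for tup, cnt in derivations(R, {t}).items():
        view[tup] += cnt          # add only the new derivations

def delete_S(t):
    for tup, cnt in derivations(R, {t}).items():
        view[tup] -= cnt          # remove the lost derivations
        if view[tup] == 0:
            del view[tup]         # the tuple disappears only at count zero
    S.discard(t)

if __name__ == "__main__":
    print(dict(view))             # four tuples, each with one derivation
    delete_S((1, "x"))
    print(dict(view))             # ('a', 'x') and ('b', 'x') are gone
```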
Paper [13] focuses on recursive delta-based computation: deltas propagate changes between iterations so that state can be updated incrementally and in an extensible manner. The paper presents a programming model and details how such queries are implemented and optimized in the REX runtime, which also handles failures gracefully. Comet [14] is a cost-based optimizer that shares computations at both the SQL and MapReduce levels to eliminate redundancy, and [15] describes a rule-based optimizer that exploits similarities between input tables and between operators sharing the same partition key. Most general-purpose database optimizers, however, pay little attention to iterative queries.
The design of efficient MapReduce algorithms for data processing, deep learning, and relational joins is described in [16]. The main function of a MapReduce program is often executed on a single master machine, where data may be preprocessed before the map functions are invoked. The paper discusses MapReduce algorithms for data mining tasks such as frequent pattern mining, sorting, probabilistic modeling, and graph analysis. HiveQL, a SQL-like declarative language, is explored in depth in [17]; it supports tables containing primitive types, collections such as arrays and maps, and nested compositions of these.
Hive contains a metastore that holds the schemas and statistics needed for data exploration, query compilation, and query optimization. The query compiler generates the execution plan using the metadata contained in the metastore, and the resulting tasks are executed in dependency order: a task runs only after all of its prerequisite tasks have completed. A map/reduce task serializes its portion of the plan into an XML file referred to as plan.xml; this file is added to the task's job cache, after which Hadoop instances of ExecMapper and ExecReducer are created.
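The dependency-ordered execution described above can be sketched as a simple topological scheduler. The task graph below is a made-up example, not Hive's actual task representation.

```python
# Sketch of dependency-ordered task execution: a task runs only after all of
# its prerequisite tasks have finished. The plan below is hypothetical.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

TASK_DEPS = {
    "stage-1":   [],            # scan + partial aggregate
    "stage-2":   ["stage-1"],   # final aggregate, needs stage-1's output
    "move-task": ["stage-2"],   # write results to the target location
}

def run(task):
    print(f"running {task}")

if __name__ == "__main__":
    # static_order() yields each task only after its prerequisites.
    for task in TopologicalSorter(TASK_DEPS).static_order():
        run(task)
```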
In [18], incremental view maintenance is addressed: a ring of databases is constructed and used as the basis of a query calculus for efficient aggregate queries. The calculus inherits the ring's properties, including a normal form for polynomials and the computation of inverses for delta queries, and it eliminates costly query operators such as joins from incremental view maintenance. The algebraic structure of the ring of databases is thus lifted into an aggregate query calculus.
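The effect of delta queries on aggregates can be shown with a much simpler example than the ring construction of [18]: a SUM view kept current by applying an additive delta per update, with deletions handled through the additive inverse. The table and view are illustrative.

```python
# Delta-based maintenance of an aggregate view V = SUM(price * qty) over a
# sales table. Deletions use the additive inverse of the insertion delta;
# this is a simplification, not the ring-based calculus of [18].

sales = []          # base table: list of (price, qty)
view_sum = 0.0      # materialized aggregate

def delta(price, qty):
    return price * qty

def insert_sale(price, qty):
    global view_sum
    sales.append((price, qty))
    view_sum += delta(price, qty)     # apply the delta; no rescan of `sales`

def delete_sale(price, qty):
    global view_sum
    sales.remove((price, qty))
    view_sum += -delta(price, qty)    # inverse element handles the deletion

if __name__ == "__main__":
    insert_sale(10.0, 3)
    insert_sale(2.5, 4)
    print(view_sum)   # 40.0
    delete_sale(10.0, 3)
    print(view_sum)   # 10.0
```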
The technique in [19] discusses data-intensive processing and large-data issues in depth, and illustrates distinct solutions for those problems. When a dataset is too large to be held in memory for data-intensive retrieval, it must be stored on disk; as a result, it is preferable to avoid random data access and to organize data processing computations sequentially. The paper also discusses data center reliability. The methodology suggested in [20] gives a detailed discussion of a scalable distributed architecture for learning models from massive datasets: Parallel Learning of Tree Ensembles with MapReduce describes and implements distributed computations using the MapReduce model.
In [21], a new framework called Spark is proposed that supports applications which reuse a working set of data across multiple parallel operations while retaining MapReduce's scalability and fault tolerance. Spark introduces a new abstraction, Resilient Distributed Datasets (RDDs): read-only collections of objects partitioned across machines that can be rebuilt if a partition is lost. Most systems used to execute large-scale data-intensive applications adopt an acyclic data-flow model, which is ineffective for such applications; the reuse of data across several parallel operations is the key subject of this paper.
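The data-reuse pattern that motivates Spark can be illustrated with a short PySpark sketch: an RDD is cached once and then consulted by several parallel operations, so the filtered data is not recomputed for each action. The sample records and filter terms are placeholders.

```python
# PySpark sketch of reusing a cached RDD across several parallel operations.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-reuse-sketch")

# Tiny in-memory stand-in for a large log dataset (placeholder data).
lines = sc.parallelize([
    "INFO start", "ERROR timeout on node 3", "ERROR disk full", "INFO done",
])

# A derived RDD, cached so later operations reuse it instead of recomputing.
errors = lines.filter(lambda l: "ERROR" in l).cache()

# Several parallel operations over the same cached RDD; a lost partition
# can be rebuilt from its lineage (parallelize + filter).
print(errors.count())                                    # 2
print(errors.filter(lambda l: "timeout" in l).count())   # 1

sc.stop()
```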