The Hadoop Distributed File System (HDFS) makes the input data available to multiple nodes. The input data can be extracted from multiple sources in semi-structured or unstructured form and loaded into HDFS by splitting the files into multiple parts. These parts are processed using map and reduce operations, and the generated data is finally extracted from HDFS and stored in downstream systems. While working with huge volumes of data processed across multiple nodes, there is a risk of losing quality data or introducing unrelated data, which results in quality issues at each stage of the testing process. Data functional testing involves testing the data at three phases: validation of Hadoop pre-processing, validation of the Hadoop map-reduce process, and validation of the extracted data stored in downstream systems, as shown in the Hadoop Map-Reduce architecture.
Figure 1 shows the Hadoop framework for distributed processing of large datasets using map and reduce on any node of the system. In the initial step the pre-loaded input data is loaded into HDFS for processing; the map phase splits the data and stores the intermediate key-value pairs emitted by the mappers. These key-value pairs are shuffled by key prior to the reduce phase. The reduce phase performs the aggregation and stores the generated output in the local file system. The validation process can be illustrated as follows:
- Testing Whiz connects to the data source in import connection mode.
- The import option brings the data into Testing Whiz from the connected data source.
- It stores and compresses the data in the semi-structured data format.
- Distributing this file pushes the data into the big data services supported by the Testing Whiz backend.
The semi-structured data present on the local drive is imported into the Testing Whiz application. The semi-structured data used here comprises 22,535 rows, each with 7 attributes. The file occupies 5.2 GB of storage and needs to be loaded and processed in the test suite as a string operation in the test project. The proposed testing method uses variables such as set, compare, length, text, string, and data type to assess the characteristics of the data source. This procedure is carried out in the test command wizard, and the functional flow is presented in Figure 2. The data is then passed into the Test Editor for further validation, where the test commands are generated. These commands are used to express input, output, click, if conditions, write messages, open page, perform, and get operations. This scenario is presented in Figure 3.
Validation of Hadoop Pre-processing through Hive
Semi-structured data can be stored in HDFS and analysed against the business logic using Hive queries. The source data, in Excel format, is stored in the local system directory as "F:\Bigdata\population2020.xls". Here the source file must be loaded into the target file for the proposed validation. The process consists of key steps such as deduplication, row validation, extra-record mismatch, and column validation. Table 4.1 lists the data types of the attributes, which are used to implement the source-to-target mapping. These operations, implemented through Hive queries on a typical example, are listed in Table 4.2. In this example, a semi-structured sample data set named "DB1" with identical schemas needs to be validated. It contains a table named "Users" with seven columns (LocID, Location, Time, AgeGrp, PopMale, PopFemale, PopTotal), as displayed in Table 4.1.
Table 4.1: Data types of Data.xls

| Attribute | Data type |
|---|---|
| LocID | int |
| Location | String |
| Time | Timestamp |
| AgeGrp | Int |
| AgeGrpStart | Int |
| PopFemale | Long |
| PopTotal | Long |
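As a hedged sketch of the target side of this mapping, a Hive staging table could be declared with the data types listed in Table 4.1 (the database name, table name, and field delimiter are illustrative assumptions, not the framework's actual DDL):

```sql
-- Illustrative Hive staging table following the data types in Table 4.1
CREATE TABLE IF NOT EXISTS db.users_staging (
    LocID        INT,
    Location     STRING,
    `Time`       TIMESTAMP,   -- backticked because Time can clash with reserved words
    AgeGrp       INT,
    AgeGrpStart  INT,
    PopFemale    BIGINT,      -- "Long" in Table 4.1 maps to Hive's BIGINT
    PopTotal     BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';   -- delimiter assumed
```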
Table 4.2: Sample semi-structured data

| LocID | Location | Time | AgeGrp | PopMale | PopFemale | PopTotal |
|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2000 | 0-4 | 2012.352 | 1925.856 | 3938.208 |
| 4 | NA | NA | NA | NA | NA | NA |
| 8 | Albania | 2000 | 09-May | 168.511 | 158.977 | 3938.208 |
| 50 | Bangladesh | 2000 | 0-4 | 8559.564 | 8219.196 | 3095.612 |
| 356 | India | 2000 | 30-34 | 39544.76 | 36322.17 | 75866.94 |
| 4 | Afghanistan | 2000 | 50-54 | 251.974 | 269.793 | 521.767 |
| 51 | Armenia | 2000 | 09-May | -200.345 | 132.02 | 2534.159 |
| 36 | Australia | 2000 | 09-May | 706.907 | 653.649 | 1686.971 |
| 52 | Barbados | 2000 | 100+ | 0.001 | 0.008 | 1360.827 |
| 764 | Thailand | 2000 | -7.5-79 | 314.754 | 384.148 | 1110.062 |
| 50 | Bangladesh | 2000 | 0-4 | 8559.564 | 8219.196 | 908.023 |
| 356 | India | 2000 | 15-19 | 56474.45 | 52292.7 | 108767.2 |
| 356 | India | 2000 | 20-24 | 49754.06 | 46016.39 | 95770.44 |
The metadata (presented in Table 4.2) is loaded with the default name "dirrockid" in the master server. This server runs a Linux-based operating system on which the storage and processing management services are handled. Query 1 stores the source file into the target master HDFS server. The column count of the source data can then be checked for equality against the column count in the master server, as described in Table 4.3. After data loading, the following variable operations are performed.
hdfs dfs -put E:/PRK/Data.xls dirrockid

Query 1: Source file to master server storage
Table 4.3: Column count of source and master data sets

| Number of columns in source | Number of columns in master server |
|---|---|
| 13 | 13 |
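One hedged way to obtain these counts on the Hive side, assuming Hive 3.x where the INFORMATION_SCHEMA database is available (the schema name "db" is an illustrative assumption):

```sql
-- Count the columns of the staged master table via Hive's information schema (Hive 3.x assumed);
-- on older versions, DESCRIBE db.dirrockid lists the columns instead.
SELECT COUNT(*) AS column_count
FROM   information_schema.columns
WHERE  table_schema = 'db' AND table_name = 'dirrockid';
```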
Apache Hive, being a batch processing engine, does not support primary, foreign, or unique key constraints, so duplicate records can be inserted into a Hive table. Since there are no constraints to ensure uniqueness, a table that has been loaded with the same data twice can be de-duplicated in several ways. The following methods explain the procedure to identify and remove duplicate records, or rows, from a Hive table. Deduplication often refers to the elimination of redundant sub-files (also known as chunks, blocks, or extents). Unlike compression, the data itself is not changed; only the storage consumed by identical data is eliminated. Deduplication offers a significant advantage in terms of reduced storage and network bandwidth and promises increased scalability, since it eliminates excess copies of data and significantly decreases storage capacity requirements. The Hive query can be run as an inline process while the data is being written into the storage system and/or as a background process that eliminates duplicates after the data has been written to the master server. Query 2 shows the Hive query used to perform the deduplication check.
SELECT LocID, COUNT(*) AS duplicate_count
FROM db.dirrockid GROUP BY LocID HAVING COUNT(*) > 1;

Query 2: Hive query to check for duplicates
The non-relational model is not dependent on primary keys. Thus, rows are not allocated a unique number and it is possible to have duplicated data. A duplication in the data is observed when the data of one row matches that of another row. To perform this operation, the correct record must be stored in the master server and any duplicated data should be taken into temporary storage. Otherwise, having a different number of rows for the same record in the semi-structured data would not be detected and would produce incorrect results. Therefore, each row is counted and the results are stored for use in subsequent stages of the validation process. Based on the data in the "Users" table shown in Tables 4.2 and 4.3, all records appear once in each table except for the records with LocID = 4 and LocID = 50 in "DB2". These records have identical rows in which the values of all columns are equal, so their counts are greater than 1, as shown in Table 4.4, while the count for all other rows is 1.
Table 4.4: Duplicate row count

| LocID | Duplicate count |
|---|---|
| 4 | 3 |
| 50 | 2 |
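The removal step mentioned above can be sketched as follows, under the assumption that exactly one copy of each distinct row should be retained (table and column names follow Query 2 and Table 4.2; this is an illustrative approach, not necessarily the framework's own):

```sql
-- Rewrite the table keeping a single copy of every distinct row (illustrative)
INSERT OVERWRITE TABLE db.dirrockid
SELECT DISTINCT LocID, Location, `Time`, AgeGrp, PopMale, PopFemale, PopTotal
FROM db.dirrockid;
```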
After the deduplication step, a new database is created in the master server. Each database contains the same tables and data as the original databases before deduplication, but with one additional column, "row_count", that shows the number of times each row appears in a table. Table 4.5 shows the schema of the newly stored databases "DDB1" and "DDB2" (deduplicated copies of "DB1" and "DB2", respectively), while Tables 4.5 and 4.6 show their data counts. Query 3 shows the creation of the row count used for this validation.
CREATE DATABASE IF NOT EXISTS DDB1;
CREATE TABLE DDB1.Users AS
SELECT LocID, Location, `Time`, AgeGrp, PopMale, PopFemale, PopTotal, COUNT(*) AS row_count
FROM db.dirrockid
GROUP BY LocID, Location, `Time`, AgeGrp, PopMale, PopFemale, PopTotal;

Query 3: Creating the deduplicated table with the row count column
Table 4.5: Row count of population2020.xls

| LocID | Location | Time | AgeGrp | PopMale | PopFemale | PopTotal | Row_count |
|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2000 | 0-4 | 2012.352 | 1925.856 | 3938.208 | 1 |
| 50 | Bangladesh | 2000 | 0-4 | 8559.564 | 8219.196 | 3095.612 | 1 |
| 356 | India | 2000 | 30-34 | 39544.76 | 36322.17 | 75866.94 | 1 |
| 4 | Afghanistan | 2000 | 50-54 | 251.974 | 269.793 | 521.767 | 1 |
| 356 | India | 2000 | 15-19 | 56474.45 | 52292.7 | 108767.2 | 1 |
| 356 | India | 2000 | 20-24 | 49754.06 | 46016.39 | 95770.44 | 1 |
The results (Table 4.5) show that a different number of records is detected for the users with LocID 4, 50, and 356. Without deduplication, the row validation executed as the next step in the validation process would conclude that the data related to those users in "DB1" is identical, which would be wrong, because the master server contains duplicate entries that need to be accounted for. To avoid such false positives, it is important that databases are deduplicated prior to row validation.
Extra Record Report: this report contains the list of extra records on the source side and the target side. It shows that, out of the total records validated by the validation engine, a few records are identified as extra records in the source dataset; for example, the record with LocID 50 is an extra record, which the Hive query can then eliminate from the master server. Mismatched Record Report: misplaced records are those whose values on the target database side were corrupted or set to null. The Misplaced Record Report displays all the misplaced records.
-- Rows present in DB1.Users but missing from DB2.Users
SELECT LocID, Location FROM DB1.Users
MINUS
SELECT LocID, Location FROM DB2.Users;

-- Equivalent formulation using UNION ALL
SELECT LocID, Location
FROM (
    SELECT 1 AS tab, LocID, Location, `Time`, AgeGrp, PopMale, PopFemale, PopTotal FROM DB1.Users
    UNION ALL
    SELECT 2 AS tab, LocID, Location, `Time`, AgeGrp, PopMale, PopFemale, PopTotal FROM DB2.Users
) t
GROUP BY LocID, Location
HAVING COUNT(CASE WHEN tab = 2 THEN 1 END) = 0;

Query 4: Mismatched Record Report
The report in Table 4.6 shows the misplaced record report: the Location, PopMale, and AgeGrp columns received misplaced values. For the Location column, the value on the source database side should conform to the declared data type, but the source data has no value in this column and the value on the target side is also null. The same is the case for the PopMale and AgeGrp columns. Similarly, the framework identifies records that are set to null and produces a summary report.
CREATE EXTERNAL TABLE db.users_ext (          -- table name and column definitions illustrative
    Location STRING, `Time` STRING, AgeGrp STRING,
    PopMale DOUBLE, PopFemale DOUBLE, PopTotal DOUBLE)
PARTITIONED BY (LocID INT)
LOCATION '<hdfs directory name>';             -- directory placeholder kept from the original

Query 5: Creating an external table over an HDFS directory
Table 4.6: Source to target value validation

| LocID | Source column name | Source value | Target column name | Target value |
|---|---|---|---|---|
| 4 | Location | null | Last name | Null |
| 8 | AgeGrp | 09-May | AgeGrp | Null |
| 51 | PopMale | -200.345 | PopMale | Null |
| 52 | AgeGrp | 100+ | AgeGrp | Null |
| 764 | AgeGrp | -7-5-79 | AgeGrp | Null |
Report Analysis: based on the business need, the framework provides a recommendation to re-migrate data. For example, out of the total records validated by the framework, if 30% of the records are duplicated or set to null, the framework recommends re-migrating the data because the data health has degraded. In other words, based on the validation status of each record in the row delta calculation, applying the changes in sequence to the related table in the first deduplicated database will produce the data of the same table in the second one. To implement this representation using Hive, delta values are calculated and stored in the newly created delta tables based on Table 4.7. The row delta table contains all columns from both tables being compared, along with the validation status that indicates what kind of change ("insert" or "delete") has occurred on the data in those tables. Table 4.7 shows the schema of the newly created validation database "ValidationDB" with the "Users row delta" table.
Table 4.7: Validation of Data.xls

| LocID | Location | Time | AgeGrp | PopMale | PopFemale | PopTotal | Validation |
|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2000 | 0-4 | 2012.352 | 1925.856 | 3938.208 | Insert |
| 4 | NA | NA | NA | NA | NA | NA | Delete |
| 8 | Albania | 2000 | 09-May | 168.511 | 158.977 | 3938.208 | Delete |
| 50 | Bangladesh | 2000 | 0-4 | 8559.564 | 8219.196 | 3095.612 | Insert |
| 356 | India | 2000 | 30-34 | 39544.76 | 36322.17 | 75866.94 | Insert |
| 4 | Afghanistan | 2000 | 50-54 | 251.974 | 269.793 | 521.767 | Insert |
| 51 | Armenia | 2000 | 09-May | -200.345 | 132.02 | 2534.159 | Delete |
| 36 | Australia | 2000 | 09-May | 706.907 | 653.649 | 1686.971 | Insert |
| 52 | Barbados | 2000 | 100+ | 0.001 | 0.008 | 1360.827 | Delete |
| 764 | Thailand | 2000 | -7.5-79 | 314.754 | 384.148 | 1110.062 | Insert |
| 50 | Bangladesh | 2000 | 0-4 | 8559.564 | 8219.196 | 908.023 | Delete |
| 356 | India | 2000 | 15-19 | 56474.45 | 52292.7 | 108767.2 | Insert |
| 356 | India | 2000 | 20-24 | 49754.06 | 46016.39 | 95770.44 | Insert |
In Table 4.8, a full join between the tables in "DDB1" and "DDB2" is applied based on a unique value that identifies each record; the related validation status is then calculated, and the data types of DB1 and DB2 are mapped.
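A minimal sketch of this step is shown below, assuming the deduplicated tables are DDB1.Users and DDB2.Users and that each carries a per-row identifier column row_key (derived with the "multi hash" function discussed below; all names here are illustrative):

```sql
-- Derive the Insert/Delete validation status by full-outer-joining the two deduplicated tables
SELECT COALESCE(d1.LocID, d2.LocID) AS LocID,
       CASE WHEN d2.row_key IS NULL THEN 'Delete'   -- present only in DDB1: must be deleted
            WHEN d1.row_key IS NULL THEN 'Insert'   -- present only in DDB2: must be inserted
       END AS Validation
FROM DDB1.Users d1
FULL OUTER JOIN DDB2.Users d2
  ON d1.row_key = d2.row_key
WHERE d1.row_key IS NULL OR d2.row_key IS NULL;     -- keep only the rows that differ
```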
Table 4.8: Mapping validation with data types

| DB1 data type | DB2 data type | Validation status |
|---|---|---|
| int | int | Mapped |
| String | String | Mapped |
| Timestamp | Timestamp | Mapped |
| int | int | Mapped |
| int | int | Mapped |
| Long | Long | Mapped |
Lastly, the result is filtered to obtain the rows that are represented in only one of the two tables. Note that, since there are no primary keys to identify the rows in big data, a new hash function called "multi hash" was implemented to generate a unique identifier for each row based on the values in all of its columns. The hash value obtained from this UDF is then used to compare and join similar records in those tables. A detailed description of the implemented "multi hash" function can be found in Appendix B.1. The source file is then stored in the HDFS location for the map-reduce operations.
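The "multi hash" UDF itself is detailed in Appendix B.1. Purely as an illustrative stand-in, a comparable per-row identifier can be produced with Hive built-ins by concatenating all column values and hashing the result (column list as in Table 4.2; this is not the paper's UDF):

```sql
-- Illustrative stand-in for the "multi hash" UDF: build one string from all columns and hash it
SELECT md5(concat_ws('|',
           cast(LocID AS string), Location, cast(`Time` AS string), cast(AgeGrp AS string),
           cast(PopMale AS string), cast(PopFemale AS string), cast(PopTotal AS string))) AS row_key,
       u.*
FROM DDB1.Users u;
```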
Table 4.8: Master server storage data sets

| LocID | Location | Time | AgeGrp | PopMale | PopFemale | PopTotal |
|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2000 | 0-4 | 2012.352 | 1925.856 | 3938.208 |
| 50 | Bangladesh | 2000 | 0-4 | 8559.564 | 8219.196 | 3095.612 |
| 356 | India | 2000 | 30-34 | 39544.76 | 36322.17 | 75866.94 |
| 4 | Afghanistan | 2000 | 50-54 | 251.974 | 269.793 | 521.767 |
| 356 | India | 2000 | 15-19 | 56474.45 | 52292.7 | 108767.2 |
| 356 | India | 2000 | 20-24 | 49754.06 | 46016.39 | 95770.44 |
Validation of Hadoop Map-Reduce Process
Once the validation process is completed, DB2 is loaded into HDFS. Figure 4.4 shows the Hadoop map-reduce job that processes the input files from different sources. Several issues can arise in these jobs: a data set may run correctly on a single node but not on multiple nodes, data may be aggregated incorrectly in the reduce phase, the output data format may be wrong, there may be coding errors in the map-reduce phase, or the mapper spill size may be misconfigured. Validation of the Hadoop map-reduce process therefore tunes the number of map and reduce tasks appropriately, validates the DB2 data processing and the generated output files (the output file format should be as specified in the requirements), compares the output file with the input source file formats, verifies the business logic configuration on a single node against multiple nodes, consolidates and validates the data after the reduce phase, validates the key-value pairs generated in the map and reduce phases, and verifies the shuffling activity thoroughly. The size of the mapper output is sensitive to disk I/O, network I/O, and memory pressure in the shuffle phase; using minimal map output keys and values, compressing the mapper output, and minimizing the mapper output give an observable performance gain. This can be achieved by filtering out records on the mapper side instead of the reducer side and by compressing the intermediate output of the mappers before transmitting it to the reducers, which reduces the disk I/O and network traffic from reading and moving files. A consolidation check after the reduce phase is sketched below.
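As one hedged example of the consolidation check mentioned above, assuming the map-reduce output has been registered as a Hive table named mr_output alongside the input table DB2.Users (both names are illustrative), the per-key aggregates of the two can be compared:

```sql
-- Compare per-location aggregates of the map-reduce output against the input data (illustrative names)
SELECT i.Location, i.input_total, o.output_total
FROM      (SELECT Location, SUM(PopTotal) AS input_total  FROM DB2.Users GROUP BY Location) i
LEFT JOIN (SELECT Location, SUM(PopTotal) AS output_total FROM mr_output GROUP BY Location) o
       ON i.Location = o.Location
WHERE o.output_total IS NULL                              -- location missing from the output
   OR abs(i.input_total - o.output_total) > 0.001;        -- or aggregated incorrectly
```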
Working of MapReduce
Figure 4.4 shows that MapReduce process execution can be classified into four phases, which are explained below. In this step the initial data is processed using a map-reduce operation to obtain the desired result. Map-reduce is a data processing concept for condensing large volumes of data into useful aggregated results. The required business logic is checked first on a standalone unit and then on the set of units.
- Validate the Map-Reduce process to ensure that the “key-value” pair is generated correctly.
- Check the aggregation and consolidation of data after performing "reduce" operation.
- Compare the output data with initial files to make sure that the output file was generated and its format meets all the requirements.
Table 4.9: Map-Reduce process

| Input sampling | Average map time (s) | Average shuffle time (s) | Average merge time (s) | Average reduce time (s) | Average total time (s) |
|---|---|---|---|---|---|
| Afghanistan | 18.0 | 16.4 | 7.3 | 1.66 | 43.36 |
| Bangladesh | 9.7 | 10.09 | 3.7 | 4.8 | 28.29 |
| India | 32.3 | 30.9 | 10.21 | 17.07 | 90.48 |
Hadoop is a framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks and distributes the blocks among the nodes in the cluster. Figure 4.5 describes the processing of data in Hadoop: Map/Reduce ships code to the nodes that hold the required data, and the nodes then process the data in parallel. This approach takes advantage of data locality, in contrast to a conventional HPC architecture, which usually relies on a parallel file system. Hadoop consists of many small sub-projects that belong to the category of infrastructure for distributed computing [5].
There are various problems in dealing with the storage of large amounts of data. Although the storage capacities of drives have increased massively, the rate at which data can be read from them has not improved as much. Using many pieces of hardware also increases the chance of failure. This can be mitigated by replication, i.e. creating redundant copies of the same data on different devices, so that in case of a failure a copy of the data is still available. The main remaining problem is combining the data read from different devices; many methods are available in distributed computing to handle this, but it is still quite challenging. Such problems are handled well by Hadoop [5]: the problem of failure is handled by the Hadoop Distributed File System, and the problem of combining data is handled by the MapReduce programming paradigm. MapReduce reduces the problem of disk reads and writes by providing a programming model that deals in computation with keys and values. Hadoop thus provides a reliable shared storage and analysis system: the storage is provided by HDFS and the analysis by MapReduce.
MapReduce is the programming paradigm that allows massive scalability. MapReduce basically performs two different tasks: the Map task and the Reduce task [5]. Figure 4.5 shows how a map-reduce computation executes: map tasks are given input from the distributed file system and produce a sequence of key-value pairs from that input, according to the code written for the map function. The generated values are collected by the master controller, sorted by key, and divided among the reduce tasks [5][6]. The sorting guarantees that the same key values end up at the same reduce task. The reduce tasks combine all the values associated with a key, working with one key at a time; again, the combination process depends on the code written for the reduce job. The user forks the master controller process and some number of worker processes at different compute nodes. The master controller creates some number of map and reduce tasks, usually as decided by the user program, and assigns the tasks to the worker nodes. The master process keeps track of the status of each map and reduce task. The failure of a compute node is detected by the master, since it periodically pings the worker nodes. All the map tasks assigned to that node are restarted, even if they had completed, because the results of those computations would be available only on that node for the reduce tasks. The status of each of these map tasks is set to idle by the master, and they are rescheduled on a worker when one becomes available. The master must also inform each reduce task that the location of its input from that map task has changed [5].
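Since Hive queries running on the MapReduce execution engine are compiled into exactly this kind of job, a simple aggregation over the sample table illustrates the model (table name as in the earlier examples): the map tasks emit (Location, PopTotal) key-value pairs and the reduce tasks sum the values per key.

```sql
-- When executed on the MapReduce engine, this compiles to one map-reduce job:
-- map emits (Location, PopTotal) pairs; reduce sums the values for each Location key.
SELECT Location, SUM(PopTotal) AS total_population
FROM DB1.Users
GROUP BY Location;
```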
Table 4.10: Comparison between average actual and estimated runtime

| Input data / LocID | Size of data (GB) | Location count actual runtime (s) | Location count estimated runtime (s) | Sort avg. actual runtime (s) | Sort estimated runtime (s) | Inverted index avg. actual runtime (s) | Inverted index estimated runtime (s) |
|---|---|---|---|---|---|---|---|
| 4 | 2 | 278.98 | 258.84 | 120.93 | 101.77 | 274.33 | 260.31 |
| 50 | 1 | 130.45 | 110.23 | 56.78 | 56.45 | 156.67 | 130.67 |
| 367 | 3 | 380.04 | 350.56 | 150.67 | 156.74 | 345.21 | 312.23 |
After each job is executed, its runtime is stored in the database according to the type of application and the size of the input data. Each table stores three runtimes, and from the information in the database the runtime of a new job is estimated. For instance, Table 4.10 shows the runtime history per LocID on 3 nodes (there is a corresponding table for each application in the database). Figures 4.5 and 4.6 show the estimated runtime and the actual runtime, respectively, for three benchmarks. According to Fig. 10, the error percentage decreases with increasing data size, which shows that the proposed method is suitable for big data; the maximum error rate is less than 12%. Figures 4.11, 4.12, and 4.13 show the error rate between the runtime estimated by the exponential averaging method and the actual runtime for Word Count, TeraSort, and Inverted Index with different numbers of nodes. The results show that the error rate is less than 5%.
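For reference, the standard exponential averaging estimate blends the most recent measured runtime with the previous estimate; the smoothing factor $\alpha \in [0,1]$ used here is an assumption, as its value is not stated in this section:

$$T_{n+1} = \alpha\, t_n + (1 - \alpha)\, T_n$$

where $t_n$ is the measured runtime of the $n$-th execution and $T_n$ is its estimated runtime.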
Table 4 portrays the applicability of data staging, ETL, and process validation for the conventional and proposed test procedures. The proposed big data test approach is capable of implementing all these validation stages, which signifies its capability compared with existing methods. Functional and non-functional testing can be done with all the possible testing approaches. In manual testing, the reliability of the process is purely dependent on the test engineer's knowledge and experience; the preference for manual testing is very limited since it is a time-consuming, complex, costly, and hectic procedure. As already discussed, the Selenium approach is limited to structured data only, and its capability ceases as the number of data sources rises. The proposed test procedure requires no coding and can perform the testing task quickly; it is quite reliable, can easily handle all types of data formats, and its performance is little affected even as the number of data sources grows. Structured and unstructured data formats and schemas at different sources such as Hive, Map-Reduce, Sqoop, and Pig are tested and verified using the Testing Whiz automation tool. Table 4 also describes the multiple statistical features attained using the conventional and proposed test routines; it can be observed that big data testing has statistically outperformed the conventional test methods. The result shown in Figure 7 has been slightly adjusted to account for project preparation time, execution time, and analytic time; the invalid defects found in the testing were due to the said error.
Figure 7 portrays the tools used by testers to identify discrepancies between the proposed method and conventional methods in processing structured data. It can be observed that the proposed big data testing through Testing Whiz can efficiently handle a huge number of test cases and requires less analytic time within a short execution time. The Selenium test procedures need the unstructured data to be converted into structured data; the processing tasks therefore consume more time and place a greater burden on the test engineer when comparing the expected results with the actual results.