Increasing dependency on technology is generating an immense volume of data. However, processing and analyzing such large amounts of data is difficult with conventional methods. Apache Hadoop is therefore designed to store and process datasets of various sizes, up to terabytes, petabytes, and even zettabytes [2]. Furthermore, big data analysis with Hadoop and MapReduce has been highly recommended for processing such datasets in order to enhance the computational and mining process [1].
A. Hadoop and Big Data
Many studies define big data as a large volume of data; in addition to volume, velocity and variety are other important features of big data [2]. In contrast, data mining is more about strategic decision making on structured data, whether the volume of that data is small or large [6].
There are three main types of big data: structured, unstructured, and semi-structured data [10]. Images, video, audio, and natural language are also considered types of big data [9].
Firstly, big data that follows a recognizable pattern and is easy to read and understand is classified as structured data, such as an Excel table or a table in a database [10]. Secondly, big data without a recognizable pattern, such as an email file, is classified as unstructured data [10]. Thirdly, semi-structured data is data in formats similar to XML and JSON [10].
The described types of big data are considered valuable sources for big data analysis using Apache Hadoop. Real-world examples of applying Apache Hadoop as a solution for big data analysis include the following [11]:
- In the healthcare domain: Apache Hadoop provides analysis and archiving of claims data and processes terabytes of log data generated by transactional systems.
- In the telecommunications domain: Apache Hadoop with HBase provides storage for billions of rows of call detail records, where terabytes of data are added on a monthly basis.
The real-world applications of Hadoop are varied; indeed, data of great variety, velocity, and volume is generated rapidly every day. Therefore, this paper highlights the theory and concepts of Apache Hadoop and MapReduce as big data analysis tools, as well as the main steps to set up Apache Hadoop ver. 3.1.1 on the Mac operating system.
B. Hadoop Architecture
Apache Hadoop is open-source software for the distributed processing of large datasets on a single node or across multiple nodes [3]. The choice of installation mode depends on the aim of the project and the volume of the dataset. Processing a large dataset across a cluster of machines requires the fully distributed mode, while the pseudo-distributed mode runs every Hadoop daemon on a single machine to simulate such a cluster.
Alternatively, local (standalone) mode allows the user to run Hadoop as a single node on the local system. The core components of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce [1]. HDFS is implemented in Java, and it divides each dataset file into blocks according to the file size [10]. HDFS manages the dataset using a NameNode and DataNodes [10]. The main role of the NameNode is to hold the file system metadata, such as the file name, permissions, and the location of each block of the dataset [12]. The NameNode is connected to the DataNodes, where the replicated data blocks themselves are stored and served for read and write operations (Figure 1) [12].
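As a brief, hedged illustration of this architecture, once a single-node Hadoop is running (the setup is described in the following sections), the block and replica layout tracked by the NameNode can be inspected from a bash terminal; the file name sample.csv and the /data directory below are hypothetical examples:
$ hdfs dfs -mkdir -p /data
$ hdfs dfs -put sample.csv /data/sample.csv
$ hdfs fsck /data/sample.csv -files -blocks -locations
The last command reports each block of the file together with the DataNode locations of its replicas, which makes the NameNode/DataNode division of labor visible.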
C. Hadoop Prerequisites
There are different methods to run Hadoop on different platforms, but most research articles and resources depend on Unix-like operating systems [7]. Accordingly, a Unix-based operating system (macOS) is used here to illustrate the steps for setting up Apache Hadoop. In addition, there are several Hadoop versions available to download, and each version has a different setup and configuration [8]. The latest version, Hadoop 3.1.1, is used to illustrate the setup and the remaining steps for big data analysis.
Installing Hadoop requires a step-by-step process in which the dependencies between steps are critical [3]. There are different methods to set up Hadoop depending on the big data analysis project; one of them is the single-node cluster, where Hadoop processes the data on the local system. Moreover, all setup methods require Java version 8 to be installed (Figure 2).
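A quick way to verify the Java requirement is shown below; the exact version string varies by JDK distribution, but it should report a 1.8.x (Java 8) release:
$ java -version
$ /usr/libexec/java_home -V
The second command is macOS-specific and lists all installed JDKs.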
Furthermore, SSH to localhost should be enabled, which on macOS means turning on the Remote Login sharing option (Figure 3).
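A common way to enable passwordless SSH to localhost, assuming no SSH key pair exists yet, is the following; the final command should log in without prompting for a password:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost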
It is important to ensure that Java is installed and that SSH is not blocked on port 22 (Figure 2) (Figure 3) [3]. Downloading the required Hadoop version and extracting the files manually can be complicated, but package management software such as Homebrew simplifies the installation and ensures that the latest updates are installed [5]. Homebrew simplifies the installation of Hadoop and keeps Hadoop and any other packages up to date. Simply type the following line in a bash terminal to install Homebrew (Figure 4).
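The exact bootstrap command is published on https://brew.sh and changes over time; at the time of writing it resembles the following one-liner, which downloads and runs the official install script:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"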
D. Hadoop Setup
To start the process of installing Hadoop, type the following command in a bash terminal (Figure 5).
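Although the exact command appears in Figure 5, it presumably mirrors the uninstall command shown at the end of this section:
$ brew install hadoop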
Installing Hadoop requires more than 700 MB of hard disk space, and it may be necessary to install additional packages such as Hive and Pig to simplify the big data analysis (Figure 6). An advantage of Homebrew is the ease of installing other packages and software by typing the following syntax:
$ brew install <name of the software>
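For example, the Hive and Pig packages mentioned above can be installed in the same way, assuming both are available as Homebrew formulas:
$ brew install hive
$ brew install pig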
Once the setup of Hadoop is completed, a confirmation output is shown.
To remove Hadoop, type the following:
$ brew uninstall hadoop (or the name of another package)
E. Hadoop Configuration
Hadoop has several files that need to be configured before running it in single-node mode [3]. Two main directories are involved: etc/hadoop, which contains the configuration files, and bin, which contains the HDFS control tools.
The first directory, etc/hadoop, contains the five configuration files that need the settings shown in (Figure 7)(Figure 8)(Figure 9)(Figure 10); it is reached with the following command:
$ cd /usr/local/Cellar/hadoop/3.1.1/libexec/etc/hadoop
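The property values themselves are shown in Figures 7-10; as a minimal sketch of what a single-node configuration typically contains, core-site.xml points clients at the local HDFS instance and hdfs-site.xml reduces the replication factor to 1. The address hdfs://localhost:9000 is a common convention, not a value taken from the figures:
$ cat core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
$ cat hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>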
The second directory, bin, contains several files that support the control of HDFS; it is reached with the following path:
$ cd /usr/local/Cellar/hadoop/3.1.1/libexec/bin
It is then required to format the NameNode, which is done only once after the Hadoop installation, using the following command:
$ hdfs namenode -format
F. Running Hadoop
Access the sbin directory in order to start the DFS:
$ cd /usr/local/Cellar/hadoop/3.1.1/libexec/sbin
$ ./start-dfs.sh
Starting the DFS launches the NameNode, the DataNode, and the secondary NameNode, which are the main components of HDFS.
$ ./start-yarn.sh
Starting YARN launches the other parts of Hadoop, which are the NodeManagers and the ResourceManager.
To check functionality and ensure that all Hadoop components are launched, type the following command (Figure 11).
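The command in Figure 11 is most likely jps, the standard JDK tool that lists running Java processes; assuming a healthy single-node setup, the five daemons started above should all appear in its output:
$ jps
Expected entries include NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.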
Open http://localhost:9870 in a web browser to view the NameNode web interface.
Open http://localhost:8088/cluster in a web browser to access the ResourceManager cluster page.
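When the analysis session is finished, the daemons can be stopped with the matching scripts in the same sbin directory:
$ ./stop-yarn.sh
$ ./stop-dfs.sh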