Existing research has extensively explored cascading encryption algorithms to improve data security in various contexts, including cloud storage and Big Data environments. Filaly et al. (2023) proposed a hybrid encryption algorithm for information security in Hadoop based on AES, CP-ABE, and RSA. While their hybrid method aims to enhance security for Big Data applications, the paper does not address the scalability issues of encrypting and processing large volumes of data efficiently. Aswathi et al. (2022) focused on securing Big Data in Hadoop using hybrid encryption (RSA and AES), addressing the limitations of AES encryption alone. Similarly, Viswanath and Krishna (2021) proposed a hybrid encryption framework for securing big data storage in a multi-cloud environment, but their scheme prevents unauthorized access only in the private cloud. Negi et al. (2023) introduced a hybrid cryptographic approach (AES integrated with ECC) for secure cloud-based file storage; key management and performance overhead remain challenging issues in that work. Lai et al. (2022) and Kumari and Malhotra (2022) explored secure storage of files on the cloud using hybrid cryptography but did not implement their schemes in a real-world scenario. Chaudhari et al. (2023) surveyed hybrid cryptography for secure file storage, highlighting its relevance and effectiveness.
On the other hand, Jain et al. (2019) introduced the SMR (Secure MapReduce) layer, and Gupta et al. (2023) introduced the FSMR (Fortified Secure MapReduce) layer between HDFS and MapReduce to secure the data; however, overall performance, scalability, and the complexity of the structure are the main issues of this approach.
These studies address various aspects of hybrid encryption and of introducing an extra layer between HDFS and the MapReduce framework in Big Data environments. We observe that securing HDFS data while optimizing speed remains a significant gap in this literature.
Justifications
The selection of the current research topic is motivated by the rising importance of security in the Big Data field, particularly within the HDFS framework. While the existing literature has explored hybrid encryption in diverse contexts and has added new layers to the Hadoop environment, there is a notable absence of studies focusing specifically on performance when securing HDFS data. This research aims to fill this gap by proposing and evaluating a new hybrid encryption scheme tailored for safeguarding data stored in HDFS, thereby contributing to the advancement of data security and performance practices in Big Data environments.
Summary of current work
This paper presents a novel approach to boost the security of HDFS data and improve performance by implementing a hybrid encryption scheme that combines the Twofish and AES algorithms with the MapReduce framework. The proposed method utilizes MapReduce for parallel encryption, aiming to mitigate the processing overhead associated with conventional encryption techniques. Experimental results validate the effectiveness of the proposed hybrid method for safeguarding sensitive data stored in HDFS while improving performance, addressing the identified gap in the literature.
Proposed Methodology
The proposed hybrid encryption model (combining Twofish with AES) with a MapReduce framework [27] aims to boost the security of HDFS data in the Big Data field. The model leverages the parallel processing capabilities of MapReduce to achieve efficient encryption and decryption of data stored in HDFS. The following components constitute the proposed model:
3.1. Twofish and AES Integration:
To protect the large amounts of data stored in HDFS, the proposed hybrid encryption technique combines the Twofish and AES algorithms. Compared with more conventional encryption methods, this double encryption achieves a higher level of data protection: Twofish provides strong encryption capabilities, while AES ensures compatibility and efficiency.
From a security point of view, Twofish is among the most popular symmetric block cipher algorithms today. Twofish has a 128-bit block size and a variable key size of 128 to 256 bits. Its predecessor is the Blowfish algorithm [18–20], and the two algorithms share a largely similar structure. Pre-computed key-dependent S-boxes and a very complex key schedule are two of Twofish's distinguishing characteristics. The n-bit key is separated into two halves: the first half is used as the encryption key, while the other half is used to modify the algorithm (the key-dependent S-boxes). The security performance of Twofish is better than that of AES [17, 18]. Like AES, Twofish supports key sizes of 128, 192, and 256 bits; however, Twofish uses a fixed 16 rounds for any key size.
A structural difference between the two ciphers is that AES uses a substitution-permutation network, whereas Twofish uses a Feistel network to encrypt the data. The structure of the Twofish algorithm is more complex but highly secure.
AES is the most widely used symmetric block cipher algorithm. It has a 128-bit block size; key lengths of 128, 192, or 256 bits; and 10, 12, or 14 rounds, respectively, to encrypt the data [17, 18, 21]. AES uses keys of different lengths to encrypt and decrypt data; for example, AES-128 uses a 128-bit key.
The two algorithms can be compared on how well they withstand competing demands: the most important concerns are security, resistance to attacks, and performance. Evaluating encryption algorithms also involves assessing implementation, scalability, and suitability, which are equally important considerations. AES is more efficient than Twofish in terms of hardware requirements: it needs less memory and fewer cycles to encrypt data, so from a speed point of view AES is better. In this paper, we integrate both algorithms to overcome their individual limitations and thereby provide a stronger encryption scheme for Big Data environments.
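As a concrete illustration of this integration, the minimal sketch below double-encrypts a single 16-byte block with Twofish and then AES. It assumes the third-party Python packages twofish and PyCryptodome are available; the helper names (hybrid_encrypt_block, hybrid_decrypt_block) are ours for illustration, and single-block ECB is used only to keep the sketch short, not as a recommended mode of operation.

import os
from twofish import Twofish            # pip install twofish
from Crypto.Cipher import AES          # pip install pycryptodome

def hybrid_encrypt_block(block16, tf_key, aes_key):
    # First layer: Twofish on one 16-byte block
    inner = Twofish(tf_key).encrypt(block16)
    # Second layer: AES over the Twofish ciphertext
    return AES.new(aes_key, AES.MODE_ECB).encrypt(inner)

def hybrid_decrypt_block(block16, tf_key, aes_key):
    # Reverse the cascade: remove the AES layer first, then Twofish
    inner = AES.new(aes_key, AES.MODE_ECB).decrypt(block16)
    return Twofish(tf_key).decrypt(inner)

tf_key, aes_key = os.urandom(16), os.urandom(16)   # two independent 128-bit keys
ct = hybrid_encrypt_block(b"exactly16bytes!!", tf_key, aes_key)
assert hybrid_decrypt_block(ct, tf_key, aes_key) == b"exactly16bytes!!"

Using two independently generated keys means an attacker must break both ciphers (or obtain both keys) to recover the plaintext, which is the intended benefit of the cascade.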
3.2. Hadoop Distributed Environment
Hadoop originated in 2006 as part of Nutch, a subproject of Lucene that provided a distributed-system framework, before becoming an independent project. In addition to being a distributed storage system for big data files, Hadoop possesses robust data processing capability [22]. Following the distributed framework concept, a large distributed computing cluster can be formed from low-cost commodity hardware. HDFS and the MapReduce programming model are the two parts that make up the Hadoop framework [23–25]. The MapReduce framework is based on the concept of parallel processing for handling massive data sets [23]: the Map function performs the mapping, and the Reduce function aggregates the mapped results, which is the most significant characteristic of MapReduce. Developers write MapReduce programs that run on the distributed system and effectively manage large data.
First, a file is divided into smaller parts known as blocks, which are saved on different datanodes. The namenode and datanodes are the two node types in the Hadoop framework, and all operations are performed between them. The workflow of the MapReduce architecture is as follows. At the beginning, the client sends an access request to the namenode; if the request is granted, the file name is resolved to the HDFS block IDs indicating where the blocks of the file are stored, and the client receives the list of block IDs. The MapReduce framework [26] then processes the file with the help of a mapper and a reducer. The mapper converts the file into key-value pairs [27]; these pairs are forwarded to partitions and sorted by key. An optional combiner can be used to reduce the work of the reducer: it combines the sorted key-value pairs and counts the values that share the same key. Finally, the partitioner divides the key-value pairs again and hands them to the reducer. The MapReduce framework uses a shuffle operation, which delivers each mapper's results to the appropriate reducer. After that, the reducer applies the reduce function and uploads the output to HDFS again. Figure 1 displays the Hadoop MapReduce architecture (Ahmed et al., 2020).
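To make this key-value flow concrete, the pair of Hadoop Streaming scripts below implements the classic word count; this is our own illustration of the mapper/reducer contract, not part of the proposed scheme. The framework shuffles and sorts the mapper's output by key before the reducer reads it.

#!/usr/bin/env python3
# mapper.py - emits one (word, 1) pair per word on stdin; Hadoop Streaming
# shuffles and sorts these pairs by key before they reach the reducer.
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives grouped and sorted by key, so per-word counts
# can be accumulated in a single pass (work a combiner can also pre-do).
import sys
current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")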
3.3. Data Encryption Process
Data stored in HDFS is encrypted using the proposed scheme before being written to disk. Initially, the large body of plaintext to be encoded is cut into sequential plaintext parts. These parts are then numbered in sequential order: the first plaintext part is identified as 1, the next as 2, and so on; in this example the final plaintext part is identified as 3.
The plaintext parts, together with the hybrid encryption technique and their identifiers, are then distributed across the servers, and each job performs the proposed scheme's operation shown in Fig. 2. The steps are as follows. Each plaintext part is encrypted with the proposed scheme in the Map layer: first, the plaintext part is encrypted with a Twofish key produced by its key generator; then the result is encrypted again with the AES key. In this way, every plaintext part is encrypted, and the identifier corresponding to each part is carried along simultaneously. Because the ciphertext part grouping identities equal those of the plaintext parts, the task can recognize the encoded parts by the identifiers allocated to their plaintext parts. Upon completing its operation, the task sends the encrypted parts to the fourth host for execution. After receiving the ciphertext portions from all Map jobs, the Reduce job arranges the ciphertext parts in sequence according to the ciphertext part grouping identities and obtains the merged ciphertext. Finally, the ciphertext is stored in HDFS, as shown in Fig. 2.
This constitutes the proposed distributed encryption scheme based on the MapReduce framework. In the standard encryption method, a single computer is used to encode the data, whereas in the parallel encryption method multiple servers encrypt and process the data simultaneously. This method is therefore suitable for carrying out encryption operations over huge volumes of plaintext in a manner that is both efficient and fast.
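The sketch below mimics this split-encrypt-merge flow on a single machine, using a process pool to stand in for the parallel Map tasks; hybrid_encrypt_block is the cascade sketched in Section 3.1, TF_KEY and AES_KEY are assumed global keys, and the plaintext length is assumed to be a multiple of the 16-byte block size (padding is omitted for brevity).

from multiprocessing import Pool

def split_numbered(data, part_size=16):
    # Cut the plaintext into sequential parts labeled 1, 2, 3, ...
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    return list(enumerate(parts, start=1))           # (identifier, part)

def encrypt_part(numbered):                          # one "Map" task
    ident, part = numbered
    return ident, hybrid_encrypt_block(part, TF_KEY, AES_KEY)

def merge_ciphertext(numbered_ct):                   # the "Reduce" step
    # Sort by identifier, then concatenate the ciphertext parts
    return b"".join(ct for _, ct in sorted(numbered_ct))

with Pool() as pool:                                 # parallel map phase
    numbered_ct = pool.map(encrypt_part, split_numbered(plaintext))
ciphertext = merge_ciphertext(numbered_ct)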
3.4. Key Management:
The model includes a key management system for securely generating, storing, and distributing the encryption keys for both the Twofish and AES algorithms. Key rotation and revocation mechanisms are implemented to guarantee the security of the encryption keys.
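The paper does not fix a concrete key management API, so the sketch below is only an assumed minimal shape for such a component, covering generation, rotation, and revocation in memory; a production system would instead back this with an HSM or a hardened key store.

import os, time

class KeyManager:
    def __init__(self):
        self._keys = {}                    # key_id -> (key bytes, created at)

    def generate(self, key_id, size=16):
        key = os.urandom(size)             # fresh Twofish or AES key
        self._keys[key_id] = (key, time.time())
        return key

    def rotate(self, key_id):
        return self.generate(key_id)       # replace the key with a new one

    def revoke(self, key_id):
        self._keys.pop(key_id, None)       # a revoked key can no longer be fetched

    def get(self, key_id):
        return self._keys[key_id][0]

km = KeyManager()
tf_key = km.generate("twofish-hdfs")       # illustrative key identifiers
aes_key = km.generate("aes-hdfs")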
3.5. Data Decryption Process:
Encrypted data retrieved from HDFS is decrypted using the corresponding decryption keys. First, the large encoded data set is separated into sequential ciphertext chunks, which are numbered in sequential order: the initial ciphertext part is identified as 1, the next as 2, and so on; in this example the last ciphertext part is identified as 3. The ciphertext parts and the proposed decryption scheme are given to the servers to execute the tasks, and each job is decrypted by the Map-based proposed decryption scheme. In this scheme, the ciphertext part is first decrypted with the AES key and then with the Twofish key. To decode the ciphertext parts and obtain the matching plaintext parts, the task must first recognize the extracted parts according to their ciphertext identifiers; that is, the plaintext part grouping identities and the ciphertext parts are identical. The job is then forwarded to the fourth host for execution. After receiving the plaintext parts from all the Map jobs, the Reduce job arranges the plaintext parts in order according to the plaintext part grouping IDs and obtains the merged plaintext data. Finally, the plaintext data is put into HDFS, as shown in Fig. 3.
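Mirroring the encryption sketch in Section 3.3, the decryption side applies the keys in reverse order and reorders the parts by identifier before merging; hybrid_decrypt_block, TF_KEY, and AES_KEY are the same assumed names as before.

def decrypt_part(numbered):                          # one "Map" task
    ident, ct = numbered
    # The AES layer is removed first, then the Twofish layer
    return ident, hybrid_decrypt_block(ct, TF_KEY, AES_KEY)

def merge_plaintext(numbered_pt):                    # the "Reduce" step
    return b"".join(pt for _, pt in sorted(numbered_pt))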
3.6. Security Mechanisms:
This model incorporates security mechanisms to protect against possible attacks, including data interception, tampering, and unauthorized access. The proposed hybrid encryption model offers a comprehensive approach to securing HDFS data in a Big Data environment, leveraging the strengths of the Twofish and AES algorithms while harnessing the parallel processing capabilities of a MapReduce framework for efficient processing. This scheme also improves the speed of encryption and decryption.
3.7. Pseudocode of Proposed Encryption Scheme
Map1(key, value):
// Generate Twofish key
twofishKey12 = generateTwofishKey1()
// Encrypt data using Twofish
encryptedTwofishData12 = TwofishEncrypt1(value, twofishKey12)
// Generate AES key
aesKey12 = generateAESKey1()
// Encrypt Twofish-encrypted data using AES
encryptedData12 = AESEncrypt(encryptedTwofishData12, aesKey12)
// Emit encrypted data with its associated key
emit(key, encryptedData12)
Reducer1(key, values):
// Combine and sort the encrypted data
aggregate = ""
// Read each encrypted part in key order
for line in sys.stdin:
    lineN = line.strip()
    lineClass1 = EncryptedN(lineN)   // extract the ciphertext part
    aggregate += lineClass1          // append in sorted order
// Print the merged ciphertext
emit(key, aggregate)
3.8. Pseudocode of Proposed Decryption Scheme
Map2(key, values):
// For simplicity, assume only one value per key in this example
encryptedData12 = values[0]
// Retrieve aesKey12 and twofishKey12 from the key management system
// Decrypt the outer layer using AES
decryptedTwofishData12 = AESDecrypt(encryptedData12, aesKey12)
// Decrypt the inner Twofish layer
decryptedData12 = TwofishDecrypt(decryptedTwofishData12, twofishKey12)
// Output decrypted data
emit(key, decryptedData12)
Reducer2(key, values):
// Combine and sort the decrypted data
aggregate = ""
// Read each decrypted part in key order
for line in sys.stdin:
    lineN = line.strip()
    lineClass1 = decryptedN(lineN)   // extract the plaintext part
    aggregate += lineClass1          // append in sorted order
// Print the merged plaintext
emit(key, aggregate)
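For completeness, mapper/reducer pairs of this shape would typically be submitted through Hadoop Streaming roughly as follows; the jar path and script names are placeholders, not the actual deployment used in the experiments.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/data/plaintext \
    -output /user/data/ciphertext \
    -mapper  encrypt_mapper.py \
    -reducer encrypt_reducer.py \
    -files   encrypt_mapper.py,encrypt_reducer.py

The decryption job would be submitted in the same way, with the Map2/Reducer2 scripts and the input and output paths swapped.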