Figure 1 shows the architecture of the proposed system. In this architecture, a virtual file system layer is integrated as a wrapper on top of the original HDFS, without altering the HDFS architecture itself. Client applications interact with the system through the VFS-HDFS API. The virtual file system layer maintains a Virtual File Table; one file table is kept per container (i.e., per category). Each file table records the file name, offset, length, and category of every file stored on the VFS. A file table is organized as a linked list of buckets, where each bucket refers to the previous one. File entries in newly added buckets take precedence over entries in previous buckets, which allows the contents of already stored files to be modified.
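The wrapper can be pictured as a thin API in front of HDFS. The following Java interface is a minimal sketch of what the client-facing VFS-HDFS API could look like; the method names and signatures are assumptions, as the text does not specify the interface:

```java
import java.io.IOException;

/** Hypothetical client-facing API of the virtual file system layer.
    All names here are illustrative; the paper does not define this interface. */
public interface VfsHdfs {
    /** Classify the file, append it to its category's container, and
        record (file name, offset, length, category) in the file table. */
    void store(String fileName, byte[] contents) throws IOException;

    /** Resolve the file through the bucket chain (newest bucket first)
        and read it back from the container at the recorded offset. */
    byte[] read(String fileName) throws IOException;

    /** Mark the file as deleted; space is reclaimed later by pruning. */
    void delete(String fileName) throws IOException;
}
```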
Need for Classifiers?
Files with similar content have a high probability of being used by the same client. To group similar files (according to their category) in the same container, different text classifiers are used. For this experiment, Naïve Bayes, Random Forest, J48, and ensemble classifiers are used.
Motivation to Use Ensemble Classification?
No single classifier was found to be sufficient for classifying text content; therefore, a combination of multiple classifiers is used.
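As an illustration, such an ensemble can be assembled with Weka's Vote meta-classifier over the three base classifiers named above. This is a minimal sketch, assuming Weka is on the classpath and a labelled training set is available (train.arff is a placeholder path):

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CategoryEnsemble {
    public static void main(String[] args) throws Exception {
        // Load labelled training documents (train.arff is a placeholder).
        Instances train = DataSource.read("train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Vote combines the three base classifiers named in the text;
        // by default it averages their class-probability estimates.
        Vote ensemble = new Vote();
        ensemble.setClassifiers(new Classifier[] {
                new NaiveBayes(), new RandomForest(), new J48() });
        ensemble.buildClassifier(train);

        // Predict the category of the first training instance as a demo.
        double label = ensemble.classifyInstance(train.instance(0));
        System.out.println(train.classAttribute().value((int) label));
    }
}
```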
Bucket Chain Methodology:
In this system, a single file called the File Table stores all of the files' metadata. The tables in this file are linked together to form a table chain. Another file, known as a container file, stores the actual contents of the files; there is one container file per category. Both the file table and the bucket file table record the file name, offset, and length, along with other file metadata.
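A minimal sketch of these structures in Java, assuming one entry per stored file and a backward-linked bucket chain (all class and field names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** One file table entry: where a file lives inside its category's
    container file. Field names are illustrative. */
class FileEntry {
    final String fileName;
    final long offset;      // byte offset inside the container file
    final long length;      // size of the stored contents in bytes
    final String category;

    FileEntry(String fileName, long offset, long length, String category) {
        this.fileName = fileName;
        this.offset = offset;
        this.length = length;
        this.category = category;
    }
}

/** A bucket holds the entries written during one run; each bucket
    links back to the previous one, forming the bucket chain. */
class Bucket {
    final Bucket previous;                         // older bucket, or null
    final Deque<FileEntry> entries = new ArrayDeque<>();

    Bucket(Bucket previous) { this.previous = previous; }

    /** Newer entries are placed at the front so they shadow older ones. */
    void add(FileEntry e) { entries.addFirst(e); }

    /** Look a file up, newest entry first, following the chain backwards. */
    FileEntry find(String fileName) {
        for (FileEntry e : entries)
            if (e.fileName.equals(fileName)) return e;
        return previous == null ? null : previous.find(fileName);
    }
}
```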
As a file is being stored, the category to which it belongs is determined using the ensemble classifier. The file is then saved in the container for that category, and the corresponding file table is updated.
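The store path can then be sketched as: classify, append to the category's container, and record the new entry. The sketch below uses the HDFS FileSystem API and reuses the FileEntry class from the previous sketch; the container path and the classify() placeholder are assumptions:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StorePath {
    /** Append a small file to its category's container file and record
        its location in the in-memory file table. classify() stands in
        for the ensemble classifier; paths are illustrative. */
    static void store(FileSystem fs, Map<String, FileEntry> fileTable,
                      String name, byte[] contents) throws IOException {
        String category = classify(contents);
        Path container = new Path("/vfs/containers/" + category + ".bin");
        try (FSDataOutputStream out = fs.exists(container)
                ? fs.append(container) : fs.create(container)) {
            long offset = out.getPos();      // start offset of this file
            out.write(contents);
            fileTable.put(name,
                    new FileEntry(name, offset, contents.length, category));
        }
    }

    static String classify(byte[] contents) {
        return "general";                    // placeholder for the ensemble
    }
}
```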
Each category's file table is loaded into memory at launch. The most recent file table is located first and its contents are added to the in-memory file table; the next (older) file table is then located, and the process repeats until the first file table is reached. During this process, if the metadata for a file is already present in the in-memory file table, the older entry is skipped, since it has been superseded by an update or deletion of the file.
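A sketch of this merge, reusing the Bucket and FileEntry classes from above: walking the chain from the newest bucket, putIfAbsent keeps the newest entry for each file name, so superseded entries are skipped (deletion tombstones are omitted for brevity).

```java
import java.util.HashMap;
import java.util.Map;

public class FileTableLoader {
    /** Rebuild the in-memory file table for one category by walking the
        bucket chain from the newest bucket to the oldest. */
    static Map<String, FileEntry> load(Bucket newest) {
        Map<String, FileEntry> table = new HashMap<>();
        for (Bucket b = newest; b != null; b = b.previous) {
            for (FileEntry e : b.entries) {
                // Keep the newest entry per file name; older ones are skipped.
                table.putIfAbsent(e.fileName, e);
            }
        }
        return table;
    }
}
```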
Each run creates a fresh in-memory file table for each category to keep track of newly added or modified files. All metadata creation and update operations are carried out on this in-memory file table in real time, while the actual contents of the files reside in the container file. On termination, the contents of the in-memory file table are appended to the file table in HDFS. This establishes a chain of buckets, each holding a list of file table entries, which makes updates and deletions of stored files possible.
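On termination, the in-memory table for a category could be written out as a new bucket file roughly as follows; the path layout and the tab-separated line format are assumptions, as the text specifies no encoding:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileTableFlusher {
    /** Persist the in-memory file table for one category as a new
        bucket file in HDFS on shutdown. */
    static void flush(FileSystem fs, String category,
                      Map<String, FileEntry> table) throws IOException {
        // Timestamped name so the newest bucket can be found first at launch.
        Path bucket = new Path("/vfs/tables/" + category + "/"
                + System.currentTimeMillis() + ".tbl");
        try (FSDataOutputStream out = fs.create(bucket)) {
            for (FileEntry e : table.values()) {
                out.writeBytes(e.fileName + "\t" + e.offset + "\t"
                        + e.length + "\t" + e.category + "\n");
            }
        }
    }
}
```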
Pruning is a method for reclaiming the space occupied by updated or deleted files. It creates fresh copies of the file tables while omitting the changed or removed entries, so that the file tables and containers take up less space.
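A minimal in-memory illustration of pruning, assuming the merged table contains only live entries: live files are copied into a fresh container with new offsets, and a pruned table is built to replace the old bucket chain. The readFromContainer() helper is a placeholder for a positional read from the old container:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Pruner {
    /** Compact one category: copy only live files into a fresh container,
        assigning new offsets, and build the pruned file table. */
    static Map<String, FileEntry> prune(Map<String, FileEntry> live,
                                        ByteArrayOutputStream newContainer)
            throws IOException {
        Map<String, FileEntry> pruned = new HashMap<>();
        for (FileEntry e : live.values()) {
            byte[] data = readFromContainer(e);       // old offset/length
            long newOffset = newContainer.size();     // next free position
            newContainer.write(data);
            pruned.put(e.fileName, new FileEntry(
                    e.fileName, newOffset, e.length, e.category));
        }
        return pruned;
    }

    static byte[] readFromContainer(FileEntry e) {
        return new byte[(int) e.length];              // placeholder read
    }
}
```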
Caching reduces the time needed to read files from previously fetched blocks. Every cache entry is a tuple containing the category, file name, location, length, and the entire block content. This speeds up reads at the cost of additional RAM [6].
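One way to realise such a cache is an LRU map keyed by category and block location. The entry fields below follow the tuple described in the text; the LRU eviction policy and the capacity bound are assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A small LRU block cache keyed by category and block location. */
class BlockCache extends LinkedHashMap<String, BlockCache.Entry> {
    static class Entry {
        final String category, fileName;
        final long offset, length;
        final byte[] blockContent;        // whole block kept in RAM
        Entry(String category, String fileName, long offset,
              long length, byte[] blockContent) {
            this.category = category;
            this.fileName = fileName;
            this.offset = offset;
            this.length = length;
            this.blockContent = blockContent;
        }
    }

    private final int maxEntries;

    BlockCache(int maxEntries) {
        super(16, 0.75f, true);           // access order enables LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Entry> eldest) {
        return size() > maxEntries;       // evict the least recently used block
    }
}
```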