From the literature review, it is observed that a wide variety of models are available for ontology generation, and each of these models has its own limitations, including limited security for the generated ontology, data redundancy, and limited traceability. To remove these drawbacks, this text proposes a blockchain-based, secure, and efficient ontology generation model for multiple data genres using augmented stratification. The proposed model is depicted in Fig. 2, which shows the different datasets and their final ontological classification status.
From Fig. 2, it is observed that the input structured/unstructured data is given to a pre-processing layer, wherein a data-specific feature extraction process is applied. These features are passed to a feature selection layer, which uses a combination of a linear support vector machine (LSVM) and an extra trees (ET) classifier to find the most variant features. The selected features are then given to an ontology generation framework, which stores them along with their tagged classes in order to generate an RDF-based ontology.
3. Details about BOGMAS
For the purpose of ontology creation, the BOGMAS model employs a semi-supervised technique that can be applied to almost any structured or unstructured dataset. It reduces the number of redundant numerical features in the dataset using a variance-based method (VBM), whereas textual attributes are first transformed into numerical values using a standard word2vec model and then processed with the same VBM. The model uses a combination of linear support vector machine (LSVM) and extra trees (ET) classifiers for variance estimation, which not only makes it highly efficient but also decreases the number of redundant features included in the output ontology. These feature sets, together with their variances, are provided to a correlation engine, which estimates connection strengths and produces the ontology. Every ontology record is protected by a blockchain built on a configurable proof-of-work (PoW) architecture, which infuses the system with transparency, traceability, and the capacity for distributed peer-to-peer processing. The created ontology is expressed in an incremental OWL (W3C Web Ontology Language) format, which helps scale the ontology dynamically in accordance with the data received from different sources.
This ontology is stored on a PoW-based blockchain and persisted in the database after elliptic-curve-based encryption. Internal details of each of these models are described in separate sub-sections of this text, which readers can refer to in order to implement the proposed model in parts, depending upon their requirements.
3.1. Pre-processing layer design
Initially, all input data is given to a pre-processing layer for feature extraction. Numerical features are passed directly to the feature selection layer, while textual features are given to a word2vec model. This model is depicted in Fig. 3, wherein components such as the context builder, vocabulary builder, and continuous bag of words (CBoW) engine are defined.
A large number of context-sensitive vocabulary models are available for this purpose. In this work, the Bidirectional Encoder Representations from Transformers (BERT) model is used because of its extensive coverage and reduced dependency on external sources. This block creates a vocabulary from the input data and provides it to the context builder block, which generates word pairs and finds neighbourhood combinations from these pairs. These combinations are counted and given to a numerical layer that evaluates the number of occurrences of each pair. These occurrence values are treated as initial word2vec features and are evaluated using backpropagation, hierarchical softmax, and negative sampling layers. These layers further reduce feature redundancy by removing non-action words, which assists in feature reduction and improves the accuracy of the proposed feature extraction model. The extracted features are processed using a 2-layered neural network, where each layer maps input word pairs to their respective features. The result of this model is a single context-sensitive feature vector for the entire sentence, which is given to a feature selection layer where variance-based features are extracted.
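For illustration, the sentence-level feature extraction described above can be approximated with a standard word2vec implementation. The following minimal sketch assumes gensim's Word2Vec (CBoW with hierarchical softmax and negative sampling) as a stand-in for the layers named above; the corpus, vector size, and window are illustrative values rather than parameters specified by the proposed model.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of pre-tokenized words.
corpus = [
    ["patient", "reports", "chronic", "chest", "pain"],
    ["product", "shipped", "with", "damaged", "packaging"],
]

# sg=0 selects the CBoW architecture; hs=1 enables hierarchical softmax and
# negative=5 enables negative sampling, mirroring the layers named above.
w2v = Word2Vec(
    sentences=corpus, vector_size=64, window=2,
    min_count=1, sg=0, hs=1, negative=5, epochs=50,
)

def sentence_vector(tokens):
    """Average the word vectors to obtain one context-sensitive feature
    vector per sentence, as required by the feature selection layer."""
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

features = np.vstack([sentence_vector(s) for s in corpus])
print(features.shape)  # (number of sentences, 64)
```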
3.2. Feature selection layer design
After features are extracted by the word2vec layer, they are given to a variance-based selection layer. This layer uses a combination of a linear SVM and an extra trees (ET) classifier to remove non-variant features from the input dataset. Both classifiers are used in their standard form and are given a per-feature intra-variance value for training and validation. This value is extracted using Eq. 1 as follows,
$${F}_{int}=\sqrt{\frac{\sum _{a=1}^{m}{\left({f}_{a}-\frac{\sum _{i=1}^{m}\sqrt{\frac{\sum _{j=1}^{n}{\left({f}_{j}-\frac{\sum _{k=1}^{n}{f}_{k}}{n}\right)}^{2}}{n-1}}}{m}\right)}^{2}}{m-1}}\dots \left(1\right)$$
Where \({F}_{int}\), \(m\), and \(n\) represent the intra-variance value for feature \(f\), the total number of features of the current type, and the total number of other features available in the dataset, respectively. This intra-variance value indicates the variance level of the feature with respect to all other features in the dataset, and is given to both the ET and SVM classifiers for estimation of feature redundancy. The parameters of both classifiers, along with the reasons for their selection, are depicted in Table 1 as follows,
Table 1
Parametric values for each classifier
Classifier | Parameters | Reason for selection
---|---|---
SVM | Kernel: linear; tolerance = 0.1%; decision type = one-vs-rest | A linear kernel is used to evaluate feature-to-feature variance, accompanied by a low error tolerance between the current feature and the rest of the evaluated features.
Extra Trees | Number of estimators = 10 × number of features; split criterion = Gini impurity; random state = random(0, number of features); class weights = intra-variance of all features | The extra trees classifier is trained w.r.t. the number of available features and is allowed to shift to any feature vector for variance checking. Gini impurity is used for one-to-one mapping, while class weights are initialized with the intra-variance between features, which reduces dependency on default tree weights and estimates redundancy more efficiently.
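For clarity, one possible NumPy reading of the per-feature intra-variance in Eq. 1 is sketched below, in which the inner term is interpreted as the mean sample standard deviation of the remaining features; the variable names and this interpretation are ours and should be adapted to the dataset at hand.

```python
import numpy as np

def intra_variance(X, idx):
    """X: (samples, features) matrix; idx: column whose F_int is computed."""
    f = X[:, idx]
    others = np.delete(X, idx, axis=1)
    # Inner summation of Eq. 1: mean of the sample standard deviations
    # of all other features.
    mean_other_std = np.mean(np.std(others, axis=0, ddof=1))
    # Outer summation of Eq. 1: spread of the current feature's values
    # around that mean.
    m = f.shape[0]
    return np.sqrt(np.sum((f - mean_other_std) ** 2) / (m - 1))

X = np.random.rand(100, 8)                       # toy dataset
f_int = [intra_variance(X, i) for i in range(X.shape[1])]
```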
Both these classification engines output their own set of features. A union of these features is used as the final feature vector, and can be obtained using Eq. 2 as follows,
$${F}_{out}={F}_{out}\left(SVM\right)\cup {F}_{out}\left(ET\right)\dots \left(2\right)$$
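For illustration, this selection step can be sketched with scikit-learn as follows: the LSVM and extra trees classifiers are wrapped in SelectFromModel, and the union of their selected feature indices realizes Eq. 2. Parameter values follow Table 1 where possible; X and y are placeholders for the extracted feature matrix and the tagged classes mentioned earlier.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X = np.random.rand(200, 32)                # placeholder feature matrix
y = np.random.randint(0, 3, size=200)      # placeholder class tags

# Linear SVM with one-vs-rest decisions and low tolerance (Table 1).
svm = LinearSVC(tol=1e-3, dual=False)
svm_selector = SelectFromModel(svm).fit(X, y)

# Extra trees with 10 estimators per feature and Gini impurity (Table 1).
et = ExtraTreesClassifier(n_estimators=10 * X.shape[1], criterion="gini")
et_selector = SelectFromModel(et).fit(X, y)

# Union of the two selected index sets (Eq. 2).
selected = np.union1d(
    np.where(svm_selector.get_support())[0],
    np.where(et_selector.get_support())[0],
)
F_out = X[:, selected]
```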
This combined feature set is given to an ontology generation engine, which is described in the next sub-section of this text.
3.3. Ontology generation process
Upon feature selection, only the non-redundant feature vectors are retained from the input dataset. These features are given to a correlation engine, which evaluates the relationships between them. The correlation value of each feature with respect to another feature (\({Corr}_{{F}_{1}{F}_{2}}\)) is extracted using Eq. 3,
$${Corr}_{{F}_{1}{F}_{2}}=\frac{\sum _{i=1}^{{N}_{f}}\left({F}_{1{i}_{int}}-{F}_{2{i}_{int}}\right)}{\sqrt{\sum _{i=1}^{{N}_{f}}{\left({F}_{1{i}_{int}}-{F}_{2{i}_{int}}\right)}^{2}}}\dots \left(3\right)$$
Where \({F}_{1{i}_{int}}\) and \({F}_{2{i}_{int}}\) represent the intra-variance values of the compared features, estimated using Eq. 1, while \({N}_{f}\) represents the number of extracted features for the given comparison. The correlation value for each feature pair is extracted, and the average correlation is evaluated using Eq. 4,
$$AV{G}_{corr}=\frac{\sum _{i=1}^{{N}_{f}}\sum _{j=1}^{{N}_{f}}Cor{r}_{i,j}}{{N}_{f}^{2}}\dots \left(4\right)$$
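A brief NumPy sketch of Eqs. 3 and 4 is given below, together with a simple thresholding against \(AV{G}_{corr}\) that anticipates the grouping step described next; the per-feature intra-variance profiles, their shapes, and the thresholding rule are illustrative assumptions.

```python
import numpy as np

def pair_correlation(f1_int, f2_int):
    """Eq. 3: normalized difference between two intra-variance profiles."""
    diff = f1_int - f2_int
    denom = np.sqrt(np.sum(diff ** 2))
    return np.sum(diff) / denom if denom > 0 else 0.0

def average_correlation(F_int):
    """Eq. 4: mean of Corr(i, j) over all N_f x N_f feature pairs."""
    n_f = F_int.shape[0]
    corr = np.array([
        [pair_correlation(F_int[i], F_int[j]) for j in range(n_f)]
        for i in range(n_f)
    ])
    return corr, corr.sum() / (n_f ** 2)

F_int = np.random.rand(6, 10)               # toy intra-variance profiles
corr_matrix, avg_corr = average_correlation(F_int)
grouped = np.where(corr_matrix.mean(axis=1) > avg_corr)[0]
ungrouped = np.setdiff1d(np.arange(F_int.shape[0]), grouped)
```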
All features with correlation greater than \(AV{G}_{corr}\) are clubbed together, while the remaining features are stored in a separate group. These groups are combined with entity information, and an output RDF ontology is created using the format described in Table 2 as follows,
Table 2
RDF format used for ontology formation
Entity | Class | Grouped Features | Ungrouped Features | Timestamp
In this format, the entity is an application-dependent entry, which can be ‘disease type’ for medical applications, ‘product type’ for e-commerce applications, and so on. Class represents the category of this feature, grouped and ungrouped features represent similar and dissimilar feature values, respectively, and Timestamp indicates the time at which the entry was generated. The RDF data is given to a blockchain-based model to improve security, which is described in the next sub-section.
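For illustration, one RDF record following the Table 2 layout can be serialized with rdflib as sketched below; the namespace, entity, and property names are placeholders chosen for this example and are not identifiers defined by the proposed model.

```python
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

BOG = Namespace("http://example.org/bogmas#")    # assumed namespace
g = Graph()
g.bind("bog", BOG)

entity = BOG["disease_type/diabetes"]            # application-dependent entity
g.add((entity, RDF.type, BOG.OntologyRecord))
g.add((entity, BOG.hasClass, Literal("metabolic_disorder")))
g.add((entity, BOG.groupedFeatures, Literal("glucose_level, insulin_level")))
g.add((entity, BOG.ungroupedFeatures, Literal("patient_age")))
g.add((entity, BOG.timeStamp,
       Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTime)))

rdf_data = g.serialize(format="turtle")          # passed to the blockchain layer
```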
3.4. Blockchain based security model for storage
The RDF data is stored using a blockchain model, which ensures immutability, traceability, distributed processing, and improved trust levels. To store the data on the blockchain, a chain similar to the one depicted in Fig. 4 is formed, and the following operations are performed:
- Every time a new entry is added to the RDF, a new block is created.
- The following information is added to the block (a minimal block-record sketch follows this list):
  - Source of the input data
  - Timestamp at which this data arrived in the system
  - A random nonce value, which is used to form uniquely identifiable hashes for each block
  - RDF data generated in Sections 3.1, 3.2, and 3.3
  - Hash value of the previous block (this value is blank for the genesis block)
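The block record implied by this list can be sketched as a minimal data structure, as shown below; the field names are placeholders chosen here and are not fixed by the proposed model.

```python
from dataclasses import dataclass

@dataclass
class BlockRecord:
    source: str      # source of the input data
    timestamp: str   # time at which the data arrived in the system
    nonce: int       # random nonce giving each block a uniquely identifiable hash
    rdf_data: str    # RDF ontology data from Sections 3.1-3.3
    prev_hash: str   # hash of the previous block ("" for the genesis block)
```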
During the addition of a new block, a random nonce is generated for hash calculation. After this nonce value is generated, the following Eq. 5 is evaluated,
$$Hash=SHA256\left(Source,\ Timestamp,\ RDF\ data,\ Nonce\right)\dots \left(5\right)$$
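For illustration, this hash computation, including the nonce regeneration described next, can be sketched as follows; the field concatenation order and separator are assumptions made for this example.

```python
import hashlib
import secrets

def block_hash(source, timestamp, rdf_data, nonce):
    """Eq. 5: SHA-256 over the block contents and the nonce."""
    payload = f"{source}|{timestamp}|{rdf_data}|{nonce}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def unique_block_hash(source, timestamp, rdf_data, existing_hashes):
    """Regenerate the random nonce until the resulting hash is not already
    present in any existing block."""
    while True:
        nonce = secrets.randbits(64)
        h = block_hash(source, timestamp, rdf_data, nonce)
        if h not in existing_hashes:
            return nonce, h
```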
If this hash value is already present in any of the existing blocks, a new random nonce is generated; otherwise, the hash is used for block creation. The block is also encrypted using ECC, where the following encryption curve is used,
$${y}^{2}={x}^{3}+5x+4\dots \left(6\right)$$
Here, the curve constants were selected based on multiple evaluations of the model and observation of the delay for each curve type. The curve follows the short Weierstrass form used by standard secp256 curves and offers high encryption efficiency. Its nature can be observed in Fig. 5. Using this curve and a standard ECC model, each block is encrypted before storage, due to which the model is observed to be highly secure while possessing lower delay and better network and representation efficiency than existing approaches. This evaluation is performed in the next section of this text, wherein the proposed BOGMAS model is compared with the BOG [3], MAOM [8], and BTBP [11] models, which have similar representation capabilities.
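Before moving to the evaluation, a toy sketch of point addition and scalar multiplication on the curve of Eq. 6 over a small prime field is given below; the prime and keys are purely illustrative, and a deployment would rely on a vetted 256-bit curve and an established ECC library rather than this code.

```python
P = 9739          # small demonstration prime, not the field used by the model
A, B = 5, 4       # curve constants from Eq. 6

def ec_add(p1, p2):
    """Add two points on the curve (None is the point at infinity)."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def ec_mul(k, point):
    """Scalar multiplication by double-and-add."""
    result = None
    while k:
        if k & 1:
            result = ec_add(result, point)
        point = ec_add(point, point)
        k >>= 1
    return result

G = (0, 2)                            # on the curve, since 2**2 == 0**3 + 5*0 + 4
private_key = 1234
public_key = ec_mul(private_key, G)   # basis for an ECIES-style block encryption
```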