Design of a blockchain-based secure and efficient ontology generation model for multiple data genres using augmented stratification in the healthcare industry

Ontology generation is a process of relationship analysis and representation for multiple data categories using automatic or semi-automatic approaches. The main contribution of this paper is the design of a blockchain-based secure and efficient ontology generation model for multiple data genres using augmented stratification (BOGMAS) that can overcome existing issues. The BOGMAS model uses a semi-supervised approach for ontology generation from almost any structured or unstructured dataset. This model uses a combination of linear support vector machine and extra trees classifiers for variance estimation, which makes the model highly efficient and reduces redundant features in the output ontology. The generated ontology is represented using an incremental OWL (W3C Web Ontology Language) format, which assists in dynamically sizing the ontology depending on incoming data. The performance of the proposed BOGMAS model is evaluated in terms of precision and recall of representation, memory usage, computational complexity, and accuracy of attack detection. It is observed that the proposed model is highly efficient in terms of precision, recall, and accuracy, but has marginally higher computational complexity and delay of ontology formation when compared with existing approaches. Because this increase in delay is small, the proposed model remains applicable to a wide variety of real-time scenarios, including but not limited to medical ontology generation, sports ontology generation, and internet of things ontology generation with high security levels.


Introduction
Ontology generation is a multidomain task that involves database analysis, rule evaluation, attribute checking, class-based analysis, class-level variance-based relationship estimation, ontology acquisition, and post-processing. These tasks require efficient models for the design of each operation, and a combination of these efficient designs results in an effective ontology generation model. A typical model generates ontologies from relational databases, wherein the database schema is available for OWL generation.
Due to the availability of relational data schema, this model can evaluate table-to-table mapping, attribute relationships, hierarchy rules, and attribute-level constraints. Each of these rules is given to an ontology generation model [1], which groups similar database entities together and generates an ontology tree from these cluster relationships. But real-time datasets are generally non-structured and do not provide relationship information due to multiple application-specific issues [2]. To remove this drawback, and to establish entity-based relationships, a wide variety of system models have been proposed by researchers over the years. A review of these system models is discussed in the next section of this text, which will allow readers to evaluate various advantages, nuances, limitations, and future research scopes in these models. Inspired by these observations, Sect. 3 proposes the design of BOGMAS, a blockchain-based secure and efficient ontology generation model for multiple data genres using augmented stratification. This model uses a combination of language processing, clustering, variant feature detection, relationship classification, OWL generation, and blockchain-based security to improve the efficiency of ontology generation.
Finally, this text concludes with some interesting observations about the proposed model and recommends methods to improve its efficiency.

Literature review
A wide variety of algorithmic models are available for ontology generation, and each of these models is applied to different fields for application-specific deployments. For instance, the work in [3,4] proposes industrial ontology generation with domain identification support and biomedical ontology generation (BOG) for Variant Call Formats (VCFs). Applications of these models are observed in [5,6], wherein researchers have used control flow graph generation, and online repository generation using ontology generation models. An advanced model for ontology creation is proposed in [7], wherein researchers can generate ontologies from unstructured property graphs via a deep learning approach. Inspired by this, the work in [8] proposes a multiple-aspect ontology model (MAOM), which assists in improving decision support for human-computer interfaces (HCIs).
Extensions to these models are discussed in [9][10][11][12], wherein a Compact Brainstorm Algorithm (CBSO), personal feedback generation using student ontologies, a blockchain-based time-based protocol (BTBP), and machine learning for art ontology are defined. Similarly, the work in [13][14][15] discusses the use of ontologies in e-commerce applications, the effect of different attacks on ontology systems, and fuzzy ontologies that use blockchain for improved security. Ontologies can be used for multimedia data representation [16], gastroenterology, and other medical fields, including infectious disease ontologies [17,18], transaction ontologies based on smart contracts [19], and health record management ontologies [20], which assist in reducing system dependency on external sources, thereby improving their query performance. Similarly, the work defined in [21][22][23][24] also proposes ontology models for renewable energy sources, keyword-based search applications, eLearning applications, and internet of medical things (IoMT) applications, where high efficiency of system design with minimum system complexity is needed. Efficient models are also proposed in [25][26][27][28], which discuss the use of secure blockchains for real-time scenarios under different applications. Based on these observations, the next section describes the design of a blockchain-based secure and efficient ontology generation model for multiple data genres using augmented stratification.

Proposed blockchain-based secure and efficient ontology generation model for multiple data genres using augmented stratification
From the literature review, it is observed that a wide variety of models are available for ontology generation, and each of these models has its limitations, including limited security for the generated ontology, data redundancy, and limited traceability capabilities. To remove these drawbacks, this text proposes a blockchain-based secure and efficient ontology generation model for multiple data genres using augmented stratification. The proposed model is depicted in Fig. 1, wherein the different datasets and their final ontological classification status are described. From Fig. 1, it is observed that input structured/unstructured data is given to a pre-processing layer, wherein a data-specific feature extraction process is applied. These features are given to a feature selection layer, which uses a combination of a linear support vector machine (LSVM) and an extra trees (ET) classifier to find the most variant features. Extracted features are given to an ontology generation framework, which stores these features along with tagged classes to generate an RDF-based ontology.

Details about BOGMAS
For ontology creation, the BOGMAS model employs a semi-supervised technique that can be applied to almost any structured or unstructured dataset. It reduces the number of redundant numerical features in the dataset by utilizing a variance-based method (VBM), whereas textual characteristics are first transformed to numerical values using a typical word2vec model, and those numerical values are then processed using VBM. This model utilizes a mix of linear support vector machine (LSVM) and extra trees (ET) classifiers for variance estimation. This not only makes the model highly efficient, but also decreases the number of redundant features included in the output ontology. These feature sets, together with their variances, are provided to a correlation engine so that it may estimate connection strengths and produce an ontology. Every ontology record is protected by a blockchain system built on a changeable proof-of-work (PoW) architecture, which helps to infuse the system with transparency, traceability, and the capacity for distributed peer-to-peer processing. An incremental OWL (W3C Web Ontology Language) format is used to express the created ontology; this format assists in dynamically scaling the ontology according to the data received from different sources.
This ontology is stored on a PoW-based blockchain model and persisted to the database after elliptic curve-based encryption. Internal details of each of these models are described in separate sub-sections of this text; readers can refer to these sections to implement the proposed model in parts, depending upon their requirements.

Pre-processing layer design
Initially, all input data is given to a pre-processing layer for feature extraction. Here, numerical features are directly passed to the feature selection layer, while textual features are given to a word2vec model. This model comprises components such as a context builder, a vocabulary builder, and a continuous bag of words (CBoW) engine.
A large number of context-sensitive vocabulary models are available for this purpose. In this work, we use the Bidirectional Encoder Representations from Transformers (BERT) model because of its extensive coverage and reduced dependency on external sources. This block creates a vocabulary from the input data and provides it to the context builder block. These layers further reduce feature redundancy by removing non-action words, which assists in feature reduction and improves the accuracy of the proposed feature extraction model. The extracted features are processed using a 2-layered neural network, where each layer maps input word pairs to their respective features. The result of this model is a single context-sensitive feature vector for the entire sentence. This feature vector is given to a feature selection layer, where variance-based features are extracted.
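To make the pre-processing step concrete, the following is a minimal sketch of context-sensitive sentence vectorization. It uses simple window-based co-occurrence counts and averaging, not the actual BERT/word2vec pipeline the paper describes; the function names, window size, and toy sentences are illustrative assumptions.

```python
def cooccurrence_vectors(sentences, window=2):
    """Toy CBoW-style vectors: vectors[w][c] counts how often context word c
    appears within `window` positions of word w across all sentences."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vectors[w][index[sent[j]]] += 1.0
    return vocab, vectors

def sentence_vector(sent, vectors, dim):
    """Average the word vectors of a sentence into one context-sensitive
    feature vector, standing in for the 2-layered network described above."""
    out = [0.0] * dim
    for w in sent:
        if w not in vectors:
            continue  # skip out-of-vocabulary words
        for k, v in enumerate(vectors[w]):
            out[k] += v
    return [v / max(len(sent), 1) for v in out]

# Hypothetical medical snippets standing in for the textual input data
sents = [["patient", "has", "fever"], ["patient", "has", "cough"]]
vocab, vecs = cooccurrence_vectors(sents)
fv = sentence_vector(sents[0], vecs, len(vocab))
```

A real deployment would replace `cooccurrence_vectors` with embeddings from a trained BERT or word2vec model; the averaging step, however, mirrors how a per-sentence feature vector is obtained from per-word features.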

Feature selection layer design
After features are extracted using the word2vec layer, they are given to a variance-based selection layer. This layer uses a combination of linear SVM and extra trees (ET) classifiers to remove non-variant features from the input dataset. The variance-based method (VBM) alone could have been used here: VBM is a statistical technique for feature reduction in machine learning and data analysis that removes features with low variance, i.e., features whose values are similar across all samples in the dataset. The idea behind this method is that features with low variance are less likely to provide meaningful information for prediction or classification tasks and can therefore be discarded without impacting the performance of the model, which is why VBM is commonly used in pre-processing stages to improve the efficiency and interpretability of predictive models. But the efficiency of VBM on its own is highly limited, due to which both classifiers are used in their standard form and are given per-feature inter-variance values for training and validation scenarios. This value is extracted using Eq. 1, where F_int, m, and n represent the inter-variance value for feature f, the total number of features of the current type, and the total number of other features available in the dataset, respectively. This inter-variance value indicates the variance level of the feature w.r.t. all other features in the dataset and is given to both the ET and SVM classifiers for the estimation of feature redundancy. Parameters for both classifiers, along with the reasons for their selection, are depicted in Table 1. Both classification engines output their own set of features, and the union of these sets is used as the final feature set, which is given to an ontology generation engine, described in the next sub-section of this text.
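The core VBM idea, dropping features whose values barely vary across samples, can be sketched as follows. This illustrates only the plain variance filter discussed above, not the paper's Eq. 1 or the LSVM/ET combination; the threshold and sample data are illustrative assumptions.

```python
def variance(xs):
    """Population variance of a list of numeric values."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def vbm_select(features, threshold):
    """Keep only features whose variance across samples exceeds `threshold`.
    `features` maps a feature name to its list of per-sample values."""
    return {name: vals for name, vals in features.items()
            if variance(vals) > threshold}

# Hypothetical per-patient feature values
data = {
    "heart_rate": [72, 95, 60, 110],  # varies across patients -> informative
    "is_human":   [1, 1, 1, 1],       # constant -> carries no information
}
selected = vbm_select(data, threshold=0.01)
```

In BOGMAS the per-feature variance values feed the LSVM and ET classifiers rather than a fixed threshold, which is what lets the model learn a dataset-specific notion of redundancy.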

Ontology generation process
Upon feature selection, only the non-redundant feature vectors are extracted from the input dataset. These features are given to a correlation engine, which evaluates the relationships between them. The correlation value of each feature w.r.t. the other features (Corr_{F1,F2}) is extracted using Eq. 3. All features with a correlation above the average correlation (AVG_corr) are clubbed together, while the remaining features are stored in a separate group. These groups are combined with entity information, and an output RDF ontology is created using the format described in Table 2.
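The grouping rule above, club features whose pairwise correlation exceeds the average, can be sketched as follows. Pearson correlation stands in for the paper's Eq. 3, which is not reproduced here, so treat the exact correlation measure as an assumption; the feature values are illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def group_by_correlation(features):
    """Club features whose pairwise |correlation| exceeds the average
    pairwise |correlation|; the rest go to an 'ungrouped' set."""
    names = list(features)
    pairs = [(a, b, abs(pearson(features[a], features[b])))
             for i, a in enumerate(names) for b in names[i + 1:]]
    if not pairs:
        return set(), set(names)
    avg = sum(c for _, _, c in pairs) / len(pairs)
    grouped = {n for a, b, c in pairs if c > avg for n in (a, b)}
    return grouped, set(names) - grouped

# Hypothetical feature columns: f2 is a scaled copy of f1, f3 is unrelated
feats = {"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8], "f3": [5, 1, 4, 2]}
grouped, ungrouped = group_by_correlation(feats)
```

Here f1 and f2 correlate perfectly and end up clubbed together, while f3 falls below the average correlation and lands in the separate group, mirroring the AVG_corr split described above.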
In this format, the entity is an application-dependent entry, which can be 'disease type' for medical applications, 'product type' for e-commerce applications, etc. The class represents the category of the feature, the grouped and ungrouped features represent similar and dissimilar feature values respectively, and the timestamp indicates the time at which the entry was generated. The RDF data is given to a blockchain-based model for improved security, which is described in the next sub-section.
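Putting the fields together, one ontology record could be assembled as below. The dictionary layout is only a stand-in for the paper's Table 2, whose exact serialization is not reproduced here, and the example entity and feature names are hypothetical.

```python
import time

def rdf_record(entity, cls, grouped, ungrouped):
    """Assemble one ontology record with the fields described in the text:
    entity, class, grouped/ungrouped features, and a timestamp."""
    return {
        "entity": entity,                 # e.g. 'disease type' for medical data
        "class": cls,                     # category of the feature
        "grouped_features": grouped,      # mutually similar features
        "ungrouped_features": ungrouped,  # dissimilar features
        "timestamp": int(time.time()),    # when the entry was generated
    }

rec = rdf_record("disease type", "cardiac", ["heart_rate", "bp"], ["age"])
```

Each such record would then be serialized into the incremental OWL/RDF representation and handed to the blockchain layer for storage.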

Blockchain-based security model for storage
The RDF data is stored using a blockchain model, which ensures immutability, traceability, distributed processing, and improved trust levels. To store the data in the blockchain, a chain is formed and the following operations are performed:
• Every time a new entry is added to the RDF, a new block is created.
• The following information is added to the block:
  • Source of the input data
  • Timestamp at which this data arrived in the system
  • A random nonce value, which is used to form uniquely identifiable hashes for each block
  • RDF data generated in Sects. 3.1, 3.2, and 3.3
  • The hash value of the previous block (this value is blank for the genesis block)
• Each block is encrypted using the elliptic curve cryptography (ECC) model for improved security.
During the addition of a new block, a random nonce is generated for hash calculation. After the generation of this nonce value, Eq. 5 is evaluated:

Hash = SHA256(Source, Timestamp, RDF data, Nonce)    (5)

If this hash value is already present in any of the blocks, a new random nonce is generated; otherwise, the hash is used for blockchain creation. The block is also encrypted using ECC, where the curve constants are selected based on multiple evaluations of the model, observing the delay for each curve type. The selected curve is a standard secp256 curve, which is proven to have high encryption efficiency. Using this curve and the standard ECC model, each block is encrypted before storage. Due to this, the model is observed to be highly secure and to possess lower delay and better network and representation efficiency when compared with existing approaches.
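The nonce-and-hash procedure in Eq. 5 can be sketched as below. The payload serialization and the nonce width are illustrative assumptions; the ECC encryption step and the previous-block hash linkage are omitted for brevity.

```python
import hashlib
import json
import random

def block_hash(source, timestamp, rdf_data, nonce):
    """Hash = SHA256(Source, Timestamp, RDF data, Nonce), per Eq. 5.
    JSON serialization of the fields is an assumed encoding choice."""
    payload = json.dumps([source, timestamp, rdf_data, nonce]).encode()
    return hashlib.sha256(payload).hexdigest()

def mine_block(source, timestamp, rdf_data, existing_hashes):
    """Draw random nonces until the block's hash collides with no existing
    block, then return the accepted (nonce, hash) pair."""
    while True:
        nonce = random.getrandbits(64)
        h = block_hash(source, timestamp, rdf_data, nonce)
        if h not in existing_hashes:
            return nonce, h

# Hypothetical block: an EEG-derived RDF entry being appended to the chain
chain_hashes = set()
nonce, h = mine_block("EEG-sensor-3", 1700000000, '{"entity": "eeg"}', chain_hashes)
chain_hashes.add(h)
```

The retry loop matches the text: if a freshly drawn nonce produces a hash already present in the chain, a new nonce is drawn; in practice a 64-bit nonce makes such collisions vanishingly rare.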

Result analysis and comparison
The proposed BOGMAS model was evaluated on a wide variety of datasets, including heart disease (http://www.informatics.jax.org/disease/DOID:114), blood reports (https://bioportal.bioontology.org/ontologies), EEG (https://maayanlab.cloud/Harmonizome/gene_set/eeg+abnormality/GWASdb+SNP-Phenotype+Associations), e-commerce (https://www.sciencedirect.com/science/article/pii/S2352340922000968), social media (https://ieee-dataport.org/documents/ontosnaqa-multi-domain-ontology-social-network-analysis), and news datasets (https://www.v7labs.com/open-datasets/visual-sentiment-ontology). These datasets were combined to form a single large dataset, which was then given to the model for RDF generation. A total of 100k values were evaluated, in terms of precision of representation, recall of representation, delay needed for representation, and memory size required for representation. To evaluate the performance w.r.t. standard models, it was compared in terms of accuracy (A), precision (P), recall (R), and delay levels with the BOG [3], MAOM [8], and BTBP [11] approaches for different test set sizes (TSS), and the results were tabulated in Tables 1 and 2. From the accuracy values, it can be observed that the proposed model is 19% more accurate than BOG [3], 35% more accurate than MAOM [8], and 34% more accurate than BTBP [11], which makes it useful for highly secure network applications (Fig. 2). This increase in accuracy is due to the use of blockchain for storage, which reduces the probability of a network attack. Similar observations are made for the precision (P) values, as can be seen in Fig. 3: the proposed model is 15% more efficient than BOG [3], 18% better than MAOM [8], and 29% better than BTBP [11], which makes it useful for high-precision data representation applications.
Similar observations are made for the recall (R) values, as can be seen in Fig. 4.
From the recall values, it can be observed that the proposed model is 16% more efficient than BOG [3], 22% better than MAOM [8], and 31% better than BTBP [11], which makes it useful for high-recall data representation applications. Recall is much higher than in existing models due to the use of variance-based representation, which makes the model highly efficient in real-time scenarios. Similar observations are made for representation delay, as shown in Fig. 5.
From the delay values, it can be observed that the proposed model is 2% slower than existing implementations. As this delay difference is not very large, the model remains applicable for real-time system design.
The model was evaluated on a Proof-of-Stake (PoS) based blockchain that uses smart contracts, but it can be used with any other blockchain type with minimal reconfiguration. Several steps were taken to reduce the possibility of attacks such as the 51% attack on the blockchain network. Hash-rate distribution was increased, because it is more difficult for one entity to control more than 50% of the network's processing power in a more decentralized network with a broader range of miners, and the number of nodes in the system was increased to further protect the network. It is important to note that although these precautions may not entirely remove the possibility of a 51% attack, they reduce it and make a successful strike more difficult for an attacker. In terms of encryption complexity, it was observed that even though the model uses computationally heavy encryption operations, its performance was not degraded, owing to the high-speed feature selection operations. These advantages stem from the use of variance-based representation and the reduced redundancy in the output RDF data representation, which make the model highly effective for real-time applications.

Conclusion and future work
Due to the utilization of a variance-based approach for feature selection and blockchain for security improvement, the proposed model can reduce storage costs and improve the efficiency of data representation. It is observed that the proposed model is 15% more efficient than BOG [3], 18% better than MAOM [8], and 29% better than BTBP [11] in terms of precision of data representation, while it is 16% more efficient than BOG [3], 22% better than MAOM [8], and 31% better than BTBP [11] in terms of recall, which makes it useful for high-recall data representation applications. Similarly, in terms of storage cost, the proposed model is 30% more efficient than BOG [3], requires 28% less space than MAOM [8], and 16% less space than BTBP [11], which is due to the enhanced feature selection capabilities of the system. The security of the proposed model is also very high, due to the use of blockchain for securely storing ontological data. It is observed that the model is 19% more accurate than BOG [3], 35% more accurate than MAOM [8], and 34% more accurate than BTBP [11] against masquerading and Sybil attacks, which makes it useful for highly secure network applications. In the future, researchers can also aim to develop recommendation models based on the proposed approach and estimate their performance in different application scenarios.
Author contributions Conceptualization, SP; methodology, SP; software, Dr. BK; validation, SP, Dr. BK, and AKC; formal analysis, SP; investigation, SP; resources, SP; data curation, Dr. BK; writing-original draft preparation, SP; writing-review and editing, Dr. BK; visualization, AKC; supervision, Dr. BK. SP and Dr. BK wrote the main manuscript text and AKC prepared the graphs. All authors reviewed the manuscript.
Funding There is no funding related to this paper.
Data availability Data will be made available on request.

Conflict of interest
No funding was received to assist with the preparation of this manuscript.
Ethical approval This research did not contain any studies involving animal or human participants, nor did it take place in any private or protected areas.