Analysis Crystal Structure of Sars-cov-2 Nsp3 Macrodomain Based on Optimal Multi Level of Deep Neurocomputing Technique

To improve the analysis of the crystal structure of the SARS-CoV-2 NSP3 macrodomain, this work establishes a new deep learning neural network architecture called DLSTM. It combines a novel meta-heuristic optimization algorithm called Lion-AYAD and a deterministic structure network (DSN) with a determined set of rules (Knowledge Constructions, KC) for the generation of each protein from tRNA synthesis, based on the location of each component (i.e., U, C, G and A) in the tRNA triplets, together with other KC related to SMILES structures. LSTM is a deep learning algorithm (DLA) of the neurocomputing type with a specific feature not found in other DLAs, namely memory, and it has proven its ability to give highly accurate results in prediction problems. On the other hand, LSTM requires many parameters to be determined by trial and error and has high computational complexity. This work attempts to close this gap by suggesting a new tool that determines the network structure and parameters through a single optimization algorithm, Lion-AYAD, which searches for the optimal objective function, number of hidden layers, number of nodes in each layer and weights for the four gate units in each layer; the result is called the DSN. Training the bidirectional DLSTM on DNA sequences to generate proteins gives very pragmatic results in determining which proteins are active and which are inactive in SARS-CoV-2 infection. Training the bidirectional DLSTM on SMILES to analyze the crystal structure of the SARS-CoV-2 NSP3 macrodomain achieves a very high reconstruction rate of 95% on the test-set molecules. In general, Lion-AYAD is an optimization algorithm that determines the set of rules for avoiding incorrect interactions of materials. Finally, the KC are added: four rules applied during the synthesis of each tRNA triplet to generate proteins, and five rules applied during the synthesis of each SMILES structure.

main parameters for successfully reaching the goal. Therefore, these limitations (i.e., the high computational and time complexity of that algorithm) need to be avoided. Complexity is one of the main characteristics of a DNA sequence, and for a programmer it is very difficult to extract useful knowledge from it or to work directly on that sequence. To avoid those limitations, an efficient technique is needed that automatically splits the sequence into multiple subsequences and finds the frequency of each subsequence in order to extract useful proteins from it.
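The splitting and frequency-counting idea can be illustrated with a minimal Python sketch. It is not the Lion-AYAD procedure itself: the fixed window length of three, the non-overlapping windows, and the function names are assumptions made only for illustration.

```python
from collections import Counter


def split_into_subsequences(sequence, k=3):
    """Split a sequence into consecutive, non-overlapping windows of length k.

    Assumption: fixed-length windows; a trailing remainder shorter than k is dropped.
    """
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]


def subsequence_frequencies(sequence, k=3):
    """Count how often each subsequence occurs."""
    return Counter(split_into_subsequences(sequence, k))


# Hypothetical mRNA fragment used only to exercise the functions.
mrna = "AUGGCCAUGGCCUAA"
frequencies = subsequence_frequencies(mrna)

# Buffer holding each distinct subsequence once, in first-seen order.
distinct_buffer = list(frequencies)
```

Later stages can then work on `distinct_buffer` instead of the full sequence, which is where the saving in computation comes from.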

Related Works
Predicting proteins from DNA synthesis is a vital topic that is directly related to people's lives and to the continuation of healthy life in general. The aim of this research is to find a modern predictive approach for this type of data, which is huge and belongs to the field of data series. This section reviews several previous works that deal with the same problem and analyzes them from seven points. Suprativ Saha and Tanmay Bhattacharya (2020) [ Suprativ Sahaa, Tanmay Bhattacharyab 2020] designed a new hybrid algorithm that combines fuzzy logic and a neural network to classify an unknown protein sequence into its family in four stages. In the first stage, a mean value is calculated as a feature from the unknown input protein sequence by applying the "6-letter Exchange Group Method". The second stage builds a decision tree in which each protein group is a branch, so that an unknown protein can be classified based on its features. The third stage applies an encoding method called "2-gram encoding", which identifies the occurrences of two consecutive amino acids in a sequence. In the final stage, neighborhood analysis is applied. The work uses accuracy and time to measure the performance of the algorithm. The study was run on 497 different test sequences and showed a high accuracy level (90%) with a low execution time (192 ms, compared with an original execution time of 1704 ms). Our work is similar to this work in the idea of predicting proteins and in the evaluation measures, but differs in the method used to discover proteins, which is based on intelligent data analysis.
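To make the 2-gram encoding step concrete, the following Python sketch counts every pair of consecutive amino acids in a sequence. It is only an illustration of the general idea; the function name and the sample sequence are hypothetical and are not taken from the cited work.

```python
from collections import Counter


def two_gram_counts(protein):
    """Count occurrences of each pair of consecutive amino acids (overlapping 2-grams)."""
    return Counter(protein[i:i + 2] for i in range(len(protein) - 1))


# Hypothetical amino-acid string in one-letter code.
counts = two_gram_counts("MKVLMKV")
```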
Asad Khan et al. (2020) [AsadKhan 2020] proposed a new method, called the m6A-pred predictor, to predict the existence of m6A in RNA sequences. The method uses statistical and chemical properties of nucleotides and a random forest classifier that predicts m6A by identifying discriminative features. The work uses accuracy and Matthews correlation coefficient values to measure the performance of the algorithm. The study reports a high accuracy level (78.58%) with Matthews correlation coefficient values (79.65%) of 0.5717. Our work is similar to this work in the evaluation measures, but differs in the method used to discover proteins, which is based on intelligent data analysis, and in the techniques used.
Prerna Sharma et al. (2020) [ Prerna Sharma 2020] designed a novel meta-heuristic algorithm called improved grey wolf optimization (IGWO), a new version of the traditional grey wolf optimization (GWO) that is used for feature selection. It was used to predict protein structure with four machine learning classifiers. The work uses accuracy to measure the performance of the algorithm. The study predicted protein structure with an artificial neural network classifier with an accuracy of about 91%. Our work is similar to this work in the idea of predicting proteins from RNA sequences, in the evaluation measures and in the use of an optimization method, but differs in the method used to discover proteins, which is based on intelligent data analysis.
Md. Rafiqul Islam et al. (2020) [Md. Rafiqul Islam 2020] designed a new method for detecting and optimizing protein folding, based on the population-based meta-heuristic chemical reaction optimization (CRO) and the hydrophobic-polar (HP) cubic lattice model, which is responsible for increasing the performance of the algorithm. The work is implemented in four stages: decomposition, on-wall ineffective collision, synthesis and inter-molecular ineffective collision. After these stages, the algorithm applies a repair mechanism that transforms invalid solutions into valid ones by removing overlaps in the cubic lattice points. The work uses accuracy to measure the performance of the algorithm. Our work is similar to this work in the idea of predicting proteins from RNA sequences, in the evaluation measures and in the use of an optimization method, but differs in the method used to discover proteins, which is based on intelligent data analysis.
Imran Ahmed and Gwanggil Jeon (2021) [Imran Ahmed1, Gwanggil Jeon 2021] implemented a new model based on artificial intelligence to perform genome sequence analysis of humans infected by COVID-19 and by other viruses similar to COVID-19, such as SARS, MERS (Middle East respiratory syndrome) and Ebola. The system helps to obtain important information from the genome sequences of different viruses. This is done by extracting information on COVID-19 and performing a comparative data analysis against the original RNA sequences to detect the genes containing the virus and their frequency by counting amino acids. At the end of the method, a machine learning classifier, the support vector machine, is used to classify the different genome sequences. The work uses accuracy to measure the performance of the algorithm. The study achieved a high accuracy level (97%) for COVID-19 and (95%) for [...]. Our work is similar to this work in the idea of predicting proteins, in the evaluation measures and in being an AI-based method, but differs in the dataset used.
Shehu et al. (2021) [ Shehu 2021] designed a novel deep learning algorithm, named Deep PPF, to discover protein families. The method discovers the best features from the sequence and uses word2vec to capture distributional dependencies among nucleotides on the cluster of orthologous groups (COG) and phage orthologous groups (POG) datasets. The work uses accuracy and Matthews correlation coefficients to measure the performance of the algorithm. The study shows a high accuracy level and Matthews correlation coefficients of (97.62%), (88.45%) and (83.09%) for the family, subfamily and sub-subfamily hierarchical levels, respectively. Our work is similar to this work in the idea of predicting proteins and in the evaluation measures, but differs in the method used to discover proteins, which is based on intelligent data analysis.
Wang et al. (2021) [Wang 2021] proposed an ML model called (light GBG) to predict the formulation of mRNA that delivers vaccines, applying computational methods to speed up the design of the lipid nanoparticles (LNP) used for vaccine delivery. The proposed method uses R to measure the performance of the algorithm, and the authors found that it performs well (R > 0.87). Compared with our work, it differs in the evaluation measure and in the idea of predicting proteins based on intelligent data analysis.

Methodology
The methodology consists of three major stages, shown in Algorithm 1 and Figure 5. The objectives of this work are to study, analyze and suggest a solution for the main problems related to generating proteins from a DNA sequence. These problems include:
▪ Investigating the capability of LION-AYAD to split the DNA sequence into multiple subsequences after converting it into an mRNA sequence, and to compute the frequency of each subsequence. In addition, the time of this process is reduced by generating a buffer that contains only the distinct subsequences, which is then used to build the protein prediction model and to set a collection of rules that determine which proteins affect the progression of the disease by being passive or active.
▪ LSTM is a deep learning algorithm of the neurocomputing type with a specific feature not found in other deep learning algorithms, namely memory, and it has proven its ability to give highly accurate results in prediction problems. On the other hand, LSTM requires many parameters to be determined by trial and error and has high computational complexity, as explained in Table 1.

Algorithm 1 (only the following steps are recoverable):
For i in range(1 to Max_iteration)
    For each node of hidden layer h in H do
        Calculate the fitness function as in equation (1)
    [...]
    For the number of hidden layers h in H do
        Calculate the hidden layers velocity as in equation (3)
        Calculate the hidden layers position as in equation (4)
    End for
    gnBest = best n in N
    For the number of nodes in each hidden layer n in N do
        Calculate the velocity of nodes in each hidden layer as in equation (5)
        Calculate the position of nodes in each hidden layer as in equation (6)
    End for
    gwBest = best w in W
    For random weights w in W do
        Calculate the velocity of weight in the memory cell and each gate as in equation (7)
        Calculate the position of weight in the memory cell and each gate as in equation (8)
    End for
    gbBest = best b in B
    For random bias b in B do
        Calculate the velocity of bias in the memory cell and each gate as in equation (9)
        Calculate the position of bias in the memory cell and each gate as in equation (10)
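The velocity and position updates named in the algorithm can be sketched as follows. Equations (3) to (10) are not reproduced in this text, so the sketch substitutes the standard particle-swarm update rule as an assumption; it is not the Lion-AYAD update. The coefficient values, the function names and the two-dimensional position (number of hidden layers, nodes per layer) are likewise hypothetical.

```python
import random


def swarm_step(position, velocity, personal_best, global_best, rng,
               inertia=0.7, c1=1.5, c2=1.5):
    """One standard PSO-style update for a single search agent.

    Assumption: v <- inertia*v + c1*r1*(pBest - x) + c2*r2*(gBest - x), x <- x + v.
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()
        v_next = inertia * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


def decode_structure(position):
    """Map a continuous position to (hidden layers, nodes per layer), each at least 1."""
    return max(1, round(position[0])), max(1, round(position[1]))


rng = random.Random(0)
position = [2.0, 32.0]        # hypothetical start: 2 hidden layers, 32 nodes
velocity = [0.0, 0.0]
personal_best = [2.0, 32.0]
global_best = [3.0, 64.0]     # hypothetical best structure found by the swarm
position, velocity = swarm_step(position, velocity, personal_best, global_best, rng)
layers, nodes = decode_structure(position)
```

The same update would be repeated for the weight and bias vectors, with the fitness function evaluated after each move to refresh the personal and global bests.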

Implementation and Results
This section presents the results and details of each stage.

First Case Study
At the outset, note that 64 RNA triplets can be generated from DNA sequences of different lengths, and these encode the twenty types of amino acids from which proteins are built, as explained in Table 4a and Figure 6.
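The mapping from the 64 RNA triplets to the twenty amino acids can be written down compactly. The sketch below uses the standard genetic code with the bases ordered U, C, A, G; the function names and the example DNA fragment are hypothetical, and the four KC rules of the proposed model are not represented here.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in U, C, A, G order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna_coding_strand):
    """Convert a DNA coding strand to mRNA by replacing T with U."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical DNA fragment: ATG GCC TGG TAA
peptide = translate(transcribe("ATGGCCTGGTAA"))
```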

Second Case study
We take a set of labeled datasets on the experimental activity of small molecules on COVID-19, given as simplified molecular-input line-entry system (SMILES) formulas, and apply the proposed method to them by building a structure-activity relationship for COVID-19 targets. The main molecular features are described in Table 4b, and the analysis of the crystal structure of the SARS-CoV-2 NSP3 macrodomain is given in Table 4c.

Table 4e: Samples of SMILES related to Testing Dataset
Table 4g: Relationship among Structure and SMILES formula of Testing Dataset
When the bidirectional DLSTM is applied with the parameters shown in Table 3, it achieves an accuracy of 95% on the training dataset and 91% on the testing dataset.
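Before a recurrent network can be trained on SMILES, each string has to be turned into a sequence of integer tokens. The text does not describe this preprocessing, so the following Python sketch is only one plausible assumption: the regular expression, the padding index 0 and all function names are hypothetical.

```python
import re

# Assumption: bracket atoms, Br, Cl and two-digit ring closures are single tokens;
# every other non-whitespace character is its own token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|\S")
PAD = 0  # index reserved for padding


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Assign each distinct token an integer index starting at 1."""
    vocab = {}
    for smiles in smiles_list:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(smiles, vocab, max_len):
    """Encode a SMILES string as a fixed-length list of token indices."""
    ids = [vocab[token] for token in tokenize(smiles)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


samples = ["CCO", "ClCCBr"]   # hypothetical molecules, not from the dataset
vocab = build_vocab(samples)
encoded = [encode(s, vocab, max_len=5) for s in samples]
```

The fixed-length integer sequences in `encoded` are the kind of input a bidirectional LSTM layer typically consumes after an embedding step.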

Third Case Study
In this case study, real and true interactions are taken to prove the performance of the suggested model. These interactions are characterized by balanced and accurate analysis results, and the DLSTM results appear in Figure 7.
▪ Step 1: Draw the graph that shows the relationship among the materials.
▪ Step 2: Apply LION-AYAD to split the sequence into multiple subsequences and compute the frequency of each subsequence. In this step, the distinct subsequences are put into a buffer to work on them, which reduces the computation required in the next steps.

Evaluation stages
This section shows the evaluation results of the DLSTM on the training and testing datasets for each case study, based on five measures, as explained in Tables 10 and 11.

Conclusion and Future works
The analysis of the crystal structure of the SARS-CoV-2 NSP3 macrodomain, from DNA sequences or SMILES structures, is one of today's hot subjects. To determine which of the proteins generated from that analysis are active or inactive in causing the disease, this research uses intelligent analysis based on optimization and deep learning algorithms plus knowledge construction.
Lion-AYAD is an optimization algorithm used to split any structure (i.e., DNA, SMILES, or any reaction) into multiple subsequences based on rules, then find the frequency of each one, copy the distinct subsequences and put them into a buffer to be processed later by the DLSTM. This step reduces the computational complexity compared with working on all the data.
The DSN determines the parameters and activation function of the DLSTM. Its advantage is that it reduces the execution time of the LSTM by finding the optimal parameters for the DLSTM rather than choosing them by trial and error.
DLSTM is a development of LSTM through the DSN: Lion-AYAD is used to determine the optimal number of hidden layers, number of nodes in each hidden layer, weights, biases and activation function. The advantage of DLSTM is that it can deal with huge data and contains a memory cell that saves information over the long term; its limitation is that it contains a huge number of parameters.
Finally, adding the KC to the model increases the accuracy of the results and makes it a pragmatic model. The KC apply four rules during the synthesis of each tRNA triplet to generate proteins and five rules during the synthesis of each SMILES structure.
Evaluation is the process of calculating the amount of error between the actual value and its predicted value. There are different types of measures related to the confusion matrix, including Accuracy, Precision, Recall, F and Fb. Although the dataset used in this paper is very sensitive and critical, the results shown in Table 10 and Table 11 indicate that the intelligent predictive model is a pragmatic model in this field for determining the active and inactive (i.e., passive) proteins.
Combining the two computation techniques (Lion-AYAD and DLSTM) reduces the search time and enhances performance (i.e., reduces the computation). The following points may be good ideas for future work.
The model could be developed into a recommendation system by using a mining algorithm such as FP-growth to find all subsequences with their frequencies. Other datasets for SMILES structures, DNA sequences or chemical reactions could also be used.

▪ The First Stage splits the DNA sequence into multiple subsequences and saves the distinct subsequences into a buffer to work on them.
▪ The Second Stage builds the predictor and includes several steps: (a) determining the main structure and objective function of the LSTM using an optimization technique called the whale optimization technique; (b) training the DLSTM, built from the parameters resulting from the previous step, to predict the proteins; (c) investigating the results through four rules related to knowledge constructions.
▪ The Final Stage uses different measures to evaluate the results, based on the confusion matrix (i.e., five measures):
➢ Accuracy (A): the percentage of correct predictions made by a classification method or model. It is the ratio of all "true" outcomes to all observations.
➢ Precision (P): the percentage of true positive outcomes out of the total number of positive predictions made by the classifier.
➢ Recall (R): how many real positives are accurately predicted; it is the ratio of the number of true positives to the total number of components in the positive class.
➢ F-measure (F): this measure is based on both precision and recall.
➢ Fb: the beta factor $(1 + \beta^2)$ multiplied by Precision and Recall, divided by beta squared multiplied by Precision plus Recall: $F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$.
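The five measures can be computed directly from the four cells of a confusion matrix. The Python sketch below is a minimal illustration; the function name, the default value of beta and the example counts are assumptions and are not taken from Tables 10 and 11. F is taken here to be the balanced F-measure (beta = 1), which is also an assumption.

```python
def evaluate(tp, fp, tn, fn, beta=2.0):
    """Compute Accuracy, Precision, Recall, F (beta = 1) and Fb from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    def f_score(b):
        # F_b = (1 + b^2) * P * R / (b^2 * P + R); defined as 0 when the denominator is 0.
        denominator = b * b * precision + recall
        return (1 + b * b) * precision * recall / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f": f_score(1.0),
        "f_beta": f_score(beta),
    }


# Hypothetical counts used only to exercise the function.
scores = evaluate(tp=40, fp=10, tn=45, fn=5)
```

With beta greater than 1, Fb weights recall more heavily than precision, which suits settings where missing an active protein is costlier than a false alarm.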

Figure 6: Results of DLSTM for the training dataset, showing active and inactive proteins in SARS-CoV-2 infection
▪ Step 4: The main loop of the algorithm starts and continues until the termination criteria are satisfied. Within this loop, the objective values for the population of search agents are evaluated as: OF (SA1) = 1159, OF (SA2) = 266, OF (SA3) = 26526, OF (SA4) = 4122.

Figure 7: Results of DLSTM for the training and testing datasets for case study three, showing the relationship between the number of iterations and time

Table 1: Comparison among Main Prediction Techniques related to Neurocomputing
This work attempts to close this gap by suggesting a new tool that determines the structure of the network and its parameters through one optimization algorithm; the benefit of whale compared with other optimization algorithms is shown in Table 2. The algorithm searches for the optimal objective function, number of hidden layers, number of nodes in each layer and weights, called the DSN.
▪ Determining a set of rules (Knowledge Constructions, "KC") for the generation of each protein from tRNA synthesis, based on the location of each component (i.e., U, C, G and A) in the tRNA triplets, as shown in the final stage of LION-AYAD.

Table 2: Comparison among optimization techniques

Table 4c: Relationship among Molecule, Structure and SMILES formula
The training dataset includes 5555 records, explained in Table 4d; Table 4e shows the relationship among structure and SMILES formula for the training dataset. The testing dataset includes 1614 SMILES samples, explained in Table 4f; Table 4g shows the relationship among structure and SMILES formula for the testing dataset.

Table 4. Population of search agents

Table 5. First iteration in population of search agents

Table 6. Cumulative growth of population of search agents

Table 7. Cumulative growth of population of search agents

Table 9. Cumulative growth of population of search agents

Table 10. Results of evaluation DLSTM for Training Dataset

Table 11. Results of evaluation DLSTM for Testing Dataset