Using a multifeature modal fusion approach, we improve the deep learning architecture. We begin with a step common to image captioning pipelines, namely preprocessing. The design of the parameter model is then examined in detail, including the choice of optimizer, the mini-batch training strategy, and how overfitting is avoided. Finally, a flowchart of the entire visual description pipeline is presented from a macro perspective.
A. Preprocessing
Noise is common in images and degrades image quality. To address this, the Dynamic Low Rank Regularization (DLR2) algorithm is proposed. In this algorithm, a regularization weight is computed that adaptively changes the pixel values according to the SNR. The algorithm also smooths the image using a noise variance metric. A general noise model can be formulated as follows,
$$G\left[n,m\right]=F\left[n,m\right]U\left[n,m\right]+\alpha [n,m]$$
1
Where \(G\) and \(F\) are the measured and original images, respectively, \(\alpha\) represents the additive noise component, \(n,m\) are the radial and angular coordinates, respectively, and \(U\) is the multiplicative component of the noise. For images, the amplitude SNR is a contrast value \(r\) defined as follows,
$$r=\frac{\mu }{\sigma }$$
2
Where \(\mu\) is the mean and \(\sigma\) is the standard deviation. The proposed DLR2 method for noise suppression recovers even complex feature structures by preserving both local and global information of TRUS images, even when the SNR is low.
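As a small illustration of Eq. (2), the amplitude SNR of an image patch can be computed as below (a minimal sketch in plain Python; the function name is ours):

```python
def amplitude_snr(pixels):
    """Amplitude SNR r = mu / sigma for a list of pixel values (Eq. 2)."""
    n = len(pixels)
    mu = sum(pixels) / n                                      # mean intensity
    sigma = (sum((p - mu) ** 2 for p in pixels) / n) ** 0.5   # standard deviation
    return mu / sigma

# A nearly uniform patch has high SNR; a heavily corrupted patch has low SNR.
print(amplitude_snr([100, 102, 98, 100]))
print(amplitude_snr([100, 160, 40, 100]))
```

A low \(r\) flags patches where the adaptive regularization weight should be strengthened.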
The DLR2 method uses a noise variance metric that effectively detects speckle noise at each pixel, computed as follows,
$${\sigma }^{2}=\frac{1}{n}\sum _{i=1}^{n}{d}_{{\mu }_{i}}^{2}$$
3
In the above equation, \(n\) is the total number of pixels in the image and \({d}_{{\mu }_{i}}\) is the per-pixel deviation of the denoised image. The DLR2 method then measures the group of regions \(₰\) for the given input image \(i\), and the corresponding observation \({\mathcal{o}}_{i}\) is computed as follows,
\(₰\left(i\right)=\text{arg}\underset{₰\left(i\right)}{\text{min}}\frac{1}{2\lambda }{‖{\mathcal{o}}_{i}-₰\left(i\right)‖}_{2}^{2}+\text{Rank}\left(₰\left(i\right)\right)\) (4)
Where \(\lambda\) is the regularization weight for \({d}_{{\mu }_{i}}\). Because images vary in SNR and noise, \(\lambda\) is dynamically adjusted and computed for each individual image. Further, a log-determinant rank penalty is computed under Low Rank Regularization as follows,
\(₰\left(i\right)=\text{arg}\underset{₰\left(i\right)}{\text{min}}\frac{1}{2\lambda }{‖{\mathcal{o}}_{i}-₰\left(i\right)‖}_{2}^{2}+\text{log}\text{det}\left(₰\left(i\right){₰\left(i\right)}^{T}\right)\) (5)
In this way, speckle noise pixels are removed from the images, and this step results in accurate object feature detection and segmentation.
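A common way to realize a low-rank regularizer like Eqs. (4)–(5) is soft-thresholding of singular values; the sketch below illustrates that idea on a toy patch. This is an assumption-laden illustration, not the paper's exact solver: the function name, the fixed `lam`, and the synthetic data are ours.

```python
import numpy as np

def low_rank_denoise(obs, lam):
    """Sketch of low-rank recovery: soft-threshold the singular values of the
    observed patch; lam plays the role of the per-image regularization weight."""
    U, s, Vt = np.linalg.svd(obs, full_matrices=False)
    s = np.maximum(s - lam, 0.0)      # shrink small (noise) singular values to zero
    return (U * s) @ Vt

rng = np.random.default_rng(0)
clean = np.outer(np.arange(8.0), np.ones(8))      # rank-1 "structure"
noisy = clean + 0.1 * rng.standard_normal((8, 8)) # add speckle-like noise
recovered = low_rank_denoise(noisy, lam=0.5)
print(np.linalg.norm(recovered - clean), np.linalg.norm(noisy - clean))
```

The shrinkage removes the weak singular components carrying the noise while keeping the dominant structure, which is the mechanism the rank penalty encourages.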
Contrast enhancement is carried out by the Modified CLAHE (MCLAHE) algorithm, which overcomes the problems of conventional CLAHE. It enhances the contrast and also removes blurriness from the image, since the histogram concept provides good quality results for imaging. For that reason, the proposed NSPBS employs an effective histogram algorithm.
A clip limit \({\mathcal{C}}_{ʟp}\) is computed from the histogram slope of the input image, and the histogram is trimmed. Clipping cuts off the peak value of the histogram of each block, and the clipped pixels are redistributed over the gray levels. A higher value of \({\mathcal{C}}_{ʟp}\) yields a more strongly contrast-enhanced image.
$${\mathcal{C}}_{ʟp}=\frac{{N}_{\mathbb{P}}}{{D}_{r}}\left(1+\frac{\rho }{100}{\tau }_{Slp}\right)$$
6
Where \({N}_{\mathbb{P}}\) is the number of pixels in each region of the input image, \({D}_{r}\) is the dynamic contrast range of images in the dataset, \(\rho\) is the clipping factor, and \({\tau }_{Slp}\) is the maximum slope. Once \({\mathcal{C}}_{ʟp}\) is computed, a power law transformation is applied to the images in the contrast enhancement step.
$${O}_{g}={p}_{1}{{I}_{g}}^{{p}_{2}}$$
7
Where \({O}_{g}\) is the gray level intensity of the output image, \({I}_{g}\) is the gray level of the input image, and \({p}_{1}\) and \({p}_{2}\) are positive constants. When \({p}_{2}\) is less than 1, MCLAHE maps a narrow range of dark input pixels to a wide range of output values, while a wide range of bright input pixels is mapped to a narrow range of output values. Conversely, when \({p}_{2}\) is greater than 1, a wide range of dark input pixels is mapped to a narrow range of output values, and a narrow range of bright input pixels is mapped to a wide range of output values.
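The mapping of Eq. (7) can be sketched directly (a toy illustration with normalized intensities in [0, 1]; the function name and default constants are ours):

```python
def power_law(intensity, p1=1.0, p2=0.5):
    """Power-law transformation O_g = p1 * I_g ** p2 (Eq. 7)."""
    return p1 * intensity ** p2

# p2 < 1 stretches dark values toward the bright end: 0.04 -> 0.2
print(power_law(0.04, p2=0.5))
# p2 > 1 compresses dark values further: 0.04 -> 0.0016
print(power_law(0.04, p2=2.0))
```

The two calls demonstrate the dark-pixel expansion and compression behavior described above.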
In this way, the contrast level of the image is improved by the power law based CLAHE algorithm, and blurriness is removed from the feature regions of the input image. The MCLAHE pseudocode is given below.
Pseudocode for MCLAHE

Input: Input image (\({\mu }_{i}\))
Begin
For (\(i\leftarrow 1:n\)) do
Estimate \({\mathcal{C}}_{ʟp}\) for each \({\mu }_{i}\);
If (\({p}_{2}<1\)) do
Map narrow range of dark input to wide range of output;
Map extensive range of bright input to narrow range of output;
Else
Map extensive range of dark input to narrow range of output;
Map narrow range of bright input to wide range of output;
End If
End For
End
Output: Contrast and edge enriched image
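The clip limit estimated in the first step of the pseudocode follows Eq. (6); a minimal sketch (the function name and example values are ours):

```python
def clip_limit(n_pixels, dyn_range, rho, tau_slp):
    """Clip limit C_lp = (N_P / D_r) * (1 + (rho / 100) * tau_slp), as in Eq. (6)."""
    return (n_pixels / dyn_range) * (1.0 + rho / 100.0 * tau_slp)

# e.g. a block of 64 pixels, 256 gray levels, clipping factor 4, max slope 3
print(clip_limit(64, 256, 4, 3))
```

Raising the clipping factor \(\rho\) or the maximum slope raises the limit, allowing stronger local contrast before redistribution.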

With these steps, better object feature detection can be obtained, because successful preprocessing retains more useful information. As a result, the efficiency of the subsequent processes, i.e. feature detection and segmentation, is increased.
B. Semantic Features Extraction using Mask RCNN
We propose a Mask RCNN based algorithm that performs better in semantic feature extraction than the existing CNN and UNet algorithms. The Mask RCNN architecture is depicted in Fig. 1.
Mask RCNN performs semantic segmentation, which understands the image at the pixel level. It accurately segments the object region with the aid of two components: the Region Proposal Network (RPN) and the Feature Pyramid Network (FPN). From these two components, object boundaries are detected for multiple views, i.e. angular information. In Mask RCNN, anchor points are generated and adjusted using the aspect ratio of the target image.
In this paper, Mask RCNN is applied for object feature prediction. For the input image, multiview boundaries are predicted and then fused for object feature prediction. Hence, the proposed Mask RCNN attains higher accuracy in object feature detection than patch or region level feature detection. We treat feature detection as a binary captioning problem, distinguishing any type of field feature pixel, as represented by the reference data, from any type of nonfeature pixel. Multiorientation images are common in ultrasound imaging, and the network predicts the offsets from each given point.
a. ROI Align
ROI Align is a significant step in Mask RCNN that improves object feature detection. To obtain the desired region information from the contrast enhanced images, both foreground and background blocks are classified optimally. Then the Region of Interest (ROI) area is constructed for multiple orientations as follows,
$${\mu }_{i}=\left({0}^{\circ },{45}^{\circ },{90}^{\circ },{135}^{\circ },{180}^{\circ },{225}^{\circ },{270}^{\circ },{315}^{\circ },{360}^{\circ }\right)$$
8
\(n\) anchor points are generated in the RPN, and bilinear interpolation \({BL}_{\left(i\right)}\) is carried out in the RPN. Assume that four anchor points are generated and used to find the feature pixels; the anchor points are as follows,
$${BL}_{\left(i\right)}= \left\{{AP}_{1},{AP}_{2},{AP}_{3},{AP}_{4}\right\}$$
9
From these base points, \({BL}_{\left(i\right)}\) is computed as follows,
$$ROI\left(X,Y\right)\approx \frac{ROI\left({AP}_{1}\right)}{\left({X}_{2}-{X}_{1}\right)\left({Y}_{2}-{Y}_{1}\right)}\left({X}_{2}-X\right)\left({Y}_{2}-Y\right)+$$
$$\frac{ROI\left({AP}_{2}\right)}{\left({X}_{2}-{X}_{1}\right)\left({Y}_{2}-{Y}_{1}\right)}\left(X-{X}_{1}\right)\left({Y}_{2}-Y\right)+$$
$$\frac{ROI\left({AP}_{3}\right)}{\left({X}_{2}-{X}_{1}\right)\left({Y}_{2}-{Y}_{1}\right)}\left({X}_{2}-X\right)\left(Y-{Y}_{1}\right)+$$
$$\frac{ROI\left({AP}_{4}\right)}{\left({X}_{2}-{X}_{1}\right)\left({Y}_{2}-{Y}_{1}\right)}\left(X-{X}_{1}\right)\left(Y-{Y}_{1}\right)$$
When the optimal positions have been computed from the ROI-aligned area, the captioning pixels are set to indicate the feature and nonfeature pixels in the produced feature map. A mask is then generated on the feature map for the feature area, and the estimated optimal positions are joined to obtain the feature. This procedure is executed for all orientation angles, and the distance between feature and nonfeature pixels is computed. Feature pixel information from multiple angles and orientations is crucial for missing part detection, and it also addresses feature incompleteness issues. Hence, Mask RCNN not only extracts the initial feature; it also provides contextual information about the given image in multiple orientations.
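The bilinear interpolation over the four anchor points can be sketched as below (a self-contained toy version; the function name and sample values are ours):

```python
def roi_align_sample(corner_val, x, y, x1, y1, x2, y2):
    """Bilinear interpolation over four anchor points, as used in ROI Align.
    corner_val maps each anchor-point coordinate to its feature value."""
    denom = (x2 - x1) * (y2 - y1)
    return (corner_val[(x1, y1)] * (x2 - x) * (y2 - y)
            + corner_val[(x2, y1)] * (x - x1) * (y2 - y)
            + corner_val[(x1, y2)] * (x2 - x) * (y - y1)
            + corner_val[(x2, y2)] * (x - x1) * (y - y1)) / denom

anchors = {(0, 0): 1.0, (1, 0): 3.0, (0, 1): 5.0, (1, 1): 7.0}
print(roi_align_sample(anchors, 0.5, 0.5, 0, 0, 1, 1))  # midpoint -> 4.0
```

Sampling at non-integer positions this way is what lets ROI Align avoid the quantization error of ROI pooling.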
b. Fusion Operation
In the fusion operation, we simply use the reshape function to obtain the initial feature \({B}_{i}\) from the multiple angle based images. For this, a scaling operation is applied to \({B}_{i}\), giving \({B}_{i}\left(s\right)\) below.
$${ B}_{i}\left(s\right)={F}_{S}\left(\omega ,{m}_{i}^{\text{'}}\right)$$
10
Where \({m}_{i}^{\text{'}}\) is the feature map rotated by angle \({\theta }_{i}\), \({F}_{S}\) is the scaling operation, and \(\omega\) is the weighted result of the oriented information.
C. Loss Function
Once fusion has been applied after multiview tracking, the multiview loss function for Mask RCNN is implemented as follows,
$$\mathcal{l}= {\mathtt{l}}_{C}+{\mathtt{l}}_{box}+{\mathtt{l}}_{Mask}$$
11
where \({\mathtt{l}}_{C}\), \({\mathtt{l}}_{box}\) and \({\mathtt{l}}_{Mask}\) are the loss values for feature/nonfeature class prediction, bounding box alignment and mask detection, respectively. This loss is computed between the ground truth and the initial feature pixel detection class of the image. It is defined as,
$$\mathcal{l}\left({P}_{Bb}, {GT}_{Bb},{P}_{C},{GT}_{C}\right)= {\mathcal{l}}_{cls}\left({P}_{C},{GT}_{C}\right)+\phi \left[{GT}_{C}\ge 1\right]{\mathcal{l}}_{box}\left({P}_{Bb},{GT}_{Bb}\right)$$
12
where \({P}_{Bb}\) and \({GT}_{Bb}\) represent the predicted and ground truth bounding boxes, respectively, and \({P}_{C}\) and \({GT}_{C}\) are the predicted and ground truth classes, respectively.
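The interaction of Eqs. (11) and (12) can be sketched as follows: the classification term always applies, while the indicator \(\phi [{GT}_{C}\ge 1]\) gates the localization terms to foreground samples. The function name and example loss values below are illustrative, not from the paper.

```python
def total_loss(l_cls, l_box, l_mask, gt_class):
    """Multi-task loss l = l_C + l_box + l_Mask (Eq. 11); the box and mask
    terms count only for foreground boxes (gt_class >= 1), mirroring the
    indicator phi[GT_C >= 1] in Eq. (12)."""
    foreground = 1.0 if gt_class >= 1 else 0.0
    return l_cls + foreground * (l_box + l_mask)

print(total_loss(0.4, 0.2, 0.3, gt_class=1))  # all three terms contribute
print(total_loss(0.4, 0.2, 0.3, gt_class=0))  # background: classification only
```

Gating the box and mask losses this way keeps background proposals from dominating the localization gradients.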
According to the image modality, image features are extracted from the medical image. For instance, shape feature extraction is performed as follows: the shape feature of an image patch consists of compactness and smoothness, combined as follows,
$${ H}_{Shape}= {w}_{smooth}*{h}_{smooth}+{w}_{compact}*{h}_{compact } \left(13\right)$$
Overall Heterogeneity
From the image patch, the overall heterogeneity is described as follows,
$$H={w}_{color}*{h}_{color}+{w}_{shape}*{h}_{shape } \left(14\right)$$
Where \({w}_{color}\) and \({w}_{shape}\) represent the effects of color and shape, respectively. We propose to use an LSTM model to derive the relevance of each image feature for a given collection of image features V, in order to produce the appropriate wording for each time series datum. The LSTM reads the input and produces the output text sequences as the content created during the text generation process. Our research provides a multilayer LSTM method that incorporates bidirectional and linguistic LSTM for better results on several assessment indicators. The text generation method is built on three LSTM gates: forget \({F}_{t}\), input \({I}_{t}\) and output \({O}_{t}\). At each time step, the cell state \({C}_{t}\) and the current hidden state \({H}_{t}\) are identified using the old hidden state \({H}_{t-1}\) and the current input \({X}_{t}\).
$${I}_{t}={\vartheta }_{G}\left[{W}_{I}\left({H}_{t-1},{X}_{t}\right)+{B}_{I}\right]$$
15
Where \({\vartheta }_{G}\) represents the logistic sigmoid function, which is defined by:
$$f\left(x\right)=\frac{1}{1+{e}^{-x}}$$
16
This logistic sigmoid function computes the probability for each feature vector used to classify each caption. Its output ranges between 0 and 1.
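A minimal sketch of Eq. (16) using the standard library (the function name is ours):

```python
import math

def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + e^(-x)); squashes any gate
    activation into the (0, 1) probability range."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))   # -> 0.5, the midpoint of the output range
```

The monotone (0, 1) output is what makes it suitable for the soft gating in Eqs. (15), (17) and (18).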
$${O}_{t}={\vartheta }_{G}\left[{W}_{O}\left({H}_{t-1},{X}_{t}\right)+{B}_{O}\right]$$
17
\({\vartheta }_{C}\) represents the hyperbolic tangent function, and \(\odot\) represents elementwise multiplication.
$${F}_{t}={\vartheta }_{G}\left[{W}_{F}\left({H}_{t1},{X}_{t}\right)+{B}_{F}\right]$$
18
Then we compute the cell states and cell outputs in the following:
$${C}_{t}= {F}_{t}\odot {C}_{t-1}+ {I}_{t}\odot {Q}_{t}$$
19
$${H}_{t}= {O}_{t}\odot {\vartheta }_{C}\left({C}_{t}\right)$$
20
Finally, the softmax classifier is used to find the probability of feature vectors for each label.
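The gate recursion of Eqs. (15)–(20) can be sketched for a single scalar unit. This is a toy illustration only: the weight/bias containers, their values, and the scalar parameterization are ours, not the trained model.

```python
import math

def lstm_step(h_prev, c_prev, x, w, b):
    """One scalar LSTM step following Eqs. (15)-(20)."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    i = sig(w["i"][0] * h_prev + w["i"][1] * x + b["i"])        # input gate, Eq. 15
    o = sig(w["o"][0] * h_prev + w["o"][1] * x + b["o"])        # output gate, Eq. 17
    f = sig(w["f"][0] * h_prev + w["f"][1] * x + b["f"])        # forget gate, Eq. 18
    q = math.tanh(w["q"][0] * h_prev + w["q"][1] * x + b["q"])  # candidate Q_t
    c = f * c_prev + i * q                                      # cell state, Eq. 19
    h = o * math.tanh(c)                                        # hidden state, Eq. 20
    return h, c

w = {g: (0.5, 0.5) for g in "ifoq"}   # toy shared weights
b = {g: 0.0 for g in "ifoq"}          # toy zero biases
h, c = lstm_step(0.0, 0.0, 1.0, w, b)
print(h, c)
```

Iterating this step over the feature sequence yields the hidden states that feed the softmax classifier.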
Figure 3 shows the LSTM construction for feature extraction. The LSTM network parameters and their values are defined below.
LSTM network parameters | Values
Batch Size | 32
Number of LSTM Layers | 3
Units for Each Layer | 256
Optimizer for learning rate | Adam Optimizer
Max feature vectors length | 776
In general, conventional LSTM networks learn and extract features in only one direction (forward). In this paper, we propose a bidirectional LSTM with the Adam optimizer, which is an optimal solution to the drawbacks of conventional LSTM. Our proposed LSTM architecture learns features from the inherent structure and can read the feature sequence in either direction (forward/backward). Our proposed feature extraction model using BiLSTM is depicted in Fig. 4.
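The bidirectional reading can be sketched generically: run the same recurrent step forward and backward over the sequence and pair the two final states. This is a simplified stand-in for concatenating BiLSTM hidden states; the function name and the toy moving-average step are ours, not the trained network.

```python
def bidirectional_read(sequence, step, h0=0.0, c0=0.0):
    """Run a recurrent step over the sequence in both directions and
    return the pair of final hidden states (forward, backward)."""
    def run(seq):
        h, c = h0, c0
        for x in seq:
            h, c = step(h, c, x)
        return h
    return run(list(sequence)), run(list(reversed(sequence)))

# toy recurrent step: exponential moving average instead of a full LSTM cell
ema = lambda h, c, x: (0.5 * h + 0.5 * x, c)
fwd, bwd = bidirectional_read([1.0, 2.0, 3.0], ema)
print(fwd, bwd)
```

The two states differ because each direction weights recent inputs more heavily, which is exactly the complementary context a BiLSTM exploits.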