Encoding: The first step uses a convolutional neural network to extract features from the input image.
Transformation: The feature vector of the image in domain A is converted into a feature vector in domain A1 by recombining the non-similar features of the image. This stage contains six ResNet (residual) modules, each a small network composed of two convolutional layers, which allows the original image features to be retained during the conversion.
Decoding: Low-level features are restored from the feature vectors by deconvolution, and the generated image is finally obtained.
The goal of CycleGAN is to convert an image A into another domain to produce image A1, and then convert A1 back towards A; the reconstructed image should be similar to the original input A, so that a meaningful mapping is learned even though the data set is unpaired. The advantage of CycleGAN is its ability to train on two image sets without pairing them.
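To make the encode-transform-decode structure above concrete, the following is a minimal PyTorch sketch of such a generator. It is illustrative rather than the exact network used here: the channel widths, kernel sizes and the use of instance normalization are assumptions, with six residual blocks in the transformation stage as described.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolutions with a skip connection, used in the transformation stage."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # keep original features while transforming

class Generator(nn.Module):
    """Encode -> transform (six residual blocks) -> decode."""
    def __init__(self, in_channels=1, base=64, n_res=6):
        super().__init__()
        # Encoding: convolutions extract features and downsample
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, 7, stride=1, padding=3),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4), nn.ReLU(inplace=True),
        )
        # Transformation: residual blocks map features from domain A to domain A1
        self.transform = nn.Sequential(*[ResidualBlock(base * 4) for _ in range(n_res)])
        # Decoding: transposed convolutions restore the image resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, in_channels, 7, stride=1, padding=3),
            nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.transform(self.encoder(x)))
```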
2.1.2 DENSENET
DenseNet[13] is a convolutional neural network framework with dense connectivity proposed by Huang et al. in 2017. In its architecture there is a direct connection between any two layers: the input of each layer is the combination of the outputs of all preceding layers, which strengthens feature propagation. This alleviates the vanishing-gradient problem, reduces the number of network parameters and encourages feature reuse. DenseNet has been widely used in the medical imaging field. The specific structure is shown in Figure 2.
See Figure 2
Each basic unit of a Dense Block consists of a BN layer, a ReLU layer and a 1×1 convolution kernel, followed by a BN layer, a ReLU layer and a 3×3 convolution kernel. Many such basic units make up a complete DenseNet, and adjacent Dense Blocks are connected by layers composed of a convolution layer and a pooling layer. DenseNet uses this block-stacking structure to strengthen feature transfer and make more effective use of features, thereby reducing the number of parameters to a certain extent and alleviating the vanishing-gradient phenomenon.
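The basic unit just described can be written down directly. The following PyTorch sketch is only illustrative (the growth rate, bottleneck factor and number of layers per block are assumptions), but it shows the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) unit, the concatenation that creates the dense connections, and a transition of convolution plus pooling between blocks.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One basic unit of a Dense Block: BN-ReLU-Conv(1x1) then BN-ReLU-Conv(3x3).
    Its output is concatenated with its input, so every later layer sees all
    earlier feature maps."""
    def __init__(self, in_channels, growth_rate=32, bottleneck=4):
        super().__init__()
        inner = bottleneck * growth_rate
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inner, kernel_size=1, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)  # dense connectivity

class DenseBlock(nn.Sequential):
    """Stack of dense layers; the channel count grows by growth_rate per layer."""
    def __init__(self, in_channels, n_layers=4, growth_rate=32):
        super().__init__(*[DenseLayer(in_channels + i * growth_rate, growth_rate)
                           for i in range(n_layers)])

class Transition(nn.Sequential):
    """Convolution plus pooling between Dense Blocks, reducing size and channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__(nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
                         nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                         nn.AvgPool2d(2))
```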
2.1.3 RESNET
ResNet[14] is a convolutional neural network framework proposed by He et al. in 2015. It adds a shortcut connection on top of the original architecture so that the mappings of different layers can be connected directly: the output of a layer is the sum of its conventional output and the output of an earlier layer, which addresses the degradation problem. ResNet alleviates the gradient vanishing and gradient explosion problems caused by increasing network depth, and thus preserves the integrity of the information flowing through the network. The principle of the model is shown in Figure 3:
See Figure 3
As can be seen from Figure 3, the desired mapping of the stacked layers is H(x), where x is the input; the added layers are instead fitted to the residual mapping F(x) = H(x) - x. Optimizing the residual mapping F(x) is easier than optimizing the original mapping H(x), and because the identity term x is always present, the network no longer suffers from the vanishing-gradient problem that very deep models often exhibit.
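The relationship H(x) = F(x) + x can be expressed as a small residual block. The sketch below is a generic PyTorch example (the channel count and the BN/ReLU arrangement are assumptions), not the exact block used in this paper; the key point is that the shortcut adds the input x to the learned residual F(x).

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Fits the residual F(x); the block's output is H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.F(x) + x)  # shortcut: gradients flow through x unchanged
```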
2.1.4 CRNN
CRNN is a model proposed by Shi et al.[15] to deal with sequence-like objects in images; it consists of a DCNN and an RNN. The DCNN extracts sequence features from the input image, while the RNN, which is well suited to processing sequence data, achieves better recognition accuracy from the extracted sequence features. The specific model structure is shown in Figure 4:
See Figure 4
As can be seen from the figure, CRNN consists of three parts: a convolutional layer, a recurrent layer and a transcription layer. The convolutional layers extract a feature sequence from the input image. The recurrent layers then predict a label for each frame of this feature sequence. Finally, the transcription layer at the end of CRNN converts the per-frame predictions of the recurrent layers into a label sequence.
CRNN abandons the fully connected layers used in traditional neural networks and therefore yields a more compact and efficient model. It can accept input images of different sizes, produce predictions of different lengths, and be trained directly on coarse-grained (sequence-level) labels, so each individual element does not have to be annotated in detail during the training phase.
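As an illustration of the three-part structure, the following is a hedged PyTorch sketch of a CRNN: convolutional layers produce a feature sequence along the width of the image, a bidirectional LSTM predicts each frame, and a linear layer produces per-frame label scores for the transcription step (e.g., CTC decoding). The layer sizes and the number of classes are placeholders, not values from this paper.

```python
import torch.nn as nn

class CRNN(nn.Module):
    """Convolutional layers extract a feature sequence, a bidirectional LSTM
    predicts each frame, and a linear layer maps to per-frame label scores
    (to be decoded by the transcription step, e.g. CTC)."""
    def __init__(self, in_channels=1, n_classes=37, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, None)),   # collapse height, keep width as the time axis
        )
        self.rnn = nn.LSTM(256, hidden, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, channels, H, W)
        f = self.cnn(x).squeeze(2)              # (batch, 256, W')
        f = f.permute(2, 0, 1)                  # (W', batch, 256) = time-major feature sequence
        out, _ = self.rnn(f)
        return self.fc(out)                     # per-frame predictions over the label set
```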
2.2 PITUITARY TUMOR SEQUENCE DATA AUGMENTATION USING CYCLEGAN
A problem often encountered in MR images of pituitary tumors is under-sampling in a single domain (e.g., T1 or T2), which can be caused by missing data or simply by too few acquisitions. To resolve this issue, our main idea is to use images from the other domain (which may come from a different imaging modality) to generate a set of new images through domain conversion. The new and original images together form an augmented set that provides a better sample for the under-sampled domain.
Specifically, we use CycleGAN for data augmentation. First, two domain converters are designed and trained based on the CycleGAN architecture to perform inter-domain conversion from T1 to T2 and from T2 to T1. Then, the MR images generated by domain conversion are added to the original image sets to form the augmented T1 and T2 sequences.
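Assuming the two domain converters have already been trained (Section 2.2.2), assembling the augmented sets is straightforward. The helper below is a hypothetical sketch: `G_t1_to_t2` and `G_t2_to_t1` stand for the trained generators, and the slices are assumed to be stacked into tensors of shape (n, 1, H, W).

```python
import torch

@torch.no_grad()
def augment_domains(t1_slices, t2_slices, G_t1_to_t2, G_t2_to_t1):
    """Build augmented T1 and T2 sets: original slices plus slices converted
    from the other domain."""
    fake_t2 = G_t1_to_t2(t1_slices)              # T1 -> T2 conversion
    fake_t1 = G_t2_to_t1(t2_slices)              # T2 -> T1 conversion
    aug_t1 = torch.cat([t1_slices, fake_t1], dim=0)
    aug_t2 = torch.cat([t2_slices, fake_t2], dim=0)
    return aug_t1, aug_t2
```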
2.2.1 MULTIPLE SEQUENCE OF PITUITARY TUMOR MR IMAGES
As mentioned above, the MR images of one patient usually include spatial sequences from different modalities, such as T1WI, T2WI, T1C and T2FLAIR, etc. In this paper, we mainly use T1 and T2 spatial sequence images.
For each patient $i$, we denote its T1 spatial sequence as $T_1^i = \{t_{1,1}^i, \ldots, t_{1,N}^i\}$, where $t_{1,n}^i$ represents the $n$-th slice/frame of the T1 spatial sequence, and its T2 spatial sequence as $T_2^i = \{t_{2,1}^i, \ldots, t_{2,N}^i\}$, where $t_{2,n}^i$ represents the $n$-th slice/frame of the T2 spatial sequence. The number of slices per sequence is $N$ (12 in this paper). To classify the pituitary tumors, we combine the T1 and T2 spatial sequences of each patient $i$ to obtain a multi-sequence spatial sequence, which is denoted as:
$T^i = \{t_{1,1}^i,\, t_{2,1}^i,\, t_{1,2}^i,\, t_{2,2}^i,\, \ldots,\, t_{1,N}^i,\, t_{2,N}^i\}$ (1)
The total number of slices in a multi-sequence spatial sequence is 2N (24 in this paper).
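Equation (1) simply interleaves the two sequences slice by slice. A minimal Python sketch of this combination step, with `t1_seq` and `t2_seq` as hypothetical per-patient lists of N slices, is:

```python
def build_multisequence(t1_seq, t2_seq):
    """Interleave a patient's T1 and T2 slices as in Eq. (1):
    T^i = {t_{1,1}, t_{2,1}, t_{1,2}, t_{2,2}, ..., t_{1,N}, t_{2,N}}."""
    assert len(t1_seq) == len(t2_seq)        # both sequences have N slices
    multi = []
    for t1_slice, t2_slice in zip(t1_seq, t2_seq):
        multi.extend([t1_slice, t2_slice])
    return multi                             # 2N slices in total

# Example with N = 12 slices per sequence -> 24 slices per patient:
# T_i = build_multisequence(T1_i, T2_i)
```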
2.2.2 TRAINING DOMAIN CONVERTER BASED ON CYCLEGAN
In this paper, we use the CycleGAN framework to design and train the domain converters. CycleGAN is essentially a cyclic network consisting of two mutually symmetric GANs. On top of the original GAN, an additional cycle constraint is added that forces a converted image to be converted back to its original domain so as to reconstruct itself. This allows images to be converted from one domain to another without needing to pair them. The architecture of our domain converter is illustrated in Figure 5. In our design, we need to train the T1-to-T2 generator $Tr(t_{1,n}; \theta_1)$ and the T2-to-T1 generator $Tr(t_{2,n}; \theta_2)$, as well as the T1 domain discriminator $Dis(t_{1,n}; \theta_3)$ and the T2 domain discriminator $Dis(t_{2,n}; \theta_4)$, where $\theta_1$, $\theta_2$, $\theta_3$ and $\theta_4$ are the to-be-determined parameters of the deep neural networks.
See Figure 5
The training of the above network mainly consists of two steps:
1) Training of the discriminators: Fix the values of $\theta_1$ and $\theta_2$ and update the values of $\theta_3$ and $\theta_4$. This step discriminates the authenticity of the images. If the input is an MR image from the real domain, the label is 1; if the input is an MR image produced by a generator, the label is 0. In short, the role of the discriminator is to score images: real images from the original data set receive high scores, while generated fake images receive low scores. The network of this part is depicted in Figure 6, where each convolution layer consists of Conv2D, Leaky ReLU and Instance Normalization, and the digits denote the size and number of the convolution kernels.
2) Training of the generators: Fix the values of $\theta_3$ and $\theta_4$ and update the values of $\theta_1$ and $\theta_2$. This step performs the inter-domain conversion of images. Given input MR images from the T1 or the T2 domain, the generator sends them to the corresponding domain converter to generate MR images of the other domain. The generated images are then sent to the opposite domain converter to generate MR images of the original domain. After being converted twice, the obtained MR images are forced to be as similar as possible to the original ones. The network of this part is shown in Figure 7, where each convolution layer consists of Conv2D, Leaky ReLU and Instance Normalization. The first three de-convolution layers consist of UpSampling2D, Conv2D, ReLU and Instance Normalization, and the last de-convolution layer consists of UpSampling2D, Conv2D and Tanh. The dashed lines represent superimposing operations between the corresponding network layers, and the digits denote the size and number of the convolution kernels. A simplified sketch of this alternating training procedure is given below.
See Figures 6 and 7
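The following PyTorch sketch illustrates the two-step alternating training described above. It is a simplified, assumption-laden example rather than the exact training code: `discriminator` is a generic PatchGAN-style network standing in for the discriminators of Figure 6, the generators `G12`/`G21` could be instances of the generator sketched in Section 2.1.1, and the least-squares adversarial loss and cycle weight `lam = 10` are common CycleGAN defaults rather than values reported here.

```python
import torch
import torch.nn as nn

def discriminator(in_channels=1, base=64):
    """PatchGAN-style discriminator: Conv2D + LeakyReLU + InstanceNorm blocks
    ending in a one-channel score map (real -> 1, fake -> 0)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
        nn.InstanceNorm2d(base * 2), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
        nn.InstanceNorm2d(base * 4), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base * 4, 1, 4, padding=1),
    )

def train_step(t1, t2, G12, G21, D1, D2, opt_G, opt_D, lam=10.0):
    """One alternating update: first the discriminators (theta_3, theta_4),
    then the generators (theta_1, theta_2) with adversarial + cycle losses."""
    mse, l1 = nn.MSELoss(), nn.L1Loss()

    # 1) Discriminator step: real slices get label 1, generated slices label 0
    with torch.no_grad():
        fake_t2, fake_t1 = G12(t1), G21(t2)
    d_loss = 0.0
    for D, real, fake in [(D1, t1, fake_t1), (D2, t2, fake_t2)]:
        real_out, fake_out = D(real), D(fake)
        d_loss = d_loss + mse(real_out, torch.ones_like(real_out)) \
                        + mse(fake_out, torch.zeros_like(fake_out))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Generator step: fool the discriminators and reconstruct the input
    fake_t2, fake_t1 = G12(t1), G21(t2)
    adv = mse(D2(fake_t2), torch.ones_like(D2(fake_t2))) \
        + mse(D1(fake_t1), torch.ones_like(D1(fake_t1)))
    cycle = l1(G21(fake_t2), t1) + l1(G12(fake_t1), t2)   # twice-converted image ~ original
    g_loss = adv + lam * cycle
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```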
2.3 SEMI-SUPERVISED CLASSIFICATION METHOD FOR THE IMAGE TEXTURE OF PITUITARY TUMORS BASED ON ADAPTIVELY OPTIMIZED FEATURE EXTRACTION
To improve the efficiency of feature extraction for determining the softness level of pituitary tumors, we propose in this paper an Auto-Encoder-based deep neural network model for feature extraction that builds on ResNet. Because the weights of features common to all input data are reduced during training, the proposed model enhances the weights of the features unique to each MRI spatial sequence (i.e., the features of the pituitary tumor) while also reducing the dimensionality of the features of each slice. This greatly accelerates the subsequent classifier. The proposed Auto-Encoder-based framework for feature extraction is therefore essential to our classification method.
2.3.1 ENCODER AND DECODER BASED ON DENSE BLOCK AND RESIDUAL BLOCK
For the encoder, we use Dense Blocks to enhance the feature propagation ability for MRI spatial sequences, rely on convolutional and pooling layers to reduce the dimensionality, and combine them to form an encoder that extracts the common features of MRI spatial sequences. As shown in Figure 8, the encoder uses two Dense Blocks during training (only one is shown in the figure). Because the feature maps are superimposed (concatenated) during training, the propagation of pituitary tumor features is strengthened, which in turn improves the accuracy and reliability of feature extraction. A sketch of such an encoder is given below.
See Figure 8
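A minimal sketch of such an encoder is given here, reusing the `DenseBlock` and `Transition` classes from the Section 2.1.2 sketch. The number of layers per block, the growth rate and the final feature dimension are assumptions; the sketch only illustrates how Dense Blocks and convolution/pooling layers combine to produce a low-dimensional feature vector per slice.

```python
import torch.nn as nn
# DenseBlock and Transition are the classes sketched in Section 2.1.2.

class DenseEncoder(nn.Module):
    """Encoder sketch: an initial convolution, two Dense Blocks for feature
    propagation, and convolution + pooling (transition) layers that reduce the
    dimensionality of each MRI slice to a compact feature vector."""
    def __init__(self, in_channels=1, growth_rate=32, feature_dim=128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        self.block1 = DenseBlock(64, n_layers=4, growth_rate=growth_rate)    # 64 + 4*32 = 192 channels
        self.trans1 = Transition(192, 96)                                    # conv + pooling reduce size
        self.block2 = DenseBlock(96, n_layers=4, growth_rate=growth_rate)    # 96 + 4*32 = 224 channels
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(224, feature_dim))               # low-dimensional feature code

    def forward(self, x):
        x = self.stem(x)
        x = self.trans1(self.block1(x))
        x = self.block2(x)
        return self.head(x)
```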