5.1. Ultrasound Data Acquisition
In ultrasound imaging, the degree of lung involvement is reflected in several typical sonograms. The A-line is a horizontal reverberation artifact of the pleura caused by multiple reflections, and represents a normal lung surface [19]. The B-line represents the interlobular septum and appears as a discrete, laser-like vertical hyperechoic artifact that extends to the bottom of the screen; we denote it the B1-line [20]. The fusion B-line, a sign of pulmonary interstitial syndrome, appears as an intercostal space largely filled with confluent B-lines; we denote it the B2-line [15]. Pulmonary consolidation is characterized by a liver-like echo texture of the lung parenchyma with a thickness of at least 15 mm [16], as shown in Fig. 5.
We use three datasets from four medical centers to build and evaluate the model: ultrasound images collected with a Stork ultrasound system (Stork Healthcare Co., Ltd., Chengdu, China) at Ruijin Hospital, a Mindray ultrasound system (Mindray Medical International Limited, Shenzhen, China) at the Shanghai Public Health Center, and Philips ultrasound systems (Philips Medical Systems, Best, the Netherlands) at Wuhan Sixth People's Hospital and Hangzhou Infectious Disease Hospital. The Stork dataset was collected with an H35C (2-5 MHz) convex array transducer, the Mindray dataset with an SC5-1 (1-5 MHz) convex array transducer, and the Philips dataset with Epiq 5 and Epiq 7 systems using a C5-1 (1-5 MHz) convex array transducer.
5.2. Feature Map Extraction by Traditional Methods
As shown in Fig. 5, different ultrasound sonograms correspond to different degrees of lung involvement. The data for this study come from multiple centers and multiple devices. To make the diagnostic model more robust, we use traditional image processing methods to extract features that are insensitive to imaging parameters, and then feed the extracted feature maps into the deep learning model together with the original ultrasound image. Since the gradient field is highly sensitive to the parallel echo lines of the A-line, and K-Means clustering is highly sensitive to the laser-beam-like echo bars of the B-line, as shown in Fig. 6, we extract the gradient field and K-Means clustering images as two feature maps; a sketch follows.
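As a minimal illustration, the two feature maps could be computed as follows, assuming OpenCV and scikit-learn; the function names, the number of clusters, and the file path are illustrative and not taken from the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def gradient_field(gray):
    """Gradient-magnitude map: horizontal A-line echoes yield strong vertical gradients."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = cv2.magnitude(gx, gy)
    return cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

def kmeans_map(gray, k=3):
    """Cluster pixel intensities; bright vertical B-line bars fall into the brightest cluster."""
    pixels = gray.reshape(-1, 1).astype(np.float32)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    # Re-index clusters by mean intensity so the brightest cluster always maps to 255.
    order = np.argsort([pixels[labels == i].mean() for i in range(k)])
    lut = np.empty(k, dtype=np.uint8)
    lut[order] = np.linspace(0, 255, k).astype(np.uint8)
    return lut[labels].reshape(gray.shape)

gray = cv2.imread("lung_ultrasound.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
# Three-channel input: original image plus the two extracted feature maps.
x = np.stack([gray, gradient_field(gray), kmeans_map(gray)], axis=-1)
```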
5.3. SE_ResNeXt Classification Model
An overview of the proposed SE_ResNeXt for lung congestion degree classification is provided in Fig. 7. Taking one input as an example: after obtaining the gradient field and K-Means clustering information, we combine these two maps with the original image as additional channels to form a three-channel input (W × H × 3). A squeeze operation is then performed, i.e., global average pooling that encodes the entire spatial feature of each channel u_c as a global feature:

z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j)
The squeeze operation yields this global description feature; a second operation, the excitation operation, is required to capture the relationships between channels:

s = F_{ex}(z, W) = \sigma\big(W_2 \, \delta(W_1 z)\big)

where \delta denotes the ReLU function, \sigma the sigmoid function, W_1 \in \mathbb{R}^{(C/r) \times C}, W_2 \in \mathbb{R}^{C \times (C/r)}, and the dimension reduction coefficient r is a hyperparameter. The excitation operation thereby learns the nonlinear relationship between the channels. Finally, the learned activation value s_c of each channel (sigmoid activation) is multiplied by the original feature:

\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c
In this way the entire network learns a weight coefficient for each channel, which makes the model more discriminative with respect to the characteristics of each channel: the A-line pattern can draw more from the gradient field channel, and the B-line pattern more from the K-Means clustering channel, achieving a channel attention effect. A sketch of this block follows.
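Below is a minimal PyTorch sketch of this squeeze-and-excitation step, following the standard SE formulation above; the class name and the default reduction coefficient r = 16 are illustrative assumptions.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):  # r: dimension reduction coefficient
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # global average pooling, F_sq
        self.excite = nn.Sequential(            # F_ex: sigma(W2 * delta(W1 * z))
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)       # z_c: one global feature per channel
        s = self.excite(z).view(b, c, 1, 1)  # s_c: learned channel activations
        return u * s                         # F_scale: channel-wise re-weighting
```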
To take full advantage of channel attention, we choose ResNeXt as the backbone network for classification. ResNeXt [21] combines ideas from ResNet [22] and Inception [23], which improve accuracy by making networks deeper or wider; ResNeXt instead introduces cardinality, the number of parallel paths in a block, as a dimension alongside width and depth. It inherits ResNet's strategy of repeating layers but increases the number of paths, applying a split-transform-merge strategy in a simple, extensible manner. In this classification task we therefore adopt Inception's split-transform-merge idea to widen the network, which increases accuracy while leaving the model complexity essentially unchanged. In addition, every aggregated path shares the same topology, which reduces the design burden. The specific network block is provided in Fig. 8, and a sketch is given after this paragraph.
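A sketch of such a block, where a grouped 3 × 3 convolution realizes the parallel paths; the channel counts and cardinality of 32 are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Bottleneck block whose grouped 3x3 convolution realizes `cardinality` parallel paths."""
    def __init__(self, channels=256, mid_channels=128, cardinality=32):
        super().__init__()
        self.paths = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            # split-transform-merge: groups=cardinality splits this conv into parallel paths
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # In SE_ResNeXt, the SEBlock from the previous sketch re-weights
        # self.paths(x) channel-wise before this residual addition.
        return self.relu(x + self.paths(x))
```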
The detailed procedure is as follows: (1) randomly extract the six most common sonogram types of Fig. 5 from the training set in equal proportions, to prevent sample imbalance and ensure that every category is learned; (2) augment the data by rotation and normalize the image intensities; (3) select the best-performing classifier and test it on the test set to obtain the corresponding predictions. Steps (1) and (2) are sketched below.
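A hedged sketch of steps (1) and (2), assuming torchvision; the rotation range, normalization statistics, and the dataset's `labels` attribute are illustrative.

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler, DataLoader
from torchvision import transforms

# Step (2): rotation augmentation and intensity normalization.
train_transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

# Step (1): draw the six classes in equal proportions by weighting each
# sample inversely to its class frequency.
def balanced_sampler(labels):
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# loader = DataLoader(train_set, batch_size=128, sampler=balanced_sampler(train_set.labels))
```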
5.4. Establishment of Scoring Standards
Using the trained SE_ResNeXt, we predict the ultrasound video of each examined lung part across the patient's multiple examinations, and classify and score the sonograms according to [24]. A-line indicates that the patient is normally ventilated, with a score of 0; A & B-line indicates mild loss of lung ventilation, with a score of 1; B1-line indicates moderate loss of lung ventilation, with a score of 2; B1 & B2-line indicates severe loss of lung ventilation, with a score of 2.5; B2-line indicates very severe loss of lung ventilation, with a score of 3; consolidation indicates a solid lung change characterized by dynamic air bronchogram signs, with a score of 4. After the classification results are quantified, the sum of the frame scores is divided by the number of frames to obtain the final lung severity score, which ranges from 0 to 4. This computation is sketched below.
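A minimal sketch of this scoring rule; the class-name strings are illustrative labels for the six categories above.

```python
# Score assigned to each predicted sonogram class, per the standard in [24].
CLASS_SCORE = {
    "A-line": 0.0, "A&B-line": 1.0, "B1-line": 2.0,
    "B1&B2-line": 2.5, "B2-line": 3.0, "Consolidation": 4.0,
}

def severity_score(frame_predictions):
    """Average the per-frame scores over all frames of a video; result is in [0, 4]."""
    return sum(CLASS_SCORE[p] for p in frame_predictions) / len(frame_predictions)

# Example: severity_score(["A-line", "B1-line", "B2-line"]) == (0 + 2 + 3) / 3 ≈ 1.67
```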
5.5. Training Strategy
For the Stork, Mindray, Stork & Mindray, and Stork & Mindray & Philips datasets, we use 3-fold cross-validation to verify the performance of the classifier. All images are resized to 32 × 32, and a training batch consists of 128 randomly selected images. We regularize the model with dropout during training, and learn the network parameters by stochastically minimizing the cross-entropy between annotated labels and predictions with the Momentum optimizer, using an initial learning rate of 0.1 that is decayed by a factor of 10 every 30 epochs. This setup is sketched below.
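A hedged PyTorch sketch of this training setup; the momentum value, total epoch count, and placeholder model stand in for details the text does not specify, and `loader` could be the balanced DataLoader from the earlier sketch.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 6))  # placeholder for SE_ResNeXt
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # momentum value assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # 10x drop every 30 epochs
criterion = nn.CrossEntropyLoss()  # cross-entropy between annotated labels and predictions

for epoch in range(90):                  # total epoch count assumed
    for images, labels in loader:        # batches of 128 images resized to 32 x 32
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```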