Enhancing Intelligent Anemia Detection via Unifying Global and Local Views of Conjunctiva Image with Two-Branch Neural Networks

Background: Anemia is one of the most widespread clinical conditions worldwide, and it can adversely affect people's daily life and work. Given the broad need for anemia detection and the inconvenience of traditional blood testing, many deep learning detection methods based on image recognition have been developed in recent years, including methods that detect anemia from images of an individual's conjunctiva. However, existing methods that use a single conjunctiva image cannot reach satisfactory accuracy in many real-world application scenarios. Results: To enhance intelligent anemia detection using conjunctiva images, we propose a new algorithmic framework that makes full use of the information contained in the image. Concretely, we fully explore both the global and the local information in the image, and adopt a two-branch neural network architecture to unify these two aspects. Conclusions: Compared with existing methods, our method more fully exploits the information contained in a single conjunctiva image and achieves more reliable anemia detection; the experimental results verify the effectiveness of the new algorithm.

extra cost and the intervention of professionals, and brings obvious limitations to its wide application in daily life [2][3][4].
In recent years, to improve the convenience of anemia examination, many studies have investigated effective approaches for detecting anemia in a noninvasive manner [4][5][6][7]. One of the most promising directions is to deploy artificial intelligence algorithms to discover the correlation of anemia with images of human surface organs, including fingernails, retinas, conjunctivas, etc. In other words, instead of testing blood samples, artificial intelligence algorithms for image classification and recognition are applied to anemia diagnosis [8][9]. Owing to their non-invasive nature and the absence of any professional blood-testing requirement, such approaches are less expensive and more attractive than traditional ones.
In this paper, we focus on developing more effective algorithms for one such approach, namely detecting anemia from photos of the conjunctiva [10][11][12]. Compared with similar approaches such as diagnosis with retina images, detection with conjunctiva images requires no extra professional equipment, which brings much convenience to users. It can therefore be easily deployed on Web-based platforms such as mobile apps, letting people take and upload photos at any time and get a diagnosis result immediately.
However, there is room for improvement in existing algorithms for detecting anemia from conjunctiva images. One of the main issues is that, as demonstrated in previous reports and in our own experiments, existing algorithms cannot match human experts in many real-world applications and are easily affected by image quality. Such drawbacks might limit their application prospects.
To further enhance the performance of anemia diagnosis from photos of the conjunctiva, we propose in this paper a new algorithmic framework to address these problems. As discussed in detail later in this paper, the main issue preventing previous algorithms from achieving higher accuracy may be that they cannot fully exploit the information in the image, especially the local information. Inspired by this analysis, we put forward a new perspective that unifies both the global and the local information of the image data, propose a simple yet effective deep convolutional neural network architecture to fuse them, and obtain a more robust and reliable deep learning model for anemia detection using only a single image of an individual's conjunctiva. We name the resulting model GLUDA (Global and Local views Unification for Detection of Anemia). We illustrate the overall workflow of the GLUDA model in Fig. 1.

The GLUDA model benefits from several key design aspects, among which the two-branch architecture plays a central role in unifying both the global and the local information of a single conjunctiva image. We choose EfficientNet as the structure of the two sub-networks when constructing the whole network. Recently proposed by the Google Brain team, EfficientNet is an efficient convolutional neural network family that scales up CNN models by balancing network depth, width, and resolution in a systematic way. Formally, denote the input conjunctiva image as a tensor X, and let X' = ROI(X) be its local view. The two sub-networks Neti (i = 1, 2) in GLUDA can then be written as:

Net1(X, d, w, r) and Net2(X', d, w, r),  (3)

where d, w, and r denote the network depth, width, and resolution. Note that we construct the two EfficientNet sub-branches with exactly the same structure, so they share the same predefined tensor shape <H, W, C> in each layer, although their weight parameters differ after training. Denoting the concatenation operation as Concat, the fully connected layer as FC, the output activation as σ, and function composition as ⊙, the output of the GLUDA network can be written as:

Net(X, d, w, r) = σ ⊙ FC(Concat(Net1(X, d, w, r), Net2(ROI(X), d, w, r))).

(4)
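The fusion in Eq. (4) can be sketched in a few lines of numpy. Here, toy linear-plus-ReLU feature extractors stand in for the two EfficientNet branches; all shapes and weights are illustrative, not those of the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def branch_features(x, weights):
    """Stand-in for an EfficientNet branch: maps a flattened image to features."""
    return np.maximum(x @ weights, 0.0)  # linear map + ReLU

def gluda_forward(x_global, x_local, w1, w2, w_fc, b_fc):
    """Two-branch fusion as in Eq. (4): concatenate branch features, then FC + sigmoid."""
    f1 = branch_features(x_global, w1)    # global-view features
    f2 = branch_features(x_local, w2)     # local-view (ROI) features
    fused = np.concatenate([f1, f2])      # simple concatenation fusion
    logit = fused @ w_fc + b_fc           # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid -> anemia probability

# Toy dimensions: 8-element "images", 4-dim branch features.
d_in, d_feat = 8, 4
w1 = rng.standard_normal((d_in, d_feat))
w2 = rng.standard_normal((d_in, d_feat))  # same shape as w1, independent weights
w_fc = rng.standard_normal(2 * d_feat)
x = rng.standard_normal(d_in)             # global view X
x_roi = rng.standard_normal(d_in)         # local view X' = ROI(X)
p = gluda_forward(x, x_roi, w1, w2, w_fc, 0.0)
print(0.0 < p < 1.0)  # prediction is a probability in (0, 1)
```

As in the model, the two branches share a shape but not weights, so each view contributes its own learned features before fusion.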
Bringing together the formal discussion above, we obtain the overall training framework of the GLUDA model. To obtain optimal parameters for the whole network, the compound scaling method and a small grid search can be applied. Briefly, the compound scaling method sets d = α^φ, w = β^φ, r = γ^φ, subject to α · β^2 · γ^2 ≈ 2, with a user-specified coefficient φ, and searches for optimal α ≥ 1, β ≥ 1, γ ≥ 1 to obtain appropriate depth, width, and resolution settings for the network.

We report the experimental results of both series in Table 1, Fig. 2, and Fig. 3. Rows 2 to 4 of Table 1 and Fig. 2 show the testing results of models trained without data augmentation, and rows 6 to 8 of Table 1 and Fig. 3 show the testing results of models trained with data augmentation, where random rotation, cropping, and flipping are applied and the augmented training set is extended to 1396 images in total. In each series of experiments, the GLUDA model outperformed EfficientNet_Origin and EfficientNet_ROI, respectively. When data augmentation was not adopted, the GLUDA model achieved the highest scores among the three models, with AUC = 0.9305 and Acc = 0.8313, while EfficientNet_ROI obtained the lowest, with AUC = 0.9087 and Acc = 0.7937. The sensitivity and specificity scores and their corresponding 95% confidence intervals were also highest for the GLUDA model. When data augmentation was applied before training, all three models achieved higher scores than their counterparts trained without augmentation, and among the three, GLUDA still obtained the highest AUC, Acc, Sen, and Spe. According to these comparisons, the GLUDA model fuses the global and local views of conjunctiva images effectively, and therefore outperforms the two models that use only the global (EfficientNet_Origin) or the local (EfficientNet_ROI) view.
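The compound scaling rule described above can be illustrated with a short script. The α, β, γ values below are of the size reported for EfficientNet's grid search and φ = 1 is an arbitrary choice; they are examples, not values tuned for GLUDA:

```python
# EfficientNet-style compound scaling:
#   depth d = alpha**phi, width w = beta**phi, resolution r = gamma**phi,
#   subject to alpha * beta**2 * gamma**2 ~= 2, so FLOPs grow roughly as 2**phi.
alpha, beta, gamma = 1.2, 1.1, 1.15  # illustrative grid-search result
phi = 1                              # user-specified compound coefficient

d = alpha ** phi
w = beta ** phi
r = gamma ** phi
flops_factor = alpha * beta ** 2 * gamma ** 2
print(d, w, r)                       # scaled depth, width, resolution
print(abs(flops_factor - 2.0) < 0.1) # constraint approximately satisfied
```

Increasing φ then scales all three dimensions together instead of enlarging only one of them.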
The design of GLUDA mimics a professional's decision-making process: when using a conjunctiva image to assist anemia diagnosis, a professional would first make a preliminary judgment on the overall conjunctiva image, assessing the potential risk that the corresponding individual has anemia. Then, to further verify or reject this judgment, they would carefully observe the regions of interest (ROI), make a comparative analysis against the whole image, and finally reach a comprehensive diagnostic judgment.

Several key steps of the GLUDA model can be summarized as follows.

Firstly, in addition to the original image data serving as the global view, the region of interest (ROI) is extracted from the original image to serve as the local view. Formally, for any conjunctiva image X, we apply an ROI extraction algorithm to produce the local view, denoted as:

X' = ROI(X),

where ROI(·) denotes any appropriate ROI extraction algorithm. The obtained local view image is then resized to the same size as X, and is still denoted X' for convenience. X and X' are then treated as two equal parts forming a tuple (X, X') as the input of the model, and the learning task is reframed as obtaining a mapping function f that produces the mapping f(X, X') → y for any (X, X') with high generalization ability.

Secondly, the overall framework comprises two branches of deep convolutional neural networks, which serve as feature extractors for the original conjunctiva image and the ROI conjunctiva image, respectively. The network block architecture of each branch is identical to that of the EfficientNet model, which comprises convolutional layers, mobile inverted bottleneck

MBConv layers, and pooling layers as its building blocks [25-27]. There are various versions of the
EfficientNet model; we adopt EfficientNet-B0 in our study for convenience, but note that other versions could substitute for EfficientNet-B0 here without substantial influence.

Thirdly, the extracted higher-level features of the two branches are fused together and passed through fully connected layers to learn a more comprehensive prediction. Concretely, for each view of the image data, a set of higher-level features is obtained after passing through the layers of the corresponding branch. To fuse these two sets of features, a simple concatenation operation is carried out, which concatenates the two sources of features into a unified representation. The fused features are then passed through the FC layer to output a prediction of anemia. Note that feature fusion operations other than concatenation might also be applicable here, but simple concatenation serves the task well in practice.
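The first step above, building the (X, X') input tuple, can be sketched as follows. The crop coordinates and the nearest-neighbour resize are purely illustrative; a real system would localize the conjunctival ROI automatically, for example by segmentation:

```python
import numpy as np

def extract_roi(image, top, left, height, width):
    """Hypothetical ROI extraction: crop a region of the conjunctiva image."""
    return image[top:top + height, left:left + width]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize so the local view matches the global view's size."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return image[rows][:, cols]

# Build the (X, X') input tuple for the two branches.
x = np.zeros((224, 224, 3), dtype=np.uint8)              # global view X
roi = extract_roi(x, top=80, left=40, height=64, width=144)
x_prime = resize_nearest(roi, 224, 224)                  # local view X'
print(x.shape == x_prime.shape)  # True: both views are 224x224x3
```

Resizing X' to the shape of X is what allows the two sub-branches to share an identical layer structure.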

One aspect worth emphasizing is that, even when trained without data augmentation, the obtained GLUDA model still outperformed EfficientNet_Origin and EfficientNet_ROI trained with data augmentation. We plot this comparison in Fig. 4. The contrast is impressive, considering that in this case GLUDA used only 532 conjunctiva images for training, far fewer than the 1396 augmented images used to train EfficientNet_Origin and EfficientNet_ROI. Given that data augmentation can provide extra information to boost model training in many cases, it is reasonable to conclude that the way GLUDA fuses the two views of conjunctiva images successfully draws in valid discriminative information for training, information that is missing when only a single view is used. This gain in information is complementary to that of data augmentation, since the performance of GLUDA could be further improved when trained on the augmented set, as shown in Fig. 4.

The approaches above mainly rely on traditional image processing techniques, which are prone to being affected by defects in image color and other aspects, and often require experts' intervention to judge image quality, making them unstable and labor-intensive. Therefore, many researchers have switched to deep learning approaches for more stable and intelligent detection. Such approaches aim to train deep learning models on a set of annotated images, which can then give automatic diagnoses without any intervention from professionals. As confirmed by previous clinical research and empirical studies, the rich information contained in conjunctiva images can indeed provide a sufficient basis for the diagnosis of anemia.
Therefore, this learning task is empirically solvable, and what remains is to design more effective models based on the characteristics of the data and the task.

All images are resized to 224*224*3 before training. The whole data set was randomly divided into three subsets: a training set, a verification set, and a test set. We randomly select around 60% of the anemic images and around 60% of the non_anemic images to form the training set, randomly select about 20% of each class to form the verification set, and put all remaining images together to form the test set. Note that we divided the whole data set in this manner to ensure that the proportion of positive and negative instances remains almost unchanged across the three subsets. The composition of the data set, with the size of each class, is shown in Table 2.
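The stratified 60/20/20 split described above can be sketched as follows; the image-id lists and class counts below are placeholders, not the paper's data:

```python
import random

def stratified_split(anemic, non_anemic, seed=0):
    """Split each class ~60/20/20 into train/verification/test sets so the
    class proportions stay roughly constant across the three subsets."""
    rng = random.Random(seed)
    splits = {"train": [], "verification": [], "test": []}
    for group in (anemic, non_anemic):      # split each class independently
        items = list(group)
        rng.shuffle(items)
        n = len(items)
        n_train, n_ver = int(0.6 * n), int(0.2 * n)
        splits["train"] += items[:n_train]
        splits["verification"] += items[n_train:n_train + n_ver]
        splits["test"] += items[n_train + n_ver:]
    return splits

# Toy example: 100 anemic and 200 non_anemic placeholder image ids.
s = stratified_split([f"a{i}" for i in range(100)],
                     [f"n{i}" for i in range(200)])
print(len(s["train"]), len(s["verification"]), len(s["test"]))  # 180 60 60
```

Because each class is split independently, the anemic-to-non_anemic ratio is (up to rounding) the same in all three subsets.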

Considering that the sample sizes of the classes in our data set are unbalanced (there are more than twice as many non_anemic conjunctiva images as anemic ones), we use focal loss [28] as the loss function for our GLUDA model and the comparison models. Formally, for each conjunctiva image, if the model outputs probability p_t for the ground-truth label, then its focal loss is:

FL(p_t) = -(1 - p_t)^γ · log(p_t),

where γ is a user-defined balancing parameter. The focal loss function is specially designed for training problems with class imbalance. For all models trained in this section, we simply set γ = 0.5 across all experiments.

We compare GLUDA with the following models:
(1) EfficientNet_Origin: an EfficientNet model trained on the original conjunctiva images;
(2) EfficientNet_ROI: an EfficientNet model trained on the ROI of the original conjunctiva
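The focal loss above can be written as a short function; the probabilities below only illustrate how γ down-weights easy, well-classified samples relative to plain cross-entropy:

```python
import math

def focal_loss(p_t, gamma=0.5):
    """Focal loss for one sample, FL(p_t) = -(1 - p_t)**gamma * log(p_t),
    where p_t is the model's probability for the ground-truth label."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy sample (high p_t) is down-weighted far more than a hard one (low p_t).
easy, hard = focal_loss(0.9), focal_loss(0.1)
ce_easy, ce_hard = -math.log(0.9), -math.log(0.1)    # plain cross-entropy
print(easy < ce_easy and hard < ce_hard)  # focal loss never exceeds cross-entropy
print(hard / easy > ce_hard / ce_easy)    # relative weight on hard samples grows
```

With the minority (anemic) class, misclassified samples therefore dominate the gradient instead of the abundant easy non_anemic ones.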

images.
The training details and parameter settings of the GLUDA model and the comparison models are as follows. First, following the pre-processing above, we set the input data format to 224*224*3 and use the default network structure of EfficientNet-B0 to form the framework of EfficientNet_Origin and EfficientNet_ROI, and the corresponding building blocks of GLUDA. We adopted transfer learning, fine-tuning the models from pre-trained EfficientNet-B0 weights. We use Adaptive Moment Estimation (Adam) as the optimizer for all models, with a learning rate of 0.003, a learning rate decay factor of 0.99, and a batch size of 16. Secondly, all models are trained on the training set shown in Table 2, with the total number of epochs set to 50, using the AUC on the verification set as the reference value for model selection to output the final models. Thirdly, the AUC, Sen, Spe, and Acc of the obtained models (EfficientNet_Origin, EfficientNet_ROI, and GLUDA, respectively) on the test set are calculated and reported for performance comparison.

This study procedure was approved by the Ethics Committee of the Third Affiliated Hospital of Sun Yat-sen University, and written informed consent was acquired from all participants.
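The training schedule and AUC-based model selection described above can be sketched as follows, assuming the decay factor is applied once per epoch; the training step and the validation AUC are placeholders, not real measurements:

```python
lr, decay, epochs = 0.003, 0.99, 50   # settings reported in the text

best_auc, best_epoch = -1.0, -1
for epoch in range(1, epochs + 1):
    # train_one_epoch(model, train_set, lr)   # placeholder training step
    lr *= decay                               # learning-rate decay factor 0.99
    val_auc = 0.5 + epoch / 200               # placeholder verification-set AUC
    if val_auc > best_auc:                    # model selection by verification AUC
        best_auc, best_epoch = val_auc, epoch # save_checkpoint(model) here
print(best_epoch)                             # epoch of the selected model
print(lr < 0.002)                             # lr decayed from 0.003 over 50 epochs
```

In a real run the verification AUC would plateau or oscillate, and the checkpoint with the best AUC, not the last one, becomes the final model.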

All participants gave written informed consent for the publication of findings.

The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.