Clothing attribute recognition algorithm based on improved YOLOv4-Tiny

Aiming at the low accuracy of clothing attribute recognition caused by factors such as scale variation, occlusion and targets extending beyond the image boundary, a novel clothing attribute recognition algorithm based on an improved YOLOv4-Tiny is proposed in this paper. With YOLOv4-Tiny as the basic model, the multi-scale feature extraction module Res2Net is first adopted to optimize the backbone network, which enlarges the receptive field of each network layer and extracts richer fine-grained multi-scale clothing feature information. Then, the three feature layers output by the feature extraction network are up-sampled, and high-level semantic features are fused with shallow features to obtain rich shallow fine-grained feature information. Finally, the K-Means clustering algorithm is employed to optimize the anchor box parameters, yielding anchor boxes that better match clothing objects and improving the fit between clothing attribute characteristics and the network. Experimental results demonstrate that the proposed method outperforms the original YOLOv4-Tiny network in terms of accuracy, speed and model size, and is more suitable for deployment on resource-limited embedded devices.


Introduction
With the development of computer vision, the analysis and understanding of clothing images has become a very active research topic in recent years. How to detect and recognize clothing in images is a current research hotspot with applications in many fields, such as clothing image retrieval [1], matching recommendation [2], pedestrian description [3], and workwear detection and recognition [4].
Traditional target detection extracts features such as the color, texture and edges of target objects in images with hand-crafted operators, and then locates and classifies the objects [5]. However, the varied morphology of garments makes feature extraction difficult, and the presence of multiple garments of different sizes in a single image, together with occlusion, scaling and viewpoint changes, poses great challenges to both traditional image feature extraction methods and classification models [6]. The rise of deep convolutional neural networks provides new ideas for recognizing such complex targets. Typical deep learning target detection algorithms fall into two categories. One is region proposal-based detection, represented by R-CNN [7], Fast R-CNN [8], Faster R-CNN [9] and SPP-Net [10]. The other is regression-based detection, which follows the end-to-end idea: the image is normalized to a uniform size and input into a convolutional neural network, and the category and location of the target object are predicted by regression; representative algorithms include the YOLO [11] and SSD [12] series. In recent years, more and more scholars have studied clothing recognition based on deep learning.
Zhang et al. [13] propose an optimized residual convolutional neural network for clothing classification that improves multi-category recognition accuracy by adjusting the order of the batch normalization, activation and convolutional layers, using a parallel "pooling layer + convolution layer" structure, and replacing the fully connected layer with global average pooling. However, this method has low accuracy for clothing with complex backgrounds. Lu et al. [14] propose a novel deep residual network that improves recognition accuracy by rearranging the "BN + ReLU + convolution" layers in the traditional residual block, introducing an attention mechanism and adjusting the structure of the convolution kernels, but recognition is slow because the network has more than 58 million parameters. Liu et al. [15] propose a cross-domain clothing retrieval method combined with an attention mechanism. Built on a deep convolutional neural network, the attention mechanism reweights features to enhance the important features of clothing images and suppress the unimportant ones, which effectively handles the interference of complex backgrounds and clothing deformation caused by viewpoint and pose, but still cannot cope with occluded clothing images.
In this paper, based on the YOLOv4-Tiny model [16], we propose a novel clothing attribute recognition algorithm. Firstly, the Res2Net module is used to optimize the backbone network, enlarging the receptive field of each network layer and extracting fine-grained multi-scale clothing feature information. Then, the three feature layers output by the feature extraction network are up-sampled and fused; the high-level semantic features and shallow features are fused and passed into the FPN through two feature channels. Finally, the K-Means clustering algorithm is employed to optimize the anchor box parameters and obtain anchor boxes that better match clothing targets. The performance of the proposed method is evaluated on the DeepFashion2 dataset [17]; comprehensive evaluations show that it achieves high accuracy and outperforms the compared algorithms.
The remainder of this paper is organized as follows: Sect. 2 reviews the related work on the YOLOv4-Tiny model. In Sect. 3, we describe the proposed multi-scale feature extraction network, fine-grained feature fusion and anchor box parameter optimization. In Sect. 4, we perform experiments to demonstrate the effectiveness of the proposed method. Finally, Sect. 5 draws conclusions.

Related work
The YOLOv4-Tiny network is a simplified version of YOLOv4. It is a lightweight model with only one tenth of the original number of parameters and a faster detection speed. YOLOv4-Tiny is characterized as multi-task, end-to-end, attention-based and multi-scale. Compared with other lightweight models, it has significant performance advantages [18].
CSPdarknet53_tiny is used as the backbone feature extraction network in YOLOv4-Tiny, with LeakyReLU as the activation function. CSPdarknet53_tiny uses the CSPNet structure: the backbone part is a stack of residual blocks, while the other part connects the input feature layer to the output layer through a skip connection, and the two parts are fused by concatenation. The two effective feature layers of size 26 × 26 and 13 × 13 produced by the backbone network are passed into the feature pyramid network (FPN) separately; the 13 × 13 feature layer is convolved, up-sampled and fused with the 26 × 26 feature layer, and finally two feature channels are obtained for target prediction.
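For concreteness, the following is a minimal PyTorch sketch of a CSP-style block in the spirit of the Resblock_body described above; the module names (ConvBNLeaky, CSPResblockBody) and channel widths are illustrative assumptions, not the paper's original implementation.

```python
import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Conv -> BatchNorm -> LeakyReLU, the basic unit of CSPdarknet53_tiny."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
        )
    def forward(self, x):
        return self.block(x)

class CSPResblockBody(nn.Module):
    """CSP-style block: half the channels pass through a small residual
    stack, the other half skip straight to the cross-stage concat."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = ConvBNLeaky(ch, ch)
        self.conv2 = ConvBNLeaky(ch // 2, ch // 2)
        self.conv3 = ConvBNLeaky(ch // 2, ch // 2)
        self.conv4 = ConvBNLeaky(ch, ch, k=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.conv1(x)
        route = x                                       # skip branch
        x = torch.split(x, x.shape[1] // 2, dim=1)[1]   # keep half the channels
        x = self.conv2(x)
        route1 = x
        x = self.conv3(x)
        x = torch.cat([x, route1], dim=1)               # inner residual concat
        feat = self.conv4(x)                            # feature layer handed to the FPN
        x = torch.cat([route, feat], dim=1)             # CSP cross-stage concat
        return self.pool(x), feat                       # channels double, resolution halves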
When YOLOv4-Tiny is applied to the clothing image recognition task, with the model trained and tested on the DeepFashion2 dataset, the recognition results are as shown in Fig. 1. It can be seen that clothing targets are missed or incorrectly recognized due to factors such as scale, occlusion and targets extending beyond the image boundary. The specific reason for the missed detections and incorrect recognitions is that the positional information of the 26 × 26 shallow feature layer is strong but its semantic information is weak, while the semantic information of the 13 × 13 deep feature layer is strong but its localization information is weak.
The class activation mapping (CAM) [19] visualization of YOLOv4-Tiny is shown in Fig. 2. The bright parts of the heat map represent regions where the model's prediction attention is high. It can be seen that the bright region of the CAM of the FPN layer mainly focuses on the lower part of the dress, indicating that the YOLOv4-Tiny algorithm extracts only a small amount of feature information from a single garment.

Optimization ideas
In order to improve the clothing attribute recognition accuracy, we improve the YOLOv4-Tiny model as shown in Fig. 3. The optimization ideas are as follows:

1. To improve the representation of multi-scale clothing feature information, the multi-scale feature extraction module Res2Net is used to optimize the backbone network of YOLOv4-Tiny. The 3 × 3 convolutional layers in the original Resblock_body are replaced with smaller groups of convolutions connected layer by layer in a hierarchical residual-like style, which enables the receptive field to vary at a more granular level so that both local and global fine-grained clothing attribute features can be captured.

2. To retain more shallow features and narrow the semantic and resolution gap between shallow and deep feature maps, the feature fusion network structure is optimized. The three feature layers (13 × 13, 26 × 26 and 52 × 52) output by the feature extraction network are up-sampled, and deep semantic features are fused with shallow localization features, so that richer shallow fine-grained feature information reaches the prediction network.

3. To enhance the fit between clothing attribute features and the network, the K-Means clustering algorithm is adopted to cluster the clothing dataset. The widths and heights of the clothing target boxes are clustered using the intersection over union (IoU) as the metric function, which increases the IoU between the prior boxes and the bounding boxes, and each box is assigned to its nearest cluster. Through iterative learning, anchor boxes better suited to clothing attributes are obtained.

Multi-scale feature extraction optimization
In YOLOv4-Tiny, deepening the feature extraction network can extract more semantic information, but too many convolution operations lead to information loss during feature extraction and reduce recognition accuracy. In order to improve the clothing attribute recognition accuracy, the multi-scale feature extraction module Res2Net [20] is used to optimize the YOLOv4-Tiny backbone network. A comparison between the original module and the Res2Net module is shown in Fig. 4: the original $n$-channel 3 × 3 filter is replaced by a series of smaller filter groups with $w$ channels each (let $n = s \times w$, where $s$ denotes the scale dimension). Figure 4a shows the structure of one $n$-channel 3 × 3 filter, and Fig. 4b shows the structure of the smaller $w$-channel filter groups; the number of feature groups in the Res2Net block is called the scale dimension $s$, temporarily set to 4 here.

The different convolution groups are connected layer by layer in a residual-like way, so that the receptive field keeps changing and more fine-grained multi-scale clothing features are extracted. As can be seen in Fig. 4, the small filter groups are connected in a hierarchical residual-like style to increase the number of scales the output features can represent. The input feature maps are divided into several groups. The first group of filters extracts features from one group of input feature maps; the output of the previous group is then sent to the next group of filters together with another group of input feature maps. This repeats until all input feature maps are processed, after which all feature maps are concatenated and sent to a set of 1 × 1 filters for information fusion. Along any path from input feature map to output feature map, the equivalent receptive field increases after each 3 × 3 filter, and the combination effect yields many equivalent feature scales. This split-and-merge connection structure allows the output of the Res2Net module to contain receptive fields of different numbers, sizes and scales, as well as their combinations.

In the Res2Net structure shown in Fig. 4b, after the 1 × 1 convolutional layer the feature map is divided into $s$ subsets, denoted by $x_i$ with $i \in \{1, 2, \ldots, s\}$. Each subset has $1/s$ of the original number of channels and the same spatial size as the original feature map. Except for $x_1$, each feature map subset $x_i$ has a corresponding 3 × 3 convolutional layer, denoted by $K_i(\cdot)$, whose output is denoted $y_i$. The subset $x_i$ is summed with the output $y_{i-1}$ of $K_{i-1}(\cdot)$ and then fed into $K_i(\cdot)$, so $y_i$ can be written as

$$
y_i =
\begin{cases}
x_i, & i = 1 \\
K_i(x_i), & i = 2 \\
K_i(x_i + y_{i-1}), & 2 < i \le s
\end{cases}
\tag{1}
$$

According to Fig. 4b and Eq. (1), each 3 × 3 convolution kernel $K_i(\cdot)$ can receive the feature information of all previous feature map subsets $\{x_j, j \le i\}$. After each subset $x_j$ passes through a 3 × 3 convolutional kernel, the output has a larger receptive field than $x_j$. Because of this combination effect, the output of the Res2Net module contains different numbers, sizes and scales of receptive fields and their combinations.
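The split-and-merge structure described above can be sketched as follows in PyTorch; batch normalization and activation are omitted for brevity, and the class name Res2NetBlock and channel handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    """Sketch of a Res2Net module with scale dimension s: the feature map
    after the first 1x1 conv is split into s subsets, which are processed
    by 3x3 convs in a hierarchical residual-like style (Eq. (1)) and then
    re-fused by a 1x1 conv."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        w = channels // scale                  # width of each subset
        self.conv_in = nn.Conv2d(channels, channels, 1, bias=False)
        # one 3x3 conv per subset except x_1, which is passed through
        self.convs = nn.ModuleList(
            [nn.Conv2d(w, w, 3, padding=1, bias=False) for _ in range(scale - 1)]
        )
        self.conv_out = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        identity = x
        x = self.conv_in(x)
        xs = torch.chunk(x, self.scale, dim=1)          # subsets x_1 .. x_s
        ys = [xs[0]]                                    # y_1 = x_1
        for i in range(1, self.scale):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]   # y_2 = K_2(x_2); else x_i + y_{i-1}
            ys.append(self.convs[i - 1](inp))
        out = self.conv_out(torch.cat(ys, dim=1))       # 1x1 fusion of all scales
        return out + identity                           # residual connection
```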
The Res2Net module is integrated into the backbone network of YOLOv4-Tiny, and the optimized network is called YOLO-Res2Net, whose structure is shown in Fig. 5. Firstly, the input image of size 416 × 416 is convolved to generate a feature map of size 208 × 208, and a 104 × 104 feature map is then obtained through a convolution-normalization-activation operation. Next, Res2Net module groups are used to extract richer feature information from the feature map; a Res2block is formed by stacking a number of Res2Net modules. The stack depths used in the optimized network (the red dashed box in Fig. 5) are 3, 4 and 6, respectively: Res2block1 stacks 3 Res2Net modules to obtain a 52 × 52 feature layer, Res2block2 stacks 4 modules to obtain a 26 × 26 feature layer, and Res2block3 stacks 6 modules to obtain a 13 × 13 feature layer.

To verify the role of the scale dimension in YOLO-Res2Net for clothing attribute recognition, models with different scales are tested on the DeepFashion2 dataset; the experimental results are shown in Table 1. The clothing attribute recognition accuracy of YOLO-Res2Net improves as the scale dimension increases: when s is 2, 3 or 4, the residual connections between network hierarchies generate a rich set of equivalent scales, and this rich feature information benefits recognition accuracy. However, when the scale is 5 or 6, the test image size limits how much multi-scale feature information can be extracted, and only a limited improvement is possible. Considering both model complexity and recognition accuracy, the scale dimension of YOLO-Res2Net is set to 4.
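Under these assumptions, the stage layout can be sketched as follows, reusing the Res2NetBlock sketch above; the channel widths are assumptions, since Fig. 5 specifies only the stack depths and spatial sizes.

```python
import torch.nn as nn
# Res2NetBlock: the module sketched in the previous listing.

def downsample(in_ch, out_ch):
    # stride-2 3x3 conv used between stages to halve the spatial size
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, 2, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class YOLORes2NetBackbone(nn.Module):
    """Sketch: the stem reduces 416x416 -> 104x104, then three Res2blocks
    of 3, 4 and 6 stacked Res2Net modules emit the 52x52, 26x26 and
    13x13 feature layers passed on to the fusion network."""
    def __init__(self, widths=(64, 128, 256, 512), depths=(3, 4, 6)):
        super().__init__()
        self.stem = nn.Sequential(downsample(3, widths[0]),        # 416 -> 208
                                  downsample(widths[0], widths[0]))  # 208 -> 104
        self.stages = nn.ModuleList(
            nn.Sequential(
                downsample(widths[i], widths[i + 1]),              # halve resolution
                *[Res2NetBlock(widths[i + 1], scale=4) for _ in range(d)],
            )
            for i, d in enumerate(depths)
        )

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [52x52, 26x26, 13x13] feature maps
```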

Fine-grained feature fusion
In order to make full use of the high-level semantic features and the shallow geometric detail features of the backbone network, and to narrow the semantic and resolution gap between shallow and high-level feature maps, the feature fusion network is optimized. To verify the influence of fusing different feature layers on clothing attribute recognition, tests are carried out on the DeepFashion2 dataset; the experimental results are shown in Table 2. The results show that recognition accuracy is highest when three feature layers are fused, whereas four-layer or five-layer fusion introduces more redundant information into the clothing attribute features and reduces accuracy. Therefore, three feature layers are fused: the 13 × 13, 26 × 26 and 52 × 52 feature layers of the YOLO-Res2Net backbone are fused to obtain more fine-grained feature information without increasing the number of prediction channels. A comparison of the feature fusion network structures is shown in Fig. 6.
In Fig. 6, the original feature fusion structure is in the left dashed box, and the optimized feature fusion structure is in the right dashed box. After the improvement, the fine-grained feature information in the 52 × 52 feature layer is fused into the prediction network while the number of prediction channels remains unchanged; a rough sketch of this fusion is given below.
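The following PyTorch sketch propagates deep semantics top-down by upsampling and concatenation and folds the fused 52 × 52 map back into the 26 × 26 prediction branch. The exact wiring of Fig. 6 may differ, and all channel widths and module names here are assumptions.

```python
import torch
import torch.nn as nn

def conv1x1(in_ch, out_ch):
    # 1x1 conv used to adjust channels after each fusion step
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1))

class ThreeLayerFusion(nn.Module):
    """Sketch of three-layer fusion: 13x13 semantics flow up to 26x26 and
    52x52 via upsample + concat, and the fused 52x52 map is downsampled
    back into the 26x26 branch so only two prediction channels remain."""
    def __init__(self, ch52=128, ch26=256, ch13=512, fused=256):
        super().__init__()
        self.lat13 = conv1x1(ch13, fused)
        self.lat26 = conv1x1(ch26 + fused, fused)
        self.lat52 = conv1x1(ch52 + fused, fused)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.down = nn.Conv2d(fused, fused, 3, 2, 1)   # 52x52 -> 26x26

    def forward(self, f52, f26, f13):
        p13 = self.lat13(f13)                                    # 13x13 head input
        p26 = self.lat26(torch.cat([f26, self.up(p13)], dim=1))  # deep semantics into 26x26
        p52 = self.lat52(torch.cat([f52, self.up(p26)], dim=1))  # fine-grained 52x52 fusion
        p26 = p26 + self.down(p52)     # shallow detail folded back into the 26x26 head
        return p13, p26                # two prediction channels, as in YOLOv4-Tiny
```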

Anchor box parameter optimization
YOLOv4-Tiny divides the input image into several grids; if the center of the ground-truth box of a clothing attribute falls in a grid, that grid is responsible for predicting the attribute. Therefore, a reasonable anchor box setting is very important for model performance. When the DeepFashion2 dataset is used to recognize clothing attributes with the preset anchor box parameters, the intersection over union (IoU) between the anchor boxes and the bounding boxes is below the threshold of 0.5, leading to many missed detections, so the anchor boxes need to be reselected. The K-Means clustering algorithm [21] is used to cluster the DeepFashion2 dataset: with IoU as the metric function, the widths and heights of the clothing target boxes are clustered, the IoU between anchor boxes and bounding boxes is increased, and each box is assigned to its nearest cluster center. Through continuous learning and iteration, the cluster centers are updated until they no longer change. Anchor box parameter optimization brings the model parameters closer to the experimental dataset, and reduces the loss while improving recognition accuracy. A minimal sketch of this clustering is given below.
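As an illustration, the following sketch clusters ground-truth (width, height) pairs with 1 - IoU as the distance; the function names and the assumption that box sizes have already been extracted from the DeepFashion2 annotations are ours.

```python
import numpy as np

def iou_wh(boxes, clusters):
    """IoU between (w, h) boxes and cluster centers, both anchored at the origin."""
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (clusters[:, 0] * clusters[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=6, iters=300, seed=0):
    """K-Means over ground-truth (w, h) pairs with d = 1 - IoU as the
    distance, so large boxes do not dominate as they would under
    Euclidean distance."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        nearest = np.argmax(iou_wh(boxes, clusters), axis=1)   # max IoU = min(1 - IoU)
        new = np.array([np.median(boxes[nearest == i], axis=0)
                        if np.any(nearest == i) else clusters[i] for i in range(k)])
        if np.allclose(new, clusters):
            break                                              # centers stable: converged
        clusters = new
    return clusters[np.argsort(clusters[:, 0] * clusters[:, 1])]

# usage: boxes = np.array([[w1, h1], [w2, h2], ...]) from the dataset labels,
# then anchors = kmeans_anchors(boxes, k=6)  # YOLOv4-Tiny uses 6 anchors
```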

Experimental results
The experiments use the PyTorch deep learning framework on Ubuntu 20.04.5, with an Intel Core i5-9400F CPU and an NVIDIA RTX 2070 GPU. The model is trained for 100 iterations, and we observe the impact on network performance by gradually changing the batch size; the training parameters are listed in Table 3.

Ablation experiments
The DeepFashion2 dataset is used to train and test YOLOv4-Tiny and the improved methods, where YOLOv4-Tiny is the baseline model, and Model-Optimized-A, Model-Optimized-B and Model-Optimized-C represent the models after integrating the Res2Net module, the feature fusion optimization and the anchor box parameter optimization, respectively; the results are shown in Table 4. The recognition accuracy of the optimized models improves to different degrees: optimizing YOLOv4-Tiny with multi-scale feature extraction, feature fusion and anchor box parameter optimization increases accuracy by 5.12%, 2.82% and 1.93%, respectively. As can be seen from Table 4, combining the optimization methods improves recognition accuracy significantly. By optimizing the backbone network structure, the feature fusion and the anchor box parameters simultaneously, Model-Optimized-E achieves the highest mAP, 6.75% higher than the original model.

In addition, Grad-CAM is used to visualize the class activation mapping of the main models above. In the visualization examples shown in Fig. 7, stronger CAM areas are covered with brighter colors. Owing to the stronger multi-scale ability, fine-grained feature fusion and anchor box parameter optimization, the CAM of the optimized model is more concentrated on the clothing area than that of the original model, indicating that the proposed methods improve the network's ability to extract clothing attribute features.
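For reference, a minimal Grad-CAM sketch using PyTorch hooks follows; score_fn, which reduces the detector output to the scalar being explained (e.g. the top class confidence), is an assumption of this sketch rather than part of the original pipeline.

```python
import torch

def grad_cam(model, layer, image, score_fn):
    """Minimal Grad-CAM sketch: hook a convolutional layer, weight its
    activations by the spatially averaged gradients of a detection score,
    and ReLU the weighted sum into a heat map."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = score_fn(model(image))       # scalar score to explain (assumed helper)
        model.zero_grad()
        score.backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP over H, W
        cam = torch.relu((weights * acts["a"]).sum(dim=1))   # weighted sum of channels
        cam = cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
    finally:
        h1.remove(); h2.remove()             # always detach the hooks
    return cam  # upsample to the input size before overlaying on the image
```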

Qualitative analysis
To verify the effectiveness of the proposed method for clothing attribute recognition, qualitative analysis is conducted in three aspects: different scales, different degrees of occlusion, and different degrees of out-of-bounds.

Figure 8 shows the CAM visualization of different methods on clothing images of different scales, where stronger CAM areas are covered with brighter colors. Compared with YOLOv4-Tiny, the CAM of the proposed method is more concentrated on clothing at all scales. Owing to its stronger multi-scale ability, the activation maps of the proposed method tend to cover the whole clothing target in both small-scale and large-scale images, while those of YOLOv4-Tiny cover only parts of the target.

Figure 9 shows the recognition results of the proposed method and YOLOv4-Tiny on clothing images of three different scales, where the first row shows the results of YOLOv4-Tiny and the second row those of the proposed method. YOLOv4-Tiny misses the short sleeve dress, short sleeve top and skirt in the small-scale images, misses the short sleeve top and misidentifies the skirt as a short sleeve top in the medium-scale images, and misses the long sleeve top and short sleeve top in the large-scale images. In contrast, the proposed method detects and recognizes clothing accurately at all scales.

Figure 10 shows the CAM visualization of different methods on clothing images with different degrees of occlusion. Compared with YOLOv4-Tiny, the CAM of the proposed method is more concentrated on the clothing, showing that the proposed method reduces the influence of occlusion on clothing attribute information extraction. Figure 11 shows the recognition results of the proposed method and YOLOv4-Tiny on three kinds of clothing images with different occlusion degrees, where the first row shows the results of YOLOv4-Tiny and the second row those of the proposed method.
As can be seen in Fig. 11a, the arms and the leather bag slightly occlude the clothing, causing false detections by YOLOv4-Tiny. In Fig. 11b, YOLOv4-Tiny mistakenly detects the vest dress as a short sleeve top because of the hat, while the leather bag occluding the pants leads to a missed detection. In Fig. 11c, the shorts are heavily occluded by the short sleeve top and are missed, and the short sleeve dress is mistakenly detected as a skirt because it is heavily occluded by the curtain. In contrast, the proposed method obtains accurate recognition results for clothing objects with different occlusion degrees.

Figure 12 shows the CAM visualization of different methods on clothing images with different out-of-bounds degrees. The CAM of the proposed method is more concentrated on the clothing across all out-of-bounds degrees. Figure 13 shows the recognition results of the proposed method and YOLOv4-Tiny on three clothing targets with different out-of-bounds degrees. YOLOv4-Tiny mistakenly detects the short sleeve dress in Fig. 13a as a skirt and misses the short sleeve top and skirt, misses the partially out-of-bounds long sleeve top and long sleeve dress in Fig. 13b, and misses the out-of-bounds pants and long sleeve dress in Fig. 13c. In contrast, the proposed method identifies garments with different out-of-bounds degrees accurately.

Quantitative analysis
To further evaluate the performance of the proposed method, its mAP, number of parameters (Params) and frames per second (FPS) are compared with those of several other lightweight target detection models, namely FBNet [22], GhostNet [23], ShuffleNet [24], MobileNet [25] and YOLOv4-MobileNet [26]. All models are trained and tested on the DeepFashion2 dataset, and the experimental results are shown in Table 5. The results demonstrate that the proposed method has competitive accuracy, while its number of parameters and FPS are better than those of the comparison models. For example, compared with the lightweight network YOLOv4-Tiny, the number of parameters is reduced by 5.39%, the FPS is increased by 9 f/s, and the mAP is increased by 6.75%.
In particular, compared with YOLOv4-MobileNet, the latest improved lightweight network based on YOLOv4, the proposed method achieves 2.01% higher accuracy than this best-performing lightweight baseline while also having the best lightweight indexes. The comprehensive comparison shows that the proposed method achieves the highest recognition accuracy in the clothing attribute recognition task and also has obvious advantages in the lightweight network structure indexes.

Conclusion
In this paper, a novel clothing attribute recognition method based on improved YOLOv4-Tiny is proposed. Experimental results show the effectiveness of our method, which significantly improves the recognition accuracy of clothing images with different scales, different occlusion degrees and different out-of-bounds degrees. At the same time, the proposed method obtains the best lightweight indexes: the FPS reaches 213 and the model has only 4.98 M parameters, making it well suited to resource-limited edge and mobile devices. The proposed method not only provides an algorithmic basis for applications such as clothing retrieval and clothing matching, but also contributes to the intelligent clothing industry.