Complex Scene Understanding and Recognition Based on the Explainable Machine Learning in AI Models


 Information fusion is an important part of numerous neural network systems and other machine learning models. However, fusion for scene understanding and recognition in complex environments still faces problems such as difficult feature extraction, small sample sizes, and limited model interpretability. Deep reinforcement learning combines the perception ability of deep learning with the decision-making ability of reinforcement learning to learn control strategies directly from high-dimensional raw data. However, it faces challenges such as low optimization efficiency, poor generality of network models, few labeled samples, and the need for explainable decisions for users without a strong background in Artificial Intelligence (AI). Therefore, at both the application and theoretical levels, this paper aims to solve the above problems. The main contributions are: (1) optimizing feature representation based on the spatial-temporal features of behaviors in the scene, deep metric learning between adjacent layers, and cross-layer learning theory, and then proposing a lightweight reinforcement learning network model to address the complexity of the model to be explained and the difficulties of feature extraction and parameter tuning; (2) constructing a self-paced learning strategy for the deep reinforcement learning model and introducing a transfer learning mechanism into the optimization process, which addresses low optimization efficiency and small labeled samples; (3) designing a behavior recognition framework based on a multi-perspective deep knowledge transfer learning model and constructing an explainable behavior descriptor, which addresses poor network generality and weak explainability. Our research is of great theoretical and practical significance in the fields of artificial intelligence and public security.


INTRODUCTION
In recent years, Skynet and other widely deployed video surveillance systems have provided an efficient means of data acquisition for scene analysis and recognition, and are also an important part of smart cities and urban security monitoring. The understanding and recognition of complex scenes is one of the key technologies for realizing artificial intelligence [1][2]. This paper mainly presents an in-depth study of crowd density estimation in surveillance scenes, face occlusion recognition in front of ATMs, and recognition of abnormal group behavior in video scenes. Its purpose is to determine what targets and behaviors exist in a scene, so it can be widely applied in banks, stations, office buildings and other public places. On the other hand, in the last few years, the interest in deriving complex AI models capable of achieving unprecedented levels of performance has been progressively displaced by a growing concern with alternative design factors aimed at making such models more usable in practice.
Considering that these behaviors seriously threaten the safety of society and individuals, automatically analyzing and processing such data and issuing early warnings would undoubtedly provide more protection for social security. Given the multimodal nature of the data and the complexity of the scene, we expect to use popular machine learning algorithms to solve this complex scene understanding and recognition problem. Compared with other scene analysis tasks [3][4][5], the scene analysis and recognition studied here is more difficult; the difficulty of extracting target and behavior features, the small sample size, and the interpretability of the model are the bottlenecks and keys to solving this problem.
Daniel Wolpert argued that the brain evolved not for thinking and feeling, but for controlling motion, which is the core idea of deep reinforcement learning. Its problem model includes an environment and an agent interacting with the environment [6][7][8][9]. The goal of reinforcement learning is to design a behavioral strategy for the agent that maximizes its reward in interacting with the environment. Google DeepMind used this strategy in 2016 to enable computers to surpass the level of top Go professionals. However, the development of object detection and behavior recognition in complex environments is slow, mainly because: (1) the efficiency of model optimization algorithms is not high; (2) the constructed models lack interpretability; (3) labeled samples are scarce; (4) scenes are complex and unconstrained; (5) high-dimensional video data makes parameter tuning difficult. Curriculum learning and self-paced learning are recently proposed learning strategies [10][11]. Their core idea is to simulate the cognitive mechanism of human beings: first learn simple and general knowledge structures, then gradually increase the difficulty and transition to more complex and specialized knowledge. The two methods share a similar conceptual learning paradigm but differ in their specific learning schemes.
In curriculum learning, the curriculum is predetermined by prior knowledge and remains fixed thereafter; this kind of approach relies heavily on the quality of prior knowledge and ignores feedback from the learner. In self-paced learning, the curriculum is determined dynamically to adapt to the learner's pace. However, self-paced learning cannot exploit prior knowledge, making it prone to over-fitting. For many computer application problems, it is necessary to build machine learning models that are both highly accurate and easy to understand. Especially in the field of scene analysis and recognition, there is a growing demand for artificial intelligence (AI) methods that not only perform well, but are also reliable, transparent and interpretable. This would allow non-professionals to understand how and why an AI algorithm makes a particular decision, which will increase the reliability of machine learning models in AI systems.
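The self-paced scheme above can be made concrete with a minimal sketch. In its common hard-weighting form, each iteration admits only the samples whose current loss falls below a threshold 1/λ ("easy" samples), refits on that subset, and then anneals λ so harder samples enter later. The regression task, parameter values, and function names below are illustrative assumptions, not taken from this paper.

```python
# Minimal hard self-paced learning on synthetic 1-D linear regression.
import numpy as np

def spl_weights(losses, lam):
    """Hard self-paced weights: sample i is admitted only if loss_i < 1/lam."""
    return (losses < 1.0 / lam).astype(float)

def fit_spl(X, y, lam=0.5, growth=1.3, rounds=5):
    w = np.zeros(X.shape[1])
    for _ in range(rounds):
        losses = (X @ w - y) ** 2
        v = spl_weights(losses, lam)
        if v.sum() == 0:                      # no sample passes: admit the easiest
            v[np.argmin(losses)] = 1.0
        Xv = X * v[:, None]                   # weighted least squares on "easy" set
        w = np.linalg.solve(Xv.T @ X + 1e-6 * np.eye(X.shape[1]), Xv.T @ y)
        lam *= growth                         # anneal: admit harder samples next
    return w

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]
y = X @ np.array([1.0, 2.0]) + 0.01 * rng.normal(size=100)
w = fit_spl(X, y)
print(np.round(w, 2))   # recovers coefficients close to [1, 2]
```

Compared with curriculum learning, no fixed sample order is supplied: the loss values themselves decide the next round's training set.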
The main problem for explainability is to provide enough justification for an ML model so that non-professionals know why a conclusion was drawn, when a model will perform well or poorly, and under what conditions it can be trusted. Mainstream machine learning algorithms, especially deep learning models, are now ubiquitous and have been widely used in many fields: visual recognition, bioinformatics, scene analysis, etc. The good performance of these models is largely due to their strong approximation and estimation properties and a large number of training samples. An important obstacle, however, is interpretability: these models are often seen as black boxes, allowing little insight into how predictions are made. There does seem to be a trade-off between performance and interpretability, because the best-performing methods are often the least interpretable, while the most interpretable methods have long been regarded as not the best. Figure 1 qualitatively illustrates this trade-off between the performance and interpretability of a machine learning algorithm. When a researcher wants to ensure the accuracy of the model, model control is often considered, which may come from security, regulatory, or fairness considerations. In this case, the model is interpreted against these properties to check that the requirements are met and to ensure that the model can be deployed safely. In recent years, with the rapid development of deep learning, great progress has been made in various scene analysis tasks. However, the deep neural network in a deep learning model is often used as a "black box": the model only gives the final classification results, without providing an understandable explanation for its classifications and decisions [12].
Although many machine learning models have continuously appeared [7][8][9][10][11][12], robust scene analysis under an explainable model is still a very difficult issue. A big problem is the uninterpretability of deep models. As shown in Fig. 2, based on the interpretability of the model, we consider the recognition problem in a variety of scenarios; constructing a model that accommodates different scenes is not an easy task.
Therefore, in this paper, our research starts from the problems of difficult feature extraction of targets and behaviors in complex scenes, small labeled samples, unconstrained scenes, and model interpretability. By constructing a deep reinforcement network model based on transfer learning and a self-paced learning mechanism, traditional deep feature learning is further exploited from an explainable perspective, and the classification effect is improved by constructing a deep reinforcement network based on discriminative Fisher vectors. From the frontier of application and algorithmic innovation, this is clearly a challenge and a new attempt.
The main contributions of this paper are as follows. First, our work focuses on the robust construction of a smart scene analysis system that can provide explainable decisions for users without a strong background in Artificial Intelligence. We propose novel methods aimed at solving the model interpretability problem, mainly for neural network models, by enhancing feature importance and adding types of explanations. Second, based on the traditional manifold learning algorithm, a new feature learning process is constructed by means of multi-model modeling and a simply cascaded deep reinforcement learning network, so as to solve the problem of multi-modal data classification in complex scenes. The metric mechanism and cross-layer learning method of adjacent layers in deep learning are studied; a theoretical model and framework for sample selection and labeling based on a self-paced learning strategy are proposed, together with reinforcement learning combined with a transfer learning strategy. In the detection of targets and behaviors, existing deep network models are fully used to mine the feature information and deep knowledge of scene targets and behaviors. A hierarchical coarse-to-fine expression strategy cascade is adopted, coordinated with a multi-scale lightweight learning method, to complete the efficient detection of targets and behaviors in the scene. Based on the SPP algorithm, a supervised dimensionality reduction algorithm based on sparse representation and non-parametric discriminant analysis is proposed, improving the interpretability of the model. Last, in order to effectively recognize targets and behaviors in a complex environment, this paper proposes a behavior recognition framework based on multi-perspective deep transfer learning. Dense trajectories are used to describe behavior characteristics, and the self-paced learning strategy is adopted to address the small number of labeled samples and model interpretation.
A nonlinear model based on deep transfer learning is proposed to solve the difficulty of distinguishing perspective-related features from behavior-related features, further improving the interpretability of the model.
The remainder of this paper is organized as follows. Section 2 summarizes related work. Section 3 presents the constructed explainable machine learning models and corresponding learning algorithms. Experiments and discussion are given in Section 4. Finally, Section 5 concludes our contributions and outlines future research.

Related work
Explainable AI (XAI) refers to those Artificial Intelligence techniques aimed at explaining, to a given audience, the details or reasons by which a model produces its output [13]. To this end, XAI borrows concepts from philosophy, cognitive science and social psychology to yield a spectrum of methodological approaches that can provide explainable decisions for users without a strong background in Artificial Intelligence. XAI therefore aims to bridge the gap between the complexity of the model to be explained and the cognitive skills of the audience for which explainability is sought. Interdisciplinary XAI methods have so far embraced assorted elements from multiple disciplines, including signal processing, adversarial learning, visual analytics and cognitive modeling, to mention a few.
Although reported XAI advances have risen sharply in recent times, there is global consensus on the need for further study of the explainability of ML models. A major focus has been placed on XAI developments that involve the human in the loop and thereby become human-centric. This includes interpretable reasoning of models, neurosymbolic reasoning, systems based on fuzzy rules, etc.
This paper mainly verifies the interpretability of the model through three scenarios: crowd density estimation in surveillance scenes, face occlusion detection in front of ATMs, and abnormal crowd behavior recognition in video scenes.

Crowd density estimation
At present, the main methods for crowd density estimation are detection-based approaches and regression-based approaches [14][15][16][17][18]. Both have their advantages: detection-based methods work better when the crowd density is low, while regression-based methods work better when the crowd density is high. It is challenging to generate an accurate crowd distribution map, and one of the major difficulties is discretization: a person does not occupy only one pixel in the image, and the density map needs to maintain the continuity of local neighborhoods.
Other difficulties include the variety of scenes and camera angles. The main reason for the small size of crowd density estimation databases is the large amount of image annotation required: each head in a dense crowd must be marked. Popular CNN-based crowd density estimation methods [15][16][17][18] mainly adopt multi-scale network structures. Although good performance has been achieved, two problems remain: as the network gets deeper, it easily falls into local optima, and the decisions lack interpretability.
The application of deep learning to crowd counting has achieved substantial progress [18,19,20,22]. To accommodate multi-scale changes, most previous methods adopt multi-column CNNs [26,27,29,33] or multi-branch blocks to extract multiple features at different scales and finally fuse the obtained features into the final density estimation map, but several problems remain unsolved. 1) Existing methods use multi-column stacked convolutional networks to generate density maps and do not consider the relationship between feature-layer channels, only their accumulation in space. 2) Existing methods optimize the proposed models only with the traditional Euclidean loss, which has shortcomings of its own: it handles blurred images poorly and is highly sensitive to outliers. Convolution kernels of different sizes capture multi-scale spatial and temporal characteristics, and each sub-convolution module independently minimizes the regression loss at its own scale before the final fused density map is predicted. However, the multiple subnets are designed without collaborative regression, so the final fused density map is not optimal in a certain sense; the resulting density map is of low quality and produces blurry results. 3) Existing methods do not enforce the consistency of the generated density map across multiple scales; that is, the sum of the crowd counts of local patches does not necessarily equal the total count of the original image.
Early crowd density detection methods were mostly based on image processing and computer vision techniques [23,24,25,37,39]. With the development of related fields, computer technology can be used to analyze image data and obtain valuable information from images. As a result, crowd density estimation based on computer vision has gradually emerged [44,45,46,47].
Traditional methods mainly refer to approaches based on image processing and shallow classification or regression models [13,14,15,17,21]. The main steps of such methods generally include image acquisition, image preprocessing, feature extraction, feature analysis and classification, and calculation of the final crowd density map.
Among them, the most critical step is image feature extraction and analysis [67,68,69]. According to the extracted feature category, these methods can be divided into direct methods and indirect methods [70,71,72]. Direct methods detect human features and count them; common features include the Histogram of Oriented Gradients and Haar wavelet features. After the features are obtained, mainstream classifiers such as Radial Basis Function networks, Support Vector Machines and AdaBoost are adopted, and feature analysis is used to obtain a complete detection model. Indirect methods are regression-based: they generally construct a function mapping crowd-density-related features to crowd density levels through regression models.
Common regression models include Gaussian process regression, linear regression, and ridge regression.
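Such an indirect pipeline reduces to a few lines once the features are extracted. The sketch below fits ridge regression from two synthetic global features to crowd counts; the feature names (foreground ratio, edge density), the data, and the regularization strength are illustrative assumptions, not choices made in this paper.

```python
# Toy indirect (regression-based) crowd counter using ridge regression.
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge solution: (X^T X + alpha I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))          # [foreground ratio, edge density]
true_w = np.array([80.0, 40.0])         # synthetic feature-to-count mapping
y = X @ true_w + rng.normal(scale=1.0, size=200)

w = ridge_fit(X, y, alpha=0.1)
mae = np.abs(X @ w - y).mean()
print(round(float(mae), 2))             # small mean absolute counting error
```

Swapping `ridge_fit` for Gaussian process or plain linear regression changes only the fitting step; the feature-to-count mapping structure stays the same.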
Traditional methods often require proper image preprocessing according to the application scenario. The advantages and disadvantages of such methods are both obvious. The advantage is that regression methods are fast, and it is easier to conduct reasonable and effective algorithm research and improvement on the model. The disadvantage is that such methods require a large number of positive and negative training samples to obtain a good model [65,66], and their scope of application is relatively narrow: they can only be used in low-density scenes with sparse crowds, and are not suitable for scenes with dense crowds and severe occlusion, or scenes with complex and changeable backgrounds.
However, the relevant literature shows that crowd density estimation methods based on deep learning also present some defects in network structure. They neither grasp the relationship between sub-block fusion and the original image density map, nor consider the inherent connections between layers; the main problem is the lack of interpretability.

Face occlusion detection in front of ATM
With the rise of ATM-related crime, enhancing security through surveillance technology has been at the top of the agenda in academia and industry. Although cameras are usually installed in ATMs to capture images of the user's face, these recordings are mostly limited to supporting subsequent criminal investigation. Therefore, facial occlusion detection becomes very important for preventing ATM-related crimes. Traditional methods for this problem usually include localization, segmentation, feature extraction and recognition [16][17][18][19][20][21][22][23][24][25][26][27]. There are also some studies on intelligent ATM monitoring, but most of them are used to detect unconventional behaviors such as running, squatting, jumping, fighting, prying the ATM, or installing illegal devices in front of the ATM [16]. Xia et al. [21] proposed a robust and effective facial occlusion detection method based on convolutional neural networks and multi-task learning.
Compared with previous methods, the multi-task learning strategy improves the performance of the learning algorithm by jointly learning multiple task classifiers. However, different tasks in this algorithm need to be trained separately, so the implementation is complicated and training takes a long time. Chen et al. [26] introduced an anti-occlusion face detector that detects occluded faces and segments the occluded area simultaneously, and introduced an adversarial training strategy to detect occluded face regions. However, this algorithm has high requirements on background segmentation and high computational complexity, so it easily falls into local optima. Zhao et al. [27] proposed a robust autoencoder model based on LSTM, which can effectively detect partially occluded faces even in the wild. However, this algorithm is sensitive to complex backgrounds and is greatly influenced by head posture and movement.

Crowd abnormal behavior recognition
In recent years, detecting abnormal behavior in crowds has attracted considerable attention in the field of public safety [32,35]. Real-time monitoring of abnormal behaviors in a scene not only reduces the cost of human monitoring, but also allows emergencies to be handled in a timely manner. Given the difficulty of defining group anomalies, defining and analyzing groups of people is one of the biggest challenges in computer vision. Early efforts include the social force model for mathematical analysis and judgment [28] and the abnormal behavior prediction model based on trajectory judgment [31]; conventional detection methods are limited by occlusion between pedestrians, crowding, and low resolution, and ignore the interactions among people.
One approach computes a foreground motion-effect map for abnormal crowd behavior: it uses an adaptive Gaussian mixture model to segment the video frame into blocks and combines the obtained foreground regions to calculate the motion-effect map of the moving foreground blocks; however, the algorithm depends heavily on foreground segmentation and is therefore not robust. In view of the low accuracy and poor real-time performance of crowd monitoring in public places, Hu et al. [30] proposed a crowd abnormal behavior detection method based on motion saliency maps. Because optical flow is easily affected by noise, the false alarm rate of this algorithm is very high. Mousavi et al. [33] proposed a novel video descriptor, called the histogram of oriented trajectories, to identify abnormal conditions in crowded scenes. Instead of estimating motion vectors from just two consecutive frames, they divided the video sequence into spatio-temporal cuboids and collected crowd trajectories for statistics. This algorithm is also based on optical flow features, so its false alarm rate is relatively high. Rojas et al. [34] proposed a Gaussian mixture model (GMM) to simulate abnormal crowd behavior and fully consider its characteristics. This algorithm can automatically adapt to environmental changes and learn online, without tracking the crowd or requiring large-scale training data. However, this method only computes feature points in the region of interest, and suitable interest points are difficult to determine.

Explainable machine learning scheme
As machine learning becomes more widely used in more areas, understanding the reasons behind model decisions has become the trend for future model development, which also facilitates government legislation and project security.
At present, the explanations provided by different algorithms are fragmented and independent, which makes it difficult to determine reasonable decisions and explain model structures. In addition, in the design of interpretable classifiers, the selection of the optimal training set, correlation-based selection of heat maps, semantic analysis, visual interpretation of the model, and error analysis cannot be forcibly combined. Moreover, a text interpretation cannot be matched to the features of a particular layer of a deep network, and it lacks continuous interpretability. In general, the text explanation generated for a classification comes from training data based on model annotations. Up to now, data labels have been set manually and are very subjective, without taking the differences between different elements into account. Therefore, it is not possible to determine the region of the image most useful for classification. In most cases, experts are encouraged to use their attribute label data as interpretable evidence. Existing interpretable artificial intelligence models can provide the basis behind a classification. However, existing classification models have no mechanism for identifying potential misclassifications. Warning users about misclassification will help prevent errors from entering the system.
One of the reasons for misclassification is the reduced distance between classes. Some outliers or edge elements of a class can share common characteristics with adjacent classes. However, there is no mechanism to determine the number of subclasses of a given class, or whether it makes sense to merge closely related subclasses of two adjacent classes into a new class to achieve correct classification. Hagras [51] proposed a solution based on a database transaction model, whose explanation rests on logical structure or reasoning. Its static structure makes it unsuitable for deep network classifiers. In the constructed model, the system dynamically gives appropriate explanations from stored vocabularies, which in turn are generated from model learning. It provides a consistent view of models and interpretations beyond the scope of existing techniques.
Reference [52] proposed an interpretable model that is interactively validated through visual features and similarity. Moreover, k-means clustering was used to analyze similar features, so that the averaged features obtained were more robust with relatively low time complexity. However, it considers neither the importance or relevance of the model, nor clustering with respect to the output classes. Samek et al. [53] demonstrated model explanation mainly through correlation computation: by observing changes in the connection patterns of the network, the hidden layers are explained visually. However, the features learned by the network layers are not described in detail. Mao et al. [54] proposed an image interpretation and caption generation method based on visual features, which takes an image signature of fixed length 8000 to generate a caption. In this model, the correlation of features is first determined, and signature generation is carried out on this basis. Since eigenvalues can be of any length, strategies that follow highly correlated features are interpretable. As the power of interpretation becomes more important in intelligent decision-making, AI systems can no longer serve as black boxes. Decision makers relying on AI services have the right to know the reasons behind decisions so that they can better exploit their strengths. The main research contents are summarized in Figure 3, and the research of this paper is carried out according to these contents one by one.
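The averaging idea attributed to [52] above is easy to illustrate: cluster feature vectors with k-means and treat the cluster means as robust prototype features for explanation. The pure-NumPy sketch below uses synthetic two-dimensional features; the data, the deterministic seeding, and the choice of k are illustrative assumptions.

```python
# Pure-NumPy k-means producing "average feature" prototypes.
import numpy as np

def kmeans(X, k, iters=20):
    # deterministic seeding for the sketch: evenly spaced data points
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                          # nearest-center assignment
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)   # move center to cluster mean
    return centers, labels

rng = np.random.default_rng(2)
# two well-separated blobs standing in for per-class feature vectors
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
centers, labels = kmeans(X, 2)
print(np.round(sorted(centers[:, 0]), 1))   # prototype features near 0 and 3
```

Each cluster mean is more robust than any single example, which is the property [52] exploits when presenting prototypes as explanations.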

Explainable scene object detection model combining self-paced learning and deep reinforcement learning
The deep network model is designed to learn cross-modal characteristics. This paper proposes a new application of deep networks for multi-modal learning. In particular, this paper demonstrates cross-modal feature learning: if multiple modalities are present during feature learning, better features can be learned for a single modality (multimodal training, unimodal testing). In addition, the paper shows how to learn a feature shared between multiple modalities and evaluates it on a particular task in which the classifier is trained with only audio data but tested on video-only data. The multi-modal explainable deep network model is shown in Figure 4. This model consists of two streams, one for video information and the other for audio information. The structure of the two streams is identical, each consisting of eight layers (including the input layer). When only one modality is observed, the unobserved variables must be integrated out. Therefore, this paper proposes a deep autoencoder model to solve the above problems. Inspired by the denoising autoencoder, this paper proposes training a bimodal deep autoencoder (Figure 4) using an expanded (extended single-modality input) but noisy dataset. In fact, the model is still required to reconstruct both modalities when one modality uses zeros as input and the other uses its original values. Therefore, one third of the training data has only video input, one third has only audio input, and the last third has both video and audio. This model can be viewed as an instance of multi-task learning.
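The one-third/one-third/one-third augmentation described above can be sketched as a data-construction step: for each (video, audio) pair, emit a video-only input (audio zeroed), an audio-only input (video zeroed), and a full input, while the reconstruction target is always the complete pair. Shapes and names below are illustrative, not the paper's actual feature dimensions.

```python
# Build the modality-dropout training set for a bimodal autoencoder.
import numpy as np

def modality_dropout_set(V, A):
    """V: video features (n, dv); A: audio features (n, da).
    Returns 3n inputs (one per corruption pattern) and 3n full targets."""
    zeros_a = np.zeros_like(A)
    zeros_v = np.zeros_like(V)
    inputs = np.concatenate([
        np.hstack([V, zeros_a]),   # video-only: audio zeroed
        np.hstack([zeros_v, A]),   # audio-only: video zeroed
        np.hstack([V, A]),         # both modalities present
    ])
    targets = np.tile(np.hstack([V, A]), (3, 1))  # always reconstruct both
    return inputs, targets

V = np.ones((4, 3))          # 4 samples, 3-dim video features (illustrative)
A = 2 * np.ones((4, 2))      # 4 samples, 2-dim audio features (illustrative)
X, Y = modality_dropout_set(V, A)
print(X.shape, Y.shape)      # prints (12, 5) (12, 5)
```

Training an autoencoder on (X, Y) forces the shared hidden layer to reconstruct the missing modality from the observed one, which is the denoising-style objective described above.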
When designing the reinforcement strategy, we use a Q-network that interacts with its environment during the data generation phase. The system observes the current scene, which consists of audio and video frames, and takes actions using the ε-greedy strategy. The environment in turn provides scalar rewards. Interaction experiences are stored in a replay memory M, which preserves the N most recent experiences; these are then used to update the network parameters during the training phase. In the training stage, the network is trained with the data stored in replay memory M. Assume the hyperparameter n represents the number of experience replays; for each replay, a mini-batch B containing several interactions is randomly sampled from the finite-size replay memory M. The model is trained by sampling from batch B, and the parameters of the network are updated iteratively in the direction of the Bellman target. To avoid latency, this paper divides the algorithm into two stages: in the first stage, the robot collects data through limited-time interaction with humans; in the second stage, it enters a rest phase, during which the training phase is activated to train the multimodal deep Q-network.
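The two mechanisms above, a bounded replay memory M of the N most recent experiences and ε-greedy action selection, can be sketched with the standard data structures. The capacity, batch size, and toy Q-values are illustrative stand-ins for the multimodal Q-network.

```python
# Minimal replay memory and epsilon-greedy selection for a DQN-style agent.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # deque with maxlen keeps only the N most recent experiences
        self.buf = deque(maxlen=capacity)
    def push(self, state, action, reward, next_state):
        self.buf.append((state, action, reward, next_state))
    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)  # uniform mini-batch
    def __len__(self):
        return len(self.buf)

def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:                 # explore: random action
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

random.seed(0)
mem = ReplayMemory(capacity=100)
for t in range(150):                  # the oldest 50 experiences are discarded
    mem.push(t, epsilon_greedy([0.1, 0.9], 0.1), 1.0, t + 1)
batch = mem.sample(8)
print(len(mem), len(batch))           # prints 100 8
```

In the full algorithm, each sampled mini-batch would feed one gradient step toward the Bellman target; here only the data path is shown.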
In this paper, to regularize the model and make it sparse, each hidden-layer unit is given a regularization penalty on its expected activation. The form of this penalty is a key point to study, since it determines the sparsity of the hidden-layer units (whether each hidden unit is activated or not).
To prevent the non-convex optimization problem from falling into poor local solutions, a common approach is to train the model with multiple random initializations and then choose the best-performing initialization to construct the model. However, this approach is too ad hoc and its computational cost is too high. Self-paced learning is a better solution to such non-convex optimization problems. Curriculum learning simulates the human cognitive mechanism by first learning simple and universal knowledge structures and then gradually increasing the difficulty to learn more complex and specialized knowledge. Self-paced learning improves on curriculum learning: instead of assigning a sample learning order in advance from prior knowledge, the learning algorithm itself determines the next samples to learn in each iteration. In our model, a latent random vector represents the layer-i voting value of the n-th part. During training, the values of V and S are adjusted according to the feedback learning between layers, and these two quantities are the parts to be studied in the model design. In the deep model designed above, to increase the interpretability of the model, we let Y denote the detected window. From a probabilistic perspective, we obtain the data distribution p(y) of Y, given by formula (2), which can be calculated using mean-field theory. In addition, we design an optimization algorithm for adjacent layers, which requires training the parameters layer by layer.

Density regression
As mentioned in the introduction, most previous methods use an L2 loss to optimize their networks and generate predicted density maps. Most of them apply multi-column convolution kernels to generate density maps and then join the columns' maps to form the final density map. Suppose the forward computation along a path $i$ yields $S_i$; the loss over its full path can then be defined as the L2 distance to the ground-truth map:

$$\mathcal{L}_i = \frac{1}{2N}\sum_{k=1}^{N}\left\| S_i(x_k) - y_k \right\|_2^2$$

A loss defined this way has the following main problems:
1. First, most methods only use multi-column convolution kernels to generate density maps without considering the internal relationships of the network's feature layers; they merely extract features in the spatially stacked sub-networks. We have noticed that some of the information extracted inside a feature layer is useful while some is essentially useless, so blindly passing the whole extracted feature layer to the next layer degrades, in a certain sense, the semantic information extracted in the subsequent deeper layers; the generated density map then becomes comparatively blurry and unreliable.
2. Second, even when different convolution kernels are used to extract semantic information at different depths, the multi-column approach does not achieve genuine collaboration but works in a competitive manner. Each sub-network pursues the same goal, namely making the density map generated under its own convolution kernel closer to the real density map; since the features extracted under one kernel are not suitable for another, the multiple convolution kernels end up competing.
3. Third, population density estimation with multi-scale information lacks a cross-scale constraint. Because the input scale of each sub-network differs, previous methods do not consider the relationship between the generated density map and the original size, and we find that there is a measurable loss between the fused density map generated by the sub-networks and the density map at the original size.

Fig. 6 shows the structure of our generative adversarial network. In our method, the generator network G learns from input images of different scales and generates the corresponding density maps as an end-to-end mapping. Specifically, we use U-net as the encoder-decoder structure to build the generator. To handle scale changes, we use two generators of different sizes, G-large and G-small, which cooperate with each other.

Network architecture diagram
G-large extracts large-scale features of the target, while G-small focuses on its small-scale information. In the G-large generator, the encoder uses eight network layers, each containing a batch-normalization layer and a corresponding Leaky-ReLU activation layer; a feature self-learning layer is added among the eight convolutional layers of the encoder for better feature extraction. In the subsequent decoder, each deconvolution layer likewise has a batch-normalization layer and a ReLU activation layer (except the last layer). Skip connections are added during the deconvolution operation, connected after the self-learning of the corresponding convolutional-layer features; this also helps obtain a clearer contour map. G-small has a structure similar to G-large. The structures of the G-large and G-small generators are shown in Table 1; the input sizes are 720*720 and 200*200 respectively, and the ratio of input to output is kept consistent.

Table 1 (decoder layers):
  G-large layer 9:      3*3*64 deconv, stride 1   |   G-small layer 8:     3*3*64 deconv, stride 1
  G-large layers 10-15: 4*4*64 deconv, stride 2   |   G-small layers 9-13: 4*4*64 deconv, stride 2
  G-large layer 16:     6*6*1 deconv, stride 2    |   G-small layer 14:    4*4*1 deconv, stride 2

As Fig. 6 also illustrates, the discriminator takes the corresponding patch density maps (both generated and ground truth) as input; for patches of different sizes, the density maps they generate are consistent with the patch size.
The discriminator contains five convolutional layers, each followed by a batch-normalization layer and a corresponding Leaky-ReLU layer (except the last layer), serving as the feature-extraction stage. A tanh function is placed at the end of the deep network to map the output to a probability score in -1.0~1.0: a score close to 1.0 means the input is judged True, and a score close to -1.0 means it is judged False. As can be seen from Fig. 5, D-large and D-small share the same network structure.

Feature self-learning
A convolutional neural network is composed of several convolutional layers, non-linear layers, and down-sampling layers, so it can describe an image by capturing features over the whole receptive field. However, constructing a powerful network that captures better features is very hard, and the difficulty comes from many aspects.
Therefore, we hope to select better features across the different channels, i.e., to redirect the features to the subsequent convolutional layers so as to obtain better features. As shown in Fig. 7, the feature self-learning module mainly includes two steps [48]. First, the feature maps are squeezed along the spatial dimensions: each two-dimensional feature matrix is reduced to a single real number, so each real number has, in a sense, the global receptive field of its channel, and the output dimension matches the number of input channels. This is useful for characterizing the global distribution of the input sample and for giving layers close to the input a near-global receptive field. Specifically,

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

where $c$ indexes the $c$-th convolution kernel, H and W are the image resolution, and $i$ and $j$ index the pixels along the height and width of the image.
After obtaining the global descriptor from the first step, the next step is to capture the relationships between the channels. This capture stage must first be flexible, since the nonlinear relationships among the channels need to be determined; second, the learned relationships must not be mutually exclusive, allowing multi-channel features rather than a one-hot form. Therefore a sigmoid-style gating mechanism is used to generate a weight for each network channel, where the learned model parameters W reflect the correlations between the modeled layers and amplify the effective channel information:

$$s = \sigma\big(W_2\,\delta(W_1 z)\big)$$

where $\delta$ is the ReLU activation and $\sigma$ the sigmoid. Finally, the weights output by the gate mechanism are taken as the optimal channel-selection scheme, and each channel is re-weighted before being passed to the following layers, completing the channel-size restoration:

$$\tilde{u}_c = s_c \cdot u_c$$

To further optimize the model and improve its transformation capacity, a bottleneck of two fully connected layers is adopted: the first fully connected layer performs a dimensionality-reduction operation with a hyperparameter reduction ratio r, followed by ReLU activation, and the last fully connected layer restores the original dimension.
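The two steps above follow the squeeze-and-excitation pattern; here is a minimal forward-pass sketch with untrained, randomly initialized weights (the shapes, `W1`, and `W2` are illustrative placeholders, not the paper's trained parameters).

```python
import numpy as np

def feature_self_learning(U, W1, W2):
    """SE-style channel re-weighting sketch.  U: feature maps (C, H, W)."""
    # Step 1 (squeeze): global average pooling, z_c = mean over H x W.
    z = U.mean(axis=(1, 2))                      # shape (C,)
    # Step 2 (excitation): bottleneck FC -> ReLU -> FC -> sigmoid gate.
    h = np.maximum(W1 @ z, 0.0)                  # dimensionality reduction C -> C/r
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))          # per-channel weight in (0, 1)
    # Re-weight each channel before passing it to the next layer.
    return U * s[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
U = rng.normal(size=(C, H, W))
W1 = rng.normal(size=(C // r, C))                # hypothetical bottleneck weights
W2 = rng.normal(size=(C, C // r))
V = feature_self_learning(U, W1, W2)
```

Because the gate outputs lie strictly in (0, 1), every channel is attenuated rather than flipped or amplified beyond its input, which is what makes the re-weighting a soft channel-selection scheme.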

Cross-scale consistency estimation
As mentioned in the introduction, the traditional method uses L2-based regression to train a multi-scale network and finally forms a fused prediction density map; it was also explained there why regression based only on L2 makes the final density map a fuzzy estimate. To solve this problem and make the final density map clearer, we use an adversarial loss, which comes from the generative adversarial network. A generative adversarial network involves a generator model G and a discriminator model D, which play a minimax game: G is trained so that the images it generates deceive D, while D is trained to distinguish the synthesized images from the real ones, judging Fake when they are inconsistent and True when they are consistent [41]. In our method, this adversarial loss is defined as follows:

$$\mathcal{L}_A(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big]$$

where x denotes the training patch and y the corresponding ground-truth heat map. G aims to minimize this objective, while D tries to maximize it.
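As a numeric illustration of this objective (the `D_real`/`D_fake` arrays are stand-in discriminator outputs in (0, 1), not values from a trained model):

```python
import numpy as np

def adversarial_loss(D_real, D_fake, eps=1e-12):
    """L_A = E[log D(x, y)] + E[log(1 - D(x, G(x)))]."""
    return float(np.mean(np.log(D_real + eps)) +
                 np.mean(np.log(1.0 - D_fake + eps)))

# A confident discriminator pushes the objective toward its maximum (0);
# a fooled discriminator (scores near 0.5 everywhere) yields a lower value,
# which is exactly the direction the generator pushes toward.
good_D = adversarial_loss(np.array([0.99, 0.98]), np.array([0.01, 0.02]))
fooled_D = adversarial_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```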
Therefore, compared with the traditional regression loss, the advantages of the adversarial loss are as follows. The traditional pixel-wise Euclidean loss penalizes large deviations between pixels, so when facing sharp edges or outliers it blurs the feature map, and the generated density map becomes fuzzy in turn. The adversarial loss, by contrast, discards the pixel-wise large-deviation criterion: it makes a binary judgment for each pixel, either true or false.
Using this adversarial loss encourages the generated maps to match the distribution of the true values. In other words, if a generated picture tends to be blurred, the discriminator D will output a score tending toward -1.0, and this signal drives the generator to avoid blur and produce clearer pictures.
Because the adversarial loss lacks a penalty grounded directly in the real image, using it alone can sometimes lead to abnormal spatial structure. As suggested in previous work, we also use two common losses to smooth the solution. Details are as follows:
• Euclidean loss: our model also adopts the L2 loss between the estimated density map G(x) and the ground truth, approximating the ground truth in the L2 sense. Given a W×H image with C channels, we define the pixel-wise loss as

$$\mathcal{L}_E = \frac{1}{W \times H \times C} \sum_{c=1}^{C} \sum_{i=1}^{W} \sum_{j=1}^{H} \big\| G(x)^{(c)}_{i,j} - y^{(c)}_{i,j} \big\|_2^2$$

The overall first-stage loss can then be defined as

$$\mathcal{L} = \mathcal{L}_A + \lambda_e \mathcal{L}_E + \lambda_p \mathcal{L}_P$$

where λe and λp are the weights of the Euclidean loss and the perceptual loss. Following previous work, we set λe = λp = 150.

Cross-scale consistency constraints
As mentioned earlier, on the basis of the L2 loss, the perceptual loss is added so that the generated density map agrees better with the ground-truth density map, but the cross-scale problem still needs to be solved. We therefore use a cross-scale consistency regularizer to improve the robustness and generalization between the child-patch and parent-patch density maps; that is, this constraint reduces the residual error between the child and parent patches in population density estimation. As mentioned, the original methods did not notice that each sub-network works in its own specific way, and the sub-networks could not perform well cooperatively, making the resulting density maps prone to inconsistency. More specifically, as can be seen from Fig. 5, during training the patches are sent to G-large and G-small respectively: G-large yields the estimated heat map P-parent, and at the end of the G-small sub-network the four sub-maps P-child are spliced to form P-concat. The cross-scale consistency constraint acts between P-concat and P-parent. In general, given a W×H density map with C channels, the L2-based cross-scale consistency constraint is defined as

$$\mathcal{L}_C = \frac{1}{W \times H \times C} \sum_{c=1}^{C} \sum_{i=1}^{W} \sum_{j=1}^{H} \big\| P_{prt}^{(c)}(i,j) - P_{cnt}^{(c)}(i,j) \big\|_2^2$$

where $P_{prt}^{(c)}$ denotes the pixels of the parent-block heat map and $P_{cnt}^{(c)}$ the corresponding pixels of the spliced child-block density map, with C = 3. Continuously optimizing this constraint forces the gap between the parent-block and child-block density maps to shrink. We only care about the total count over the whole image, and this constraint specifically targets the multi-scale problem, so it can be applied quite generally.
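The splice-and-compare step might be sketched as follows (toy 4*4 single-channel maps with hypothetical values; the real maps are the 720*720 parent and 200*200 child patches):

```python
import numpy as np

def cross_scale_loss(P_prt, children):
    """Mean squared gap between the parent map and the 2x2 splice of the
    four child maps.  children: [top-left, top-right, bottom-left, bottom-right]."""
    tl, tr, bl, br = children
    P_cnt = np.block([[tl, tr], [bl, br]])       # splice children into P-concat
    assert P_cnt.shape == P_prt.shape
    return float(np.mean((P_prt - P_cnt) ** 2))

rng = np.random.default_rng(0)
P_prt = rng.random((4, 4))
# Perfectly consistent children: the four quadrants of the parent itself.
quads = [P_prt[:2, :2], P_prt[:2, 2:], P_prt[2:, :2], P_prt[2:, 2:]]
zero_gap = cross_scale_loss(P_prt, quads)
```

When the children exactly tile the parent the loss is zero; any disagreement between the two scales contributes a positive penalty, which is what forces G-large and G-small to cooperate.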
The ultimate goal is to combine the four loss functions above into our final loss:

$$\mathcal{L}_{final} = \mathcal{L}_A + \lambda_e \mathcal{L}_E + \lambda_p \mathcal{L}_P + \lambda_c \mathcal{L}_C$$

where λc is a predefined weight for the cross-scale consistency term.
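Assembling the final objective is then a weighted sum; in this sketch the term values are placeholders, and the λc default is illustrative (the text does not state its value):

```python
def final_loss(L_A, L_E, L_P, L_C, lam_e=150.0, lam_p=150.0, lam_c=1.0):
    """Weighted combination of adversarial, Euclidean, perceptual,
    and cross-scale consistency terms."""
    return L_A + lam_e * L_E + lam_p * L_P + lam_c * L_C

# With lam_c = 0 the cross-scale term vanishes, so the two scales
# are optimized independently of each other.
coupled = final_loss(0.1, 0.01, 0.01, 0.02)
decoupled = final_loss(0.1, 0.01, 0.01, 0.02, lam_c=0.0)
```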
It is worth noting that if λc is 0, the two generators are trained independently.

Data set
Our captured data. We set up cameras in the hallway of a teaching building on the campus of Shanghai Jiao Tong University and recorded 12 video clips of 2 minutes each. Students and teachers come in and out of the building, sometimes carrying schoolbags, sometimes in groups. The dataset consists of 28800 frames, with a cycle set to 300 frames.
UCF CC 50. The UCF CC 50 dataset [11] is a very representative benchmark containing 50 annotated pedestrian images covering various characters and scenes; the head count per image ranges from 94 to 4543. A sliding-window scheme (whose stride can be set, which changes the number of samples obtained) is used for data expansion, and a 50% split is used for cross-validation to evaluate the proposed method.
Shanghai-Tech. The Shanghai-Tech dataset, created by Zhang [14], contains 1198 annotated pedestrian images captured by street-view cameras and webcams. Our proposed model is trained and tested on this dataset. To increase the number of training samples, we resize the images to 720*720 resolution and crop 200*200 sub-blocks from them.

Fig. 8. Comparison between the predicted and real values during training. The first column shows the real images, the second the real heat maps, the third the prediction heat maps generated by the adversarial network without feature self-learning, and the fourth the prediction density maps generated by the adversarial network after feature self-learning is added.

Experiment details
In the algorithm module, the input is an image pair composed of the image and its corresponding density map.
For G-large, the original image is input, while for G-small the corresponding quarter-size image is used. The initial learning rate for updating the network parameters is 0.00005. The population suppression threshold, people-thr, is set to generate more data samples during the data-expansion phase: people-thr is 15 in UCF CC 50 and 0 in Shanghai_partA. A 200*200 sliding window is then used to randomly crop image blocks of this size to expand the current dataset. The model is trained for 500 epochs on both UCF CC 50 and Shanghai_partA, and the training and testing of our network are implemented on torch. As shown in Fig. 8, we can inspect the predicted heat map against the real heat map during training.
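The data-expansion step described here (random 200*200 crops, kept only when the crop's head count clears people-thr) might be sketched as follows; the function name and synthetic data are illustrative, not the paper's code.

```python
import numpy as np

def expand_dataset(image, density, crop=200, people_thr=15, n=20, seed=0):
    """Randomly crop (image, density) blocks; keep a crop only if its
    head count (sum of the density map) reaches people_thr."""
    rng = np.random.default_rng(seed)
    H, W = density.shape
    patches = []
    for _ in range(n):
        top = int(rng.integers(0, H - crop + 1))
        left = int(rng.integers(0, W - crop + 1))
        d = density[top:top + crop, left:left + crop]
        if d.sum() >= people_thr:            # population suppression threshold
            patches.append((image[top:top + crop, left:left + crop], d))
    return patches

rng = np.random.default_rng(1)
img = rng.random((720, 720))
den = rng.random((720, 720)) * 0.01          # synthetic density map, ~200 heads
patches = expand_dataset(img, den)
```

Setting `people_thr=0`, as done for Shanghai_partA, keeps every crop; a positive threshold, as in UCF CC 50, filters out nearly empty blocks.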
As can be seen from Fig. 8, the prediction density map obtained after adding feature self-learning is closer to the ground truth. Correspondingly, Fig. 9 and Fig. 10 below show the loss reduction and the accompanying change in MAE accuracy, confirming that features are described more accurately once feature self-learning is added. We also report the net-similarity performance in Fig. 11, and the performance of the whole system in Fig. 13: comparing the output of our constructed system against the ground truth, we conclude that the designed system is very stable. The remaining problems mainly derive from occlusions and the interactions of other targets.

Comparison with the latest technology
The proposed model is evaluated against several state-of-the-art models on our constructed dataset and two benchmarks; the results are given in Fig. 12 and Tables 2 and 3. Across all the tables, our method consistently outperforms the previous methods. Table 3 compares results on the Shanghai-Tech Part-A dataset, whose images are closer to real monitoring footage than those of the other datasets; our proposed SFL-GAN achieves considerable improvement over the existing techniques. Table 2 shows that the proposed method obtains the best MAE and a competitive MSE among the five latest methods on the UCF-CC-50 dataset:

Table 2 (UCF-CC-50):
  Method              MAE   MSE
  [14]                467   498
  Y. Zhang [27]       377   509
  Vishwanath A [37]   322   341
  D. B. Sam [33]      318   439
  SFL-GAN (ours)      290   443
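For reference, MAE and MSE in the crowd-counting literature are computed on per-image head counts, with MSE conventionally reported in root-mean-square form; a small sketch with made-up counts:

```python
import numpy as np

def mae_mse(pred_counts, true_counts):
    """MAE and (root-mean-square) MSE over per-image crowd counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    mae = float(np.mean(np.abs(pred - true)))
    mse = float(np.sqrt(np.mean((pred - true) ** 2)))
    return mae, mse

# Hypothetical predicted vs. ground-truth counts for three test images.
mae, mse = mae_mse([100, 210, 305], [110, 200, 300])
```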

Conclusion
In this work, we propose a crowd density prediction machine built on a self-learning generative adversarial network within the framework of soft computing; the proposed model is used to obtain more accurate data collection and to develop an intelligent prediction machine for crowd detection and counting. Considering the decentralized, secure, and trusted features of the IoT, an IoT generation algorithm for data under surveillance scenarios is designed. Moreover, to address the severe occlusion in crowds and the variety of crowd densities, a novel self-learning generative adversarial network model is first proposed; a feature self-learning layer is then added to the generator, and the adversarial loss is constructed in the following step to avoid blur in the generated images. Finally, to avoid excessive cross-scale loss, the cross-scale consistency criterion is used to optimize the final fused density map. The proposed model processes data from real video scenes, and the designed modules effectively improve the speed of the system, so all the steps operate efficiently and the system is highly effective.
In future research, we will pay more attention to improving the real-time performance of the algorithm and the effectiveness of IoT-based hardware, and to generating density maps closer to the ground truth in the density-map generation phase through more effective machine learning approaches.