Efficient crowd counting model using feature pyramid network and ResNeXt

Crowd counting is one of the most challenging problems in computer vision, with direct relevance to safety and security through surveillance systems. It has an extensive range of applications, such as disaster management, surveillance event detection, intelligence gathering and analysis, public safety control, traffic monitoring, design of public spaces, anomaly detection and military applications. Earlier approaches still encounter many issues, such as non-uniform density distribution, partial occlusion and discrepancies in scale and perspective. To address these problems, feature pyramid networks are introduced into deep convolutional networks for counting the individuals in a crowd. The designed network extracts features at all resolutions and is constructed rapidly from a single input image. The method outperforms well-known networks on three standard crowd counting datasets.

… and urbanization around the world has, by implication, led to larger crowds. Large social gatherings can be seen in enclosed areas, for example building corridors, airports and stadiums, and also in open areas such as walkways, parks, sports events, political rallies and public exhibitions. Figure 1 shows some pictures of such locations. The purpose of a gathering has a significant impact on the crowd's large-scale properties and behaviours. Consequently, the study of crowd dynamics and behaviours is a subject of great interest in many scientific fields, including psychology, sociology, public administration, security and computer vision.
Crowd disruption is a common cause of crowd disasters, arising from pushing, mass panic, crowd crushes and overall loss of control (Brockmann et al. 2014). Numerous tragedies illustrate this issue, such as the 2010 Water Festival stampede in Cambodia, where approximately 380 people died (Illiyas et al. 2013; Wang et al. 2013). Another well-known and much-studied crowd crush occurred at the 2010 Love Parade music festival in Germany, where 21 people died and more than 500 were injured (Krausz and Bauckhage 2012; Helbing and Mukerji 2012; Shah et al. 2007). Sample images of these two events are shown in Fig. 2. Some other destructive crowd scenarios are presented in Table 1. To prevent such deadly accidents, early automatic detection of critical and unusual situations in large-scale crowds, through intelligent visual surveillance of the area under observation, is widely studied by computer vision researchers (Shah et al. 2007; Hu et al. 2004). Intelligent visual surveillance involves accurate information processing and efficient data fusion, and requires far fewer human operators. It therefore has a great advantage over conventional CCTV technologies, which require a large number of human operators, and hence a higher human resource cost, to continually monitor surveillance cameras. Crowd analysis is one of the most difficult tasks in such intelligent visual surveillance systems. It may be used for automated detection of critical crowd levels, for detecting and counting people, and for identifying anomalies and alarming crowd behaviour. Moreover, it may be used for tracking a person or a group of people in a crowd (Aggarwal and Ryoo 2011). Counting crowd flow is an important video-frame analysis step in crowd analysis, since crowd density is one of the basic descriptors of crowd status. Automatic procedures for crowd density estimation and counting have received a lot of attention in security control and play an essential part in crowd management. They can also be used for estimating the comfort level of a crowd and for recognizing potential danger in order to prevent overcrowding disasters. In visual monitoring systems, crowd size is one of the important basic indicators for detecting threats such as rioting, violent protest, fighting and mass panic (Junior 2010; Dittrich et al. 2012; Chen et al. 2010).
Crowd estimation and analysis has a variety of important applications, some of which are as follows:
• Safety monitoring The use of video surveillance cameras for security and safety purposes in places such as sports stadiums, tourist attractions, shopping centres and airports has enabled easier monitoring of crowds in such environments.
• Disaster management Many scenarios involving crowd gatherings, for example sports events, music concerts, public demonstrations and political rallies, face the risk of crowd-related disasters, such as stampedes, that can be life-threatening. In such cases, crowd analysis can be used as an effective tool for early overcrowding detection and appropriate crowd management, and hence the possible avoidance of any disaster (Abdelghany et al. 2014; Almeida et al. 2013).
• Design of public spaces Crowd analysis of existing public spaces such as airports, train stations, shopping centres and other public buildings (Chow and Ng 2008) can reveal important design shortcomings from the perspective of crowd safety and convenience.
• Virtual environments Crowd analysis methods can be used to understand the underlying phenomena, thereby enabling us to establish mathematical models that provide accurate simulations. These mathematical models can further be used to simulate crowd phenomena for various applications, for example computer games, inserting visual effects in film scenes and designing evacuation plans (Gustafson et al. 2016; Perez et al. 2016).
• Forensic investigation Crowd analysis can also be used to search for suspects and victims in events such as bombings, shootings or accidents at large gatherings. Conventional face detection and recognition algorithms can be sped up using crowd analysis techniques, which are more adept at handling such scenarios (Barr et al. 2014).
• Strategic planning in defence In military and security applications, people counting plays a crucial role in analysing the crowd in order to take strategic decisions in wartime according to the crowd observed.
In this work, we design a deep network, named the ResNeXtFP network, with feature sharing to count people and estimate full-resolution density maps. Convolutional kernels of size 3 × 3 at the pyramid layers extract features while preserving resolution. Skip connections are used to integrate multi-scale semantics and to expand scale perception capacity with fewer parameters than multi-column models. The remainder of this paper is organized as follows: Sect. 2 briefly presents some recent techniques for crowd counting, followed by the details of our model and training procedure in Sect. 3. In Sect. 4, three public benchmark crowd datasets are described along with the performance measures and a comparison of results between state-of-the-art approaches and our own. Finally, Sect. 5 concludes the paper.

Review of literature
Based on their characteristics, the available methods for estimating crowd density and counting individuals are categorized into direct and indirect methods. Direct methods attempt to segment and detect every person in the crowd scene and then count them using a classifier. With this strategy, counting is feasible as long as individuals are segmented correctly, but the process becomes more complex when the crowd is very dense or occlusions occur. In indirect methods, counting is typically carried out using measurements of certain features together with learning algorithms or statistical analysis of the whole crowd. Indirect methods are further classified into pixel-based, texture-based and corner-point analysis techniques. In recent years, the success of convolutional neural networks (CNNs) in a variety of computer vision tasks has motivated researchers to use them for learning nonlinear functions from crowd images to their corresponding density maps or counts. A variety of CNN-based techniques has been introduced in the literature. These techniques are categorized by the properties of the network and by the training process. Based on the properties of the network, the approaches fall into the following classes: basic CNNs, scale-aware models, context-aware models and multi-task frameworks. Another categorization, based on the inference strategy, distinguishes patch-based inference from whole-image-based inference.
The use of CNNs for crowd density estimation was initiated by Wang et al. (2015) and Fu et al. (2015). Wang et al. proposed a regression model based on a deep CNN for counting individuals in images of very dense crowds. They adopted the architecture of Krizhevsky et al. (2012), replacing the last fully connected layer of 4096 neurons with a single-neuron layer that predicts the count. Moreover, to reduce false responses from backgrounds such as buildings and trees, the training data are augmented with additional negative samples whose ground-truth count is set to zero. In a different approach, Fu et al. proposed classifying the image into one of five classes (very high, high, medium, low and very low density) rather than estimating density maps. The multi-stage ConvNet of Sermanet et al. (2012) was adopted for better shift, scale and distortion invariance. In addition, they used a cascade of two classifiers to achieve boosting, in which the first specifically samples misclassified images, whereas the second reclassifies the misclassified instances. Zhang et al. (2015) analysed existing methods and found that their performance drops drastically when they are applied to a new scene that differs from the training data. To overcome this issue, they proposed learning a mapping from the input image to the crowd count. To this end, they train their network alternately on two related objectives: crowd counting and density estimation. Optimizing these two objectives alternately yields better local optima. To adapt the network to a new scene, the trained network is fine-tuned on training samples that are similar to the target scene. The main point to note in their approach is that the network is adapted to a new scene without any additional labelled data.
Estimating crowd counts remains a difficult task because of scale variation, non-uniform distribution and complex backgrounds. Zhang et al. (2019) proposed a multi-resolution attention convolutional neural network (MRA-CNN) to address crowd counting. Apart from the counting task, the authors used an additional density-level classification task during training and combined the features learned for the two tasks, thereby forming multi-scale, multi-contextual features to cope with scale variation and non-uniform distribution. In addition, they used a multi-resolution attention (MRA) model to generate score maps in which head regions receive higher scores, training the network to pay more attention to head regions and to suppress non-head regions regardless of complex backgrounds. In generating the score maps, atrous convolution layers are used to enlarge the receptive field with fewer parameters, thereby obtaining higher-level features and giving the MRA model more comprehensive information. The network was evaluated on the ShanghaiTech, WorldExpo'10 and UCF datasets to show its effectiveness.
Scale variation due to perspective distortion is still a challenge in crowd analysis. To tackle this issue, Ma et al. (2019) introduced an atrous convolutions spatial pyramid network (ACSPNet) that produces crowd counts and density maps for both sparse and congested scenes. Convolutions with successively increasing dilation rates are used to enlarge the receptive field while maintaining the resolution of the extracted features. Atrous spatial pyramid pooling (ASPP) is used to resample information at different scales and to incorporate global context. The authors evaluated ACSPNet on five crowd counting benchmark datasets and reported that their method achieves the lowest absolute and squared errors.
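To make the effect of increasing dilation rates concrete, the following minimal sketch computes the receptive field of a stack of stride-1 convolutions. The function name and the dilation rates (1, 2, 4) are illustrative assumptions and are not taken from ACSPNet.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d,
    while the output resolution is unchanged because no striding is used.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


# Three plain 3x3 convolutions versus three 3x3 convolutions with growing dilation.
plain = receptive_field(3, [1, 1, 1])
dilated = receptive_field(3, [1, 2, 4])
print(plain, dilated)
```

With the same number of layers and weights, the dilated stack sees a wider context than the plain stack, which is the property the paragraph above relies on.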
Applying crowd counting to aerial images on an embedded system is a difficult task because of high-resolution images, low computing resources and limited memory. To handle this issue, Chen et al. (2021) proposed an efficient deep learning model named Flounder-Net, in which a novel interleaved group convolution is used to eliminate redundancy in the network. Wang et al. (2020) proposed an efficient encoder-decoder framework, named MobileCount, which is specifically designed for high-accuracy real-time crowd counting on mobile or embedded devices with limited computation resources. For the encoder, MobileNetV2 is tailored to significantly reduce FLOPs at a small cost in performance; it combines four bottleneck blocks with a max pooling layer of stride 2. The design of the decoder is inspired by Light-weight RefineNet, which improves counting performance with a 10% increase in FLOPs. The proposed framework achieves counting performance comparable to existing methods with 1/10 of the FLOPs on various benchmarks. Finally, the authors proposed a multi-layer distillation method to further boost the performance of MobileCount without increasing its FLOPs.
Crowd counting has attracted considerable attention in computer vision. However, it is extremely difficult because of varying scales and densities. Many existing methods focus on improving multi-scale representation by using multi-column or multi-branch models with different kernel sizes. Nonetheless, those networks cannot produce feature maps with large receptive fields because of their limited depth. Moreover, the importance of using multi-level feature information in a deep network is overlooked. To address this, Zhu et al. (2021) proposed a multi-scale feature aggregation network (MFANet) for accurate and efficient crowd counting, which can be trained end to end. A core component of the network is the scale level aggregation module (SLAM), which extracts multi-scale features and uses multi-level feature information for more accurate estimation. The best performance was observed when six SLAMs are stacked together in the network. Experimental results show that MFANet achieves state-of-the-art performance in crowd counting and crowd localization.
Numerous CNN-based counting methods achieve strong performance. However, these methods focus only on the local appearance features of crowd scenes and overlook large-range pixel-wise contextual information and crowd attention information. To address these issues, Gao et al. (2019) introduced spatial-/channel-wise attention models into a conventional regression CNN to estimate the density map, an approach called "SCAR". It comprises two modules, namely a spatial-wise and a channel-wise attention model. Finally, the two kinds of attention information and the feature maps of the conventional CNN are integrated by a concatenation operation. The authors report that the proposed method achieves state-of-the-art results.
Nevertheless, existing CNN-based networks mainly focus on improving accuracy and seldom consider the simplicity of the network. In particular, they have the following limitations: (1) high computational complexity (Wen et al. 2019; Yang et al. 2020), (2) a very large number of parameters (Hossain et al. 2019; Fang et al. 2019), (3) a restriction to input images of fixed size (Zhang et al. 2015; Shen et al. 2018). These limitations restrict the applicability of the methods on embedded systems with limited memory and computational power, and limit the adaptability of the network to a variety of imaging hardware.

Proposed methodology
This section describes the proposed network, which applies the concept of the feature pyramid network on top of a ResNeXt backbone to count the individuals in a crowd.

Feature pyramid network
Recognizing objects at different scales is a common problem, particularly for small objects. We can use a pyramid of the same image at different scales to detect objects. However, processing images at multiple scales is time-consuming, and the memory requirement is too high for training on all scales at the same time. Alternatively, we can build a pyramid of features and use them for object detection. However, feature maps closer to the input layer are composed of low-level structures that are not effective for accurate object detection. The feature pyramid network (FPN) (Lin et al. 2017) is a feature extractor designed for this pyramid concept with accuracy and speed in mind. It generates multiple feature map layers with better quality information than the regular feature pyramid for object detection. FPN consists of a bottom-up and a top-down pathway. The bottom-up pathway is the usual convolutional network for feature extraction, that is, the feed-forward computation of the backbone network. It consists of a number of modules, each of which has several convolution layers. As we go up, the spatial resolution decreases, and as more high-level structures are detected, the semantic value of each layer increases. One pyramid level is defined for each stage. The output of the last layer of each stage is used as the reference set of feature maps for enriching the top-down pathway through lateral connections. In the top-down pathway, higher-resolution features are produced by up-sampling spatially coarser, but semantically stronger, feature maps from the upper levels of the pyramid. More specifically, the spatial resolution is up-sampled by a factor of 2 using nearest-neighbour interpolation. Each lateral connection merges feature maps of the same spatial size from the bottom-up and top-down pathways. An outline of FPN is shown in Fig. 3.
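One top-down step, as described above, can be sketched in plain Python on single-channel feature maps. This is a minimal illustration under stated assumptions: the function names and the toy values are hypothetical, and the convolutions that FPN applies around the merge are deliberately left out so that only the 2× nearest-neighbour up-sampling and the element-wise lateral merge remain.

```python
from typing import List

Grid = List[List[float]]  # one single-channel feature map: rows of values


def upsample_nearest_2x(fmap: Grid) -> Grid:
    """Double height and width by repeating every value (nearest neighbour)."""
    out: Grid = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                             # repeat each row
    return out


def merge_top_down(coarse: Grid, lateral: Grid) -> Grid:
    """Up-sample the coarser map and add the same-size bottom-up map to it."""
    up = upsample_nearest_2x(coarse)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("feature maps must have the same spatial size")
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


# Toy values: a 2x2 map from the upper level and a 4x4 map from the level below.
coarse = [[1.0, 2.0],
          [3.0, 4.0]]
lateral = [[0.5] * 4 for _ in range(4)]
merged = merge_top_down(coarse, lateral)
for row in merged:
    print(row)
```

The size check mirrors the requirement that a lateral connection only merges maps of the same spatial size.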

Proposed ResNeXtFP network architecture
The network designed in this paper for crowd counting is called the ResNeXt-based feature pyramid network (ResNeXtFPNet) and is shown in Fig. 4.
In this work, ResNeXt is used as the backbone network for FPN. We designed the model to be able to use more than one output, at different scales, from FPN for counting the individuals in the crowd. To accommodate this capability, the FPN is designed in such a way that each feature pyramid from FPN … As illustrated in Fig. 3, the backbone convolutional network is ResNeXt (Xie et al. 2017). The structure of ResNeXt closely mirrors that of ResNet (He et al. 2016). ResNeXt is an extension of the deep residual network that replaces the standard residual block with one that uses a "split-transform-merge" strategy adopted from the Inception models. In essence, rather than performing convolutions over the full feature map, the block's input is projected into a series of lower-dimensional representations, to each of which a few convolution filters are applied separately before the results are merged. A single ResNeXt block is illustrated in Fig. 5.
For every path, Conv1×1, Conv3×3 and Conv1×1 are applied in sequence. The internal dimension of every path is denoted d (here d = 4). The cardinality of the block, which is the number of paths, is denoted C (here C = 32). Summed over all paths, the Conv3×3 layers have a total width of 128 (i.e. d × C = 4 × 32). In each path, the dimension is expanded directly from 4 to 256; the path outputs are then added together and also added to the skip connection path. The number of parameters in a ResNeXt block is C × (256 × d + 3 × 3 × d × d + d × 256), with C = 32 and d = 4. The details of all ResNeXt blocks are given in Table 2.
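As a worked check of the block size, the sketch below evaluates the per-block weight count for C = 32 and d = 4 with 256 input and output channels. It counts convolution weights only (no biases or normalization parameters), and the function name is a hypothetical helper introduced for illustration.

```python
def resnext_block_params(cardinality: int, width: int, channels: int = 256) -> int:
    """Convolution weights in one ResNeXt block with `cardinality` parallel paths.

    Each path is Conv1x1 (channels -> width), Conv3x3 (width -> width) and
    Conv1x1 (width -> channels); biases and normalization are not counted.
    """
    reduce_1x1 = channels * width
    conv_3x3 = 3 * 3 * width * width
    expand_1x1 = width * channels
    return cardinality * (reduce_1x1 + conv_3x3 + expand_1x1)


total = resnext_block_params(cardinality=32, width=4)
print(total)  # 32 * (1024 + 144 + 1024)
```

Each path contributes 1024 + 144 + 1024 = 2192 weights, so the 32 paths together give roughly 70 thousand weights per block.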
In particular, FPN acts only on the feature activations output by the last residual block of each stage of ResNeXt, denoted {C2, C3, C4, C5}. Only feature maps of the same size can be combined, so the upper-level feature maps must be up-sampled before being merged with the lower-level feature maps. We therefore use nearest-neighbour interpolation, which effectively reduces the checkerboard effect found in deconvolution and simple interpolation approaches. In addition to a 1 × 1 convolution to reduce the dimensions and a 3 × 3 convolution to further extract the low-layer information, a ReLU layer is used to introduce nonlinearity between the two convolution layers. The high-level semantic features are then merged with the low-level semantic features through element-wise addition, and the merged feature map is obtained with a 3 × 3 convolution and two ReLU layers. This procedure is repeated until the finest-resolution feature map is produced. At the end, a set of multi-scale feature maps, one merged map per level, is produced and denoted {P2, P3, P4, P5}. Note that P5 is obtained from C5 with a 1 × 1 convolution and a 3 × 3 convolution. This whole process of merging the down-sampled and up-sampled features is shown in Fig. 3 as the 'merging unit (MU)'. The full procedure used in the MU is illustrated in Fig. 6.
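The order in which the pyramid levels are produced, and the size constraint on each merge, can be traced with a small shape-only sketch. It assumes the usual ResNet-style strides of 4, 8, 16 and 32 for C2 to C5, which the text does not state explicitly, and the helper names and the 256 × 256 input are hypothetical.

```python
from typing import Dict, List, Tuple

Size = Tuple[int, int]

# Assumed down-sampling factor of each backbone stage relative to the input.
STRIDES: Dict[str, int] = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def stage_sizes(height: int, width: int) -> Dict[str, Size]:
    """Spatial size of each backbone stage output for a given input size."""
    return {name: (height // s, width // s) for name, s in STRIDES.items()}


def top_down_plan(height: int, width: int) -> List[Tuple[str, Size]]:
    """List the pyramid levels in the order they are built, with their sizes."""
    sizes = stage_sizes(height, width)
    plan = [("P5", sizes["C5"])]          # P5 comes directly from C5
    previous = sizes["C5"]
    for level in (4, 3, 2):               # repeat the merge down to the finest level
        upsampled = (previous[0] * 2, previous[1] * 2)
        lateral = sizes[f"C{level}"]
        if upsampled != lateral:
            raise ValueError(f"P{level + 1} upsampled to {upsampled}, C{level} is {lateral}")
        plan.append((f"P{level}", lateral))
        previous = lateral
    return plan


for name, size in top_down_plan(256, 256):
    print(name, size)
```

An input whose sides are not multiples of 32 triggers the size check in this sketch, which is one reason practical implementations pad or resize inputs before the backbone.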

Experimental investigations
This section first describes the datasets considered for validating the proposed framework and the evaluation metrics used. It then compares the results of the proposed framework with those of other frameworks on three popular crowd counting benchmark datasets.

Datasets
The proposed framework is validated on two standard crowd counting datasets. The details of the datasets are as follows:
• ShanghaiTech dataset (Zhang et al. 2016) …
A summary of the datasets used is shown in Table 3.

Performance metrics
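Crowd counting models are commonly evaluated using the mean absolute error (MAE) and mean squared error (MSE) metrics. Following the convention of the crowd counting literature, MSE is reported as the square root of the mean of the squared errors:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| Z_i - \hat{Z}_i \right|$$

$$\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( Z_i - \hat{Z}_i \right)^2}$$

where $N$ is the number of test images, $Z_i$ is the ground-truth count of persons in the $i$th image and $\hat{Z}_i$ is the predicted count in the $i$th image. MAE indicates the accuracy of the estimate, and MSE indicates its robustness. Lower values are better for both MAE and MSE.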
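The two metrics can be computed as in the minimal sketch below. It assumes that MSE denotes the square root of the mean squared counting error, as is conventional in crowd counting work; the function names and the three ground-truth and predicted counts are made up for illustration and are not results from this paper.

```python
import math
from typing import Sequence


def mean_absolute_error(truth: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between ground-truth and predicted counts."""
    if not truth or len(truth) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(z - z_hat) for z, z_hat in zip(truth, predicted)) / len(truth)


def mean_squared_error(truth: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the average squared count difference (assumed convention)."""
    if not truth or len(truth) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((z - z_hat) ** 2 for z, z_hat in zip(truth, predicted))
    return math.sqrt(squared / len(truth))


# Hypothetical per-image counts for three test images.
ground_truth = [100, 250, 40]
predictions = [110, 240, 45]
print(round(mean_absolute_error(ground_truth, predictions), 3))
print(round(mean_squared_error(ground_truth, predictions), 3))
```

Because the errors are squared before averaging, a single badly miscounted image raises MSE more than MAE, which is why MSE is read as a robustness indicator.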

Discussion on the results
The performance of the proposed ResNeXtFP network is demonstrated on three challenging benchmark crowd counting datasets and compared with general CNN architectures. The proposed network is implemented in Python because of the wide availability of libraries and frameworks for deep learning. Keras and TensorFlow are used to build the FPN and the deep learning architectures. Experiments were run on a Dell PowerEdge R740 server with 2 × Intel Xeon Gold 6226R processors (2.9 GHz, 16 cores, 32 threads, 22 MB cache) and an NVIDIA Quadro RTX 6000 GPU with 24 GB of GDDR6 memory.
The proposed framework is first evaluated against earlier well-known networks on the ShanghaiTech dataset, where it is compared with four CNN-based networks. The experimental results are presented in Table 4 and Table 5 for the Part-A and Part-B datasets of the ShanghaiTech dataset, respectively. The results show that the proposed framework achieves the lowest MAE on both Part A and Part B compared to the other CNN-based frameworks. On the Part-A dataset, the existing multi-column CNN (Zhang et al. 2016) has 110.2 and 173.2 as MAE and MSE, respectively. The cascaded-MTL framework (Sindagi and Patel 2017) achieves 101.3 and 152.4 as MAE and MSE, respectively. The switching-CNN (Sam et al. 2017) obtains 90.4 and 135 as MAE and MSE, respectively. The proposed model performs better than CP-CNN (Sindagi and Patel 2017), which uses VGG-16 as its backbone and has 73.6 and … On the Part-B dataset, the proposed model performs better than CP-CNN (Sindagi and Patel 2017), which has 20.1 and 30.1 as MAE and MSE, respectively. Finally, the proposed ResNeXtFP network achieves 14.3 and 21.9 as MAE and MSE, which are better than those of the four existing frameworks.
The proposed framework is also compared with previous state-of-the-art networks on the UCF-CC-50 dataset, again against four CNN-based networks. The detailed results are given in Table 6. The results show that the proposed framework attains the lowest MAE compared to the other CNN-based frameworks. On the UCF-CC-50 dataset, the existing multi-column CNN (Zhang et al. 2016) has 377.6 and 509.1 as MAE and MSE, respectively. The cascaded-MTL framework (Sindagi and Patel 2017) achieves 322.8 and 397.9 as MAE and MSE, respectively. The switching-CNN (Sam et al. 2017) obtains 318.1 and 439.2 as MAE and MSE, respectively. The proposed model performs better than CP-CNN (Sindagi and Patel 2017), which has 295.8 and 320.9 as MAE and MSE, respectively. Finally, the proposed ResNeXtFP network achieves 269.6 and 312.3 as MAE and MSE, which are better than those of the four existing frameworks.
With respect to the three datasets, the proposed ResNeXtFP network outperforms some of the existing CNN-based networks with minimum MAE and MSE.

Conclusion
In this work, we introduce a feature pyramid network, named the ResNeXtFP network, for counting the individuals in medium- or high-density crowds visible in a still image. The convolutions of the backbone network are used to extract multi-scale features, creating density maps with unaltered resolution. By exploiting skip connections, our network can reduce redundant features and aggregate multi-scale information. The proposed network further uses features at various scales, taking global semantics into account. Results on three datasets show that the proposed ResNeXtFP network can achieve state-of-the-art performance. In future work, deep metric learning approaches may be considered to better distinguish heads from redundant background information.

Fig. 1 Images of various crowded scenes. a parade, b sports stadium, c musical concert, d political rally

Fig. 2 Pictures before the crowd crush at the Water Festival stampede and the Love Parade music event 2010

Fig. 6 The merging unit of FPN

Table 1
Some example destructive crowd scenarios

Table 2
Details of ResNeXt architecture

Table 3
Summary of the datasets used

Table 6
Comparison of MAE and MSE on UCF-CC-50 dataset