In this section, the proposed federated FED-AT-VIDEO net is presented. The framework consists of three phases: a video collection unit, federated training networks (cloud/server model), and a summarization methodology. In the first phase, videos from the multi-view cameras are collected and stored in the database. In the second phase, important distant objects are extracted by U-Net-based saliency methods and then used as inputs to the proposed federated gated-attention networks to extract secured, non-redundant features. Finally, the secured intelligent features are used for summarization in the cloud servers. The complete framework with its functioning phases is illustrated in Fig. 1.
3.1 Multi-View Video Collection Unit (MVVC):
This module captures multi-view videos from the different video sources installed in the industrial environment. For an efficient video collection unit, a Raspberry Pi Model 4 is used as the main processor interfaced with the cameras. Four Raspberry Pi (RPI) units interfaced with cameras are used in this experimentation; they collect video data from different angles and feed it over Wi-Fi to the cloud server, where segmentation, training, and summarization are deployed. The proposed MVVC performs all computations in real time at six frames per second (fps). Each video from the sources is stored in the server in tag-line ID databases, with every video frame differentiated by its source and time stamp.
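A minimal sketch of the capture-and-upload loop that one RPI node could run is given below; the camera index, server endpoint, and tag-line format are illustrative placeholders rather than the exact implementation.

```python
# Sketch of one RPI capture node; camera index, endpoint, and tag format
# are placeholders (assumptions), not the paper's exact implementation.
import time

import cv2        # OpenCV for frame capture
import requests   # simple HTTP transport to the cloud server

CAMERA_ID = "rpi_cam_01"                       # hypothetical source identifier
SERVER_URL = "http://cloud-server/upload"      # hypothetical cloud endpoint
FPS = 6                                        # section reports six frames per second

cap = cv2.VideoCapture(0)                      # camera interfaced with the RPI
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Tag-line ID: differentiate each frame by its source and time stamp.
    tag = f"{CAMERA_ID}_{time.time():.3f}"
    _, buf = cv2.imencode(".jpg", frame)
    requests.post(SERVER_URL, files={"frame": buf.tobytes()}, data={"tag": tag})
    time.sleep(1.0 / FPS)                      # throttle to the reported frame rate
cap.release()
```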
3.2 Cloud Server Model :
In this phase, the object saliency segmentation, federated learning, and summarization modules are deployed; a detailed description of each module is presented below.
3.2.1 Object Saliency Segmentation Process :
The MVVC's outputs are processed in this module. In MVS, object segmentation is a crucial step that determines the quality of the final summary. Since the cameras generate multiple videos containing humans and other uncertain activities, it is necessary to segment and detect the relevant objects, which aids better summarization. Recent studies have exploited many variants of convolutional neural networks for segmentation and object detection. To obtain better accuracy with a shorter response time, saliency-based optimized tuned U-Nets (SBTU-nets) are proposed in this research. The optimization and fine-tuning process with saliency extraction helps enhance segmentation accuracy even when the targets are distant from each camera. A detailed description of the proposed model is given in the following subsections.
3.2.1.1 SBTU-nets :
Saliency segmentation is used to segment sensitive information from the video frames and to separate important information from background regions. Saliency segmentation is normally performed by active contour methods [37]. Since these existing methods introduce considerable complexity, this research incorporates a fine-tuned U-Net for segmenting the salient objects from each frame, as shown in Fig. 2. The three primary components of the U-Net topology are “skip connections, up-sampling, and down-sampling”. U-Net is an encoder-decoder structure in which the encoder uses down-sampling while the decoder uses up-sampling. The encoder is in charge of extracting features; in other words, the image size is reduced by down-sampling and convolution to extract shallow features. The decoder, on the other hand, recovers deep global features using up-sampling and convolution layers. The deeper region carries more global features (contextual information), whereas the shallow region contains more local detailed information. Fusing these multiple features enhances the network performance. The U-Net model, which effectively captures object attributes, also utilises skip connections and a weighted cross-entropy loss. Since the U-Net's hyper-parameters can be adjusted, it is chosen as the base model for saliency segmentation.
In SBTU-nets, hyper-parameters such as the learning rate, number of epochs, input bias weights, and momentum are fine-tuned to segment distant objects from the multiple video views. For annotation and mask preparation, the MVS office datasets [38] are taken as references, and labelling is done with one-hot encoding. For effective segmentation, the encoder is constructed with three convolutional layers, two kernels, and five pooling layers, and the decoder is designed in a similar fashion. Residual connections are used as skip connections to accumulate additional features that aid effective segmentation. The fine-tuned hyper-parameters used for constructing SBTU-nets are listed in Table 2.
Table 2
Fine-tuned hyper-parameters used for constructing the SBTU-nets
| Sl. No | Hyper-parameter | Fine-tuned value |
| --- | --- | --- |
| 1 | No. of epochs | 1250 |
| 2 | No. of hidden layers | 200 |
| 3 | Learning rate | 0.0014 |
| 4 | Momentum | 0.1 |
| 5 | Batch size | 50 |
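A minimal sketch of how the Table 2 hyper-parameters could drive the fine-tuning of an encoder-decoder segmentation network is shown below (in PyTorch). The layer widths, the SGD optimizer, and the stand-in tensors are assumptions; the actual SBTU-net uses the encoder/decoder configuration described above.

```python
# Minimal encoder-decoder sketch in the spirit of SBTU-nets, driven by the
# Table 2 hyper-parameters. Channel widths and the stand-in data are
# illustrative assumptions, not the paper's exact architecture.
import torch
from torch import nn

class MiniUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                       # shallow, local detail features
        e2 = self.enc2(self.pool(e1))           # deeper, contextual features
        d1 = self.up(e2)
        d1 = self.dec1(torch.cat([d1, e1], 1))  # skip connection fuses both
        return self.head(d1)

# Fine-tuned hyper-parameters from Table 2.
EPOCHS, LR, MOMENTUM, BATCH = 1250, 0.0014, 0.1, 50
model = MiniUNet()
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM)
criterion = nn.CrossEntropyLoss()               # weighted cross entropy in the paper

frames = torch.randn(BATCH, 3, 64, 64)          # stand-in for MVS video frames
masks = torch.randint(0, 2, (BATCH, 64, 64))    # stand-in one-hot-derived labels
for _ in range(1):                              # one illustrative step, not 1250 epochs
    optimizer.zero_grad()
    loss = criterion(model(frames), masks)
    loss.backward()
    optimizer.step()
```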
3.3 Deep Non-Overlapping Feature Extraction:
This section demonstrates the importance of the non-overlapping feature extraction mechanism used in the proposed framework. To investigate deeper feature extraction, this research uses capsule networks and bi-gated attention networks (GAN) to exploit spatio-temporal feature extraction. In previous studies [39], convolutional neural networks (CNN) became popular for learning visual representations of images. These CNNs, however, have fundamental design flaws that negatively impact their performance on key tasks. A CNN automatically extracts characteristics from images and uses these features to learn to recognize various visual objects. Simple features such as edges are extracted by the early layers, and the learned representations become more complex as the number of layers increases. Finally, the CNN predicts the visual objects using all retrieved information. During this process, the spatial information recorded in the object is not taken into account by the CNN; it only records the characteristics that are visible. A significant flaw of the CNN design is therefore that it loses the spatial information provided by the MVS. Given the need for more precision in the raw input data, the CNN must make its own adjustments to obtain greater accuracy. To improve feature extraction and address the issues outlined above, capsule networks are used in place of the CNN's pooling layers. A capsule network is employed in this study for more effective extraction of spatial features. The recently proposed capsule network [40] was put forward to alleviate this CNN constraint. Capsules are neuronal clusters that store spatial data and the likelihood that an entity is present. Each entity in an image has its own capsule in the capsule network, and each capsule provides:
- An entity's probability of existence
- The instantiation parameters of the entity
A low capsule layer, a high capsule layer, and a classification layer make up the three layers of the capsule network. An optimized dynamic routing technique is utilized to update the parameters iteratively, while global parameter sharing is employed to decrease the accumulation of errors. Multiplying the input vectors by the weight matrix encodes the crucial spatial relationship between the low-level and high-level convolutional features within the image.
$$Y\left(i,j\right)={W}_{i,j}\,U\left(i,j\right)*{S}_{j} \quad \left(1\right)$$
The present status of capsules and which upper level capsules receive their output are determined by adding the weighted input vectors.
$$S\left(j\right)={\sum }_{i}Y\left(i,j\right)*D\left(j\right) \quad \left(2\right)$$
The squash function is then utilized to apply non-linearity. The squashing function preserves a vector's direction while compressing its length to lie between zero and one.
$$G\left(j\right)=squash\left(S\left(j\right)\right) \quad \left(3\right)$$
Capsule networks gather data from several locations and use Eq. (1) to determine the link between characteristics for efficient spatial information extraction. Convolution layers are used in the low and high capsule regions, respectively. The number of convolutional layers applied to the lower and primary capsule areas is shown in Table 3. Eq. (2) is used to determine the output weights, which are then sent to the high capsule area. The squash function in Eq. (3), in turn, compresses the vector's length into the range (0, 1) while preserving the vector's original orientation. In the next stage, the dot product between related capsules and outputs is incorporated into the proposed model using optimized dynamic routing. The feature maps are formed by this procedure, which iteratively updates the network weights. The dimension of the feature maps is then reduced using fully connected layers and lowered further using flatten layers to create one-dimensional feature maps. Eq. (4) gives the mathematical expression for these characteristics.
$$Z=F\left(G\left(j\right),P\right) \quad \left(4\right)$$
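The following sketch illustrates Eqs. (1)–(3): prediction vectors from the low-level capsules, their weighted sum, and the squash non-linearity. The capsule counts, dimensions, and uniform coupling coefficients are illustrative assumptions; the proposed model learns these couplings through optimized dynamic routing.

```python
# Hedged sketch of Eqs. (1)-(3): prediction vectors, their weighted sum, and
# the squash non-linearity. Dimensions and the single routing pass are
# illustrative; the paper uses an optimized dynamic-routing procedure.
import torch

def squash(s, dim=-1, eps=1e-8):
    # Eq. (3): keep the vector's direction, compress its length into (0, 1).
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

n_low, n_high, d_low, d_high = 8, 4, 6, 10      # assumed capsule counts/sizes
U = torch.randn(n_low, d_low)                   # low-level capsule outputs
W = torch.randn(n_low, n_high, d_low, d_high)   # transformation weights W_{i,j}

# Eq. (1): prediction vectors Y(i, j) from low-level capsules and weights.
Y = torch.einsum("id,ijdh->ijh", U, W)

# Eq. (2): weighted sum over low-level capsules (uniform coupling here).
D = torch.full((n_low, n_high), 1.0 / n_high)   # stand-in coupling coefficients
S = torch.einsum("ijh,ij->jh", Y, D)

# Eq. (3): squash to obtain the high-level capsule outputs G(j).
G = squash(S)
print(G.norm(dim=-1))                           # lengths lie in (0, 1)
```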
As discussed above, the activation function used in the pre-convolution block connected to the capsule network is LeakyReLU, which strengthens the local features and aids better extraction of spatial features from the multi-view video frames. Table 3 gives the parameters used to model the pre-convolutional blocks.
Table 3
Parameters used for modelling the Pre-convolutional Blocks.
| Sl. No | Layer | Stride | Kernel/pool size |
| --- | --- | --- | --- |
| 1 | Convol(2d)_Layer_1 | 1 | 3x3 |
| 2 | Max_Pooling_layers_1 | — | 2x2 |
| 3 | Convol(2d)_Layer_2 | 1 | 3x3 |
| 4 | Max_Pooling_layers_2 | — | 2x2 |
| 5 | Activation function | — | LeakyReLU |
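A sketch of the pre-convolution block in Table 3 is given below: two 3x3 convolutions with stride 1, each followed by a 2x2 max-pooling layer, with LeakyReLU activations. The channel widths are assumptions not specified in the table.

```python
# Sketch of the pre-convolution block in Table 3. Channel widths (3 -> 32 -> 64)
# are assumptions; layer types, kernel sizes, and strides follow the table.
import torch
from torch import nn

pre_conv_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),    # Convol(2d)_Layer_1
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=2),                             # Max_Pooling_layers_1
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),   # Convol(2d)_Layer_2
    nn.LeakyReLU(),
    nn.MaxPool2d(kernel_size=2),                             # Max_Pooling_layers_2
)

frame = torch.randn(1, 3, 128, 128)          # stand-in segmented video frame
local_features = pre_conv_block(frame)       # local features fed to the capsule layers
print(local_features.shape)                  # torch.Size([1, 64, 32, 32])
```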
3.4 Gated Recurrent Networks
Long short-term memory (LSTM) has a particularly interesting variant, the GRU, which is shown in Fig. 3. This design merges the forget gate and input gate into a single update gate [41]. In addition to supporting extended memories, this network also captures long-term patterns. Compared to the LSTM network, the complexity is significantly reduced.
The following equations, coined by Chung et al., represent the characteristics of the GRU:
$${h}_{t}=\left(1-{z}_{t}\right)\odot {h}_{t-1}+{z}_{t}\odot \stackrel{\sim}{{h}_{t}} \quad \left(5\right)$$
$$\stackrel{\sim}{{h}_{t}}=g\left({W}_{h}{x}_{t}+{U}_{h}\left({r}_{t}\odot {h}_{t-1}\right)+{b}_{h}\right) \quad \left(6\right)$$
$${z}_{t}=\sigma \left({W}_{z}{x}_{t}+{U}_{z}{h}_{t-1}+{b}_{z}\right) \quad \left(7\right)$$
$${r}_{t}=\sigma \left({W}_{r}{x}_{t}+{U}_{r}{h}_{t-1}+{b}_{r}\right) \quad \left(8\right)$$
The overall GRU characteristic equation is represented by
$$P=GRU\left(\sum _{t=1}^{n}\left[{x}_{t},{h}_{t},{z}_{t},{r}_{t}\left(W\left(t\right),B\left(t\right),\eta \left(tanh\right)\right)\right]\right) \quad \left(9\right)$$
where \(W\left(t\right)\) stands for the weights and \(B\left(t\right)\) for the bias weights at the current instant; \({x}_{t}\) is the input feature at the current state, \({z}_{t}\) the update (output) state, \({r}_{t}\) the reset state, and \({h}_{t}\) the module's output at the current instant.
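A single GRU step following Eqs. (5)–(8) can be sketched as follows; the weight shapes and random inputs are illustrative only.

```python
# Hedged sketch of one GRU step following Eqs. (5)-(8); weight shapes and the
# random input sequence are illustrative assumptions.
import torch

d_in, d_hid = 4, 3
W_h, W_z, W_r = (torch.randn(d_hid, d_in) for _ in range(3))
U_h, U_z, U_r = (torch.randn(d_hid, d_hid) for _ in range(3))
b_h, b_z, b_r = (torch.zeros(d_hid) for _ in range(3))

def gru_step(x_t, h_prev):
    z_t = torch.sigmoid(W_z @ x_t + U_z @ h_prev + b_z)          # Eq. (7): update gate
    r_t = torch.sigmoid(W_r @ x_t + U_r @ h_prev + b_r)          # Eq. (8): reset gate
    h_cand = torch.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)  # Eq. (6): candidate state
    return (1 - z_t) * h_prev + z_t * h_cand                     # Eq. (5): new hidden state

h = torch.zeros(d_hid)
for x in torch.randn(5, d_in):      # a short sequence of input feature vectors
    h = gru_step(x, h)
print(h)
```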
3.5 Self Attention Maps:
The attention mechanism was first presented in 2014 in the context of sequence-to-sequence architectures [42]. In most recent research, attention layers are included to suppress redundant characteristics, which helps toward accurate classification. For each input sequence, the three vectors “Q, K, and V” are created as part of the self-attention mechanism, also referred to as the intra-attention mechanism; each layer's input sequence is thereby turned into an output sequence. In other words, it is a method that uses scaled dot products to map the query against the collection of key-value pairs. The dot product for self-attention is calculated using the following formula.
$$F\left(K,Q\right)=\frac{Q{K}^{T}}{\sqrt{{d}_{K}}} \quad \left(10\right)$$
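A minimal sketch of the scaled dot-product self-attention in Eq. (10), with the softmax weighting applied to the value vectors V, is shown below; the sequence length and dimensions are assumptions.

```python
# Minimal sketch of Eq. (10): scaled dot-product self-attention, followed by
# softmax weighting of the value vectors V. Shapes are illustrative.
import torch

seq_len, d_k = 6, 8
Q = torch.randn(seq_len, d_k)        # queries
K = torch.randn(seq_len, d_k)        # keys
V = torch.randn(seq_len, d_k)        # values

scores = Q @ K.T / (d_k ** 0.5)      # Eq. (10): scaled dot product
weights = torch.softmax(scores, dim=-1)
attended = weights @ V               # attention-weighted features
print(attended.shape)                # torch.Size([6, 8])
```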
3.6 Self-Attention (SA) Evoked Bi-Gated Recurrent Networks :
The GRU networks are built as a BiGRU network, consisting of forward and backward GRUs, to extract the temporal information from the input feature sequences. Eq. (13) describes the mathematical properties of the BiGRU network. The BiGRU extracts the temporal characteristics that comprise the variety of information needed for activity identification. However, these elements span a wide range of data that can lengthen the training period and cause high latency during identification. SA layers are therefore incorporated between the BiGRU network and the classification layer to reduce this detection overhead. Eq. (10), which generates attention features from the BiGRU features, is applied before the softmax layer, whose output is then sent to the feed-forward classification layers.
$$P\left(F\right)=GRU\left(\sum _{t=1}^{n}\left[{x}_{t},{h}_{t},{z}_{t},{r}_{t}\left(W\left(t\right),B\left(t\right),\eta \left(tanh\right)\right)\right]\right) \quad \left(11\right)$$
$$P\left(B\right)=GRU\left(\sum _{t=1}^{n}\left[{x}_{t},{h}_{t},{z}_{t},{r}_{t}\left(W\left(t\right),B\left(t\right),\eta \left(tanh\right)\right)\right]\right) \quad \left(12\right)$$
Combining Eqs. (11) and (12):
$$P\left(BiGRU\right)=P\left(F\right)+P\left(B\right) \quad \left(13\right)$$
The integrated SA with BiGRU feature extraction is expressed as follows:
$$Y=Softmax\left(P\left(BiGRU\right),F\left(K,Q\right)\right) \quad \left(14\right)$$
These features are then fed as input to the training network, which can differentiate the different objects in the videos and thereby aid video summarization. Table 4 lists the parameters used for constructing the Capsule-GAN framework.
Table 4
Parameters used for constructing the Capsule-GAN frameworks.
| Sl. No | Hyper-parameter | Fine-tuned value |
| --- | --- | --- |
| 1 | No. of GRU cells | 310 |
| 2 | No. of epochs | 700 |
| 3 | Learning rate | 0.0014 |
| 4 | Training/testing/validation split (%) | 70/20/10 |
| 5 | Optimization algorithm | ADAM |
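A sketch combining a bidirectional GRU, the self-attention of Eq. (10), and a softmax head in the spirit of Eqs. (11)–(14) is given below. The 310 GRU cells follow Table 4; the input feature size and the number of activity classes are assumptions.

```python
# Sketch of the SA-evoked BiGRU of Eqs. (11)-(14): a bidirectional GRU,
# scaled dot-product self-attention over its outputs, and a softmax head.
# 310 GRU cells follow Table 4; input size and class count are assumptions.
import torch
from torch import nn

class SABiGRU(nn.Module):
    def __init__(self, d_in=64, d_hid=310, n_classes=5):
        super().__init__()
        self.bigru = nn.GRU(d_in, d_hid, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * d_hid, n_classes)

    def forward(self, x):
        h, _ = self.bigru(x)                           # Eqs. (11)-(13): forward + backward features
        d_k = h.size(-1)
        scores = h @ h.transpose(1, 2) / (d_k ** 0.5)  # Eq. (10) applied to the BiGRU features
        attended = torch.softmax(scores, dim=-1) @ h   # attention-weighted temporal features
        return torch.softmax(self.fc(attended.mean(dim=1)), dim=-1)  # Eq. (14): softmax output

frame_feats = torch.randn(2, 30, 64)    # [batch, frames, spatial features] stand-in
print(SABiGRU()(frame_feats).shape)     # torch.Size([2, 5])
```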
Since this deep spatio-temporal feature extraction consumes considerable computational resources, the proposed framework incorporates federated training on the MV data, which saves server time and resources while maintaining the privacy of the MV data.
3.7 Federated Learning for the Proposed Feature Extraction:
According to the literature, federated learning is a distributed machine learning method in which a number of clients train a single global model with their local data under the control of a central server or cloud. Each client shares only its model update with the central server, in contrast to typical centralized learning systems, which sacrifice privacy and require costly computation; this improves training performance. In this study, the local models are trained on NVIDIA edge servers, and the local and global models are finalized using the servers and the cloud. The federated learning in the proposed framework (Algorithm 1) is carried out using the following steps:
- Selection of edge clients: The server must inform all client nodes engaged in model training. The capacity of each edge client is chosen based on its data types, model-training ability, and data-processing capability.
- Distribution of the model: After the client nodes have been chosen, the server sends the initial weights to the edge clients to enable distributed learning. In the proposed framework, the input bias weights (\({\varvec{W}}^{\varvec{u}}\)) of the classifiers are distributed among the edge client nodes.
- Distributed learning: Each client trains the model locally with its own data and sends the updated local model to the central server.
- Model aggregation: The central server creates the new global model by averaging all the updated weights (\({\varvec{W}}^{\varvec{l}}\)) from the client nodes. In this study, an ADAM-based model is applied to obtain the global model.
- Model testing: The finalized global model is tested on the datasets and its testing performance is determined on the central server.
- Update of the shared model: The server uses the aggregated models from the client edge nodes to update the shared model. Mathematically, the revised weights are updated as
$$\beta ={\varvec{W}}^{\varvec{u}}-{\varvec{W}}^{\varvec{l}} \quad \left(15\right)$$
where \({\varvec{W}}^{\varvec{u}}\) and \({\varvec{W}}^{\varvec{l}}\) denote the distributed weights of the global model and the local model, respectively.
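The federated round described above can be sketched as follows: the server distributes the global weights \({\varvec{W}}^{\varvec{u}}\), each edge client trains locally to produce \({\varvec{W}}^{\varvec{l}}\), and the server aggregates the client updates. Plain weight averaging is shown for brevity; the proposed framework applies an ADAM-based aggregation on the server, and the client model here is a stand-in classifier.

```python
# Hedged sketch of one federated round: distribute W^u, train locally to get
# W^l, then aggregate on the server. Simple averaging shown; the paper uses
# an ADAM-based aggregation, and the model/data here are stand-ins.
import copy

import torch
from torch import nn

def local_update(global_model, data, targets, epochs=1, lr=1e-3):
    model = copy.deepcopy(global_model)             # start from the distributed W^u
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()
    return model.state_dict()                       # local weights W^l

def aggregate(global_model, client_states):
    # Average the clients' updated weights into the new global model.
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:
        for state in client_states[1:]:
            new_state[key] = new_state[key] + state[key]
        new_state[key] = new_state[key] / len(client_states)
    global_model.load_state_dict(new_state)

global_model = nn.Linear(16, 4)                     # stand-in for the shared classifier
clients = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]

for round_idx in range(3):                          # a few federated rounds
    states = [local_update(global_model, x, y) for x, y in clients]
    aggregate(global_model, states)                 # updated shared model; Eq. (15) gives the delta
```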
3.8 Federated Summarization Technique :
The final summary of the MV is generated at the cloud server station. Since self-attention maps are used for generating the features and federated learning is used for classifying the different activities, the frames are grouped according to their activities and a summary is generated from the predicted outputs of the learning models. Figure 4 illustrates the federated learning for MVS. Once an activity is predicted, its frames are clustered and identified by their own tag IDs. Hence, video summarization is made complexity-free by integrating self-attention and the federated learning process.
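A minimal sketch of this summary-assembly step is given below: frames tagged with source/time IDs are grouped by the activity predicted by the federated model, and representative frames are drawn from each group. The tag values and the selection rule (first frames per group) are illustrative assumptions.

```python
# Minimal sketch of summary assembly: group tagged frames by predicted
# activity and pick representative frames per group. Tags, activities, and
# the selection rule are illustrative assumptions.
from collections import defaultdict

# (tag ID, predicted activity) pairs as produced by the cloud-side model.
predictions = [
    ("rpi_cam_01_1709.120", "walking"),
    ("rpi_cam_02_1709.121", "walking"),
    ("rpi_cam_01_1712.450", "loading"),
    ("rpi_cam_03_1713.002", "loading"),
]

groups = defaultdict(list)
for tag, activity in predictions:
    groups[activity].append(tag)            # cluster frames by predicted activity

summary = {activity: tags[:2] for activity, tags in groups.items()}
print(summary)                              # per-activity frame tags for the final summary
```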