3.1 Data convolutional neural network algorithm design
The convolutional layer consumes the majority of the computational time and computing resources, so analyzing the characteristics of the convolution computation is key to accelerating convolutional neural networks. The essence of convolution is the multiply-accumulate (MAC) operation, which lends itself to parallel computation. In the convolutional layer, multiple input feature maps are convolved with multiple convolutional kernels to produce multiple output feature maps, so several forms of parallelism are available.
When the parallelism of the convolution computation is n, n multiply-accumulate operations can be performed in a single clock cycle [15]. Figure 1 shows the structure of parallel computation inside 3 convolution windows, where the products of the convolution kernel and the corresponding elements in the convolution window are computed in one clock cycle. Here, \(X_{n,i}\left( {i \in N} \right)\) denotes the neuron in row n and column i of the input feature map, and \({W_0}\sim {W_8}\) denote the weights of the corresponding points in the convolution kernel. The neurons in the nth row are computed to obtain \({X_{n,0}} \times {W_0},{X_{n,1}} \times {W_1},{X_{n,2}} \times {W_2}\), and this process is repeated for the other rows. The results from all rows are summed to obtain the partial sum of a single convolution window, which realizes parallel computation within the window.
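To make the row-parallel scheme concrete, the following minimal sketch (in Python/NumPy, our illustration rather than the paper's hardware implementation) models one window: each "cycle" multiplies one kernel row with the matching window row, and the row results are accumulated into the window's partial sum.

```python
import numpy as np

def window_partial_sum(feature_map, kernel, i, j):
    """Software model of in-window parallelism: per clock cycle, one window
    row is multiplied element-wise with the matching kernel row (k parallel
    MACs in hardware), and the row results are accumulated."""
    k = kernel.shape[0]                       # kernel is k x k, e.g., 3 x 3
    acc = 0.0
    for r in range(k):                        # one "cycle" per kernel row
        row = feature_map[i + r, j:j + k]     # X_{n,0}..X_{n,k-1} of this row
        acc += float(np.dot(row, kernel[r]))  # sum of X_{n,m} * W_m
    return acc
```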
All neurons on a given output feature map are produced by convolving the input feature maps with the same convolutional kernel. Therefore, until all of the input feature maps have been traversed by the convolution window, the weight parameters required for the computation remain constant, and new weight data are read only after the corresponding kernel's convolutions are complete. Suppose the convolutional kernel size of a neural network is l×l; then each clock cycle requires l new input neuron values to be loaded, and when the kernel completes all of its computations, l×l new weights must be reloaded. A single neuron or weight value occupies b bytes of storage, the number of convolution window positions on an input feature map is \(k \times k\), the number of input feature map channels is m, and there are multiple output feature maps. The data read for the convolution computation of a particular input feature map can then be estimated from these quantities, as in the sketch below.
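As a rough model of the quantities just defined (an assumption on our part, since the source's expression is not reproduced here): loading one l×l kernel costs l·l·b bytes, and sliding the window across the k×k positions fetches l new neurons per position.

```python
def offchip_bytes_per_input_map(l, k, b):
    """Hedged estimate of off-chip reads for one input feature map under
    the in-window parallel scheme: the l*l kernel weights are loaded once,
    and each of the k*k window positions streams l new neuron values."""
    weight_bytes = l * l * b
    neuron_bytes = k * k * l * b   # l fresh neurons per window position
    return weight_bytes + neuron_bytes
```

Across m input channels, the total traffic scales as m times this per-map amount.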
Convolutional computation produces intermediate results that are stored in on-chip memory, and this cache only needs to be large enough to hold the intermediate results produced by one input feature map. When the size of the parallel computation structure matches the size of the convolutional kernel, the scheme of parallel computation inside the convolution window yields the greatest speedup. However, when FPGA hardware resources are insufficient, or when the neural network model uses convolutional kernels of several sizes, the parallel computation structure cannot match every kernel size, and the computational efficiency decreases [16]. When computing a given output feature map, all of the input feature maps correspond to the same convolutional kernel, so only l×l weight parameters need to be loaded in turn after the corresponding computation completes; in addition, only a small amount of on-chip storage is needed to hold the weights used in the current computation, so the method of parallelism inside the convolution window has a low on-chip storage occupation rate.
$$\frac{{\partial \sigma (x)}}{{\partial x}}=\sigma (x)\left( {1 - \sigma (x)} \right)$$
2
Although parallel computation is performed between convolution windows at different locations on the input feature map, adjacent windows partially overlap, which means that the input neuron data from the overlapping region can be reused across windows. Reusing the input neurons requires an additional data-reading module with corresponding control logic in the control module, but it reduces off-chip data reads. A data cache copy is added to the module that stores the input neuron data in order to retain the data to be reused; when neuron data are needed, a multiplexer (MUX) performs data selection and routes the data to the different computation units. As a convolutional neural network proceeds to its later layers, the input feature maps shrink, the number of windows that can be computed in parallel on the same map decreases, and that number is not always an integer multiple of the parallelism. The acceleration efficiency then drops, and part of the parallel computation structure sits idle.
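A minimal sketch of this reuse mechanism (the names and structure are ours): a small cache retains the k−1 columns shared by adjacent windows, so only one new column is fetched per step; the hardware MUX is modeled here by simple list selection.

```python
import numpy as np

def windows_with_column_reuse(row_block, k):
    """Adjacent windows overlap in k-1 columns; a cache copy keeps those
    columns so each step fetches only one new column from storage."""
    cache = [row_block[:, c].copy() for c in range(k)]    # first window
    windows = [np.stack(cache, axis=1)]
    for col in range(k, row_block.shape[1]):
        cache.pop(0)                            # column that slid out
        cache.append(row_block[:, col].copy())  # the only new fetch
        windows.append(np.stack(cache, axis=1))
    return windows
```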
$$coth(x)=\frac{{{e^{ - x}}+{e^x}}}{{{e^x} - {e^{ - x}}}}$$
3
In practical application scenarios, hardware systems receive multiple concurrent requests for convolutional neural network inference, which may involve the same model or several different models [17]. In this case, FPGA acceleration must consider not only how to speed up model inference through parallel computing but also how to schedule the computation when multiple tasks are concurrent. As noted in the previous section, FPGA bandwidth and computational resources are limited, so computing multiple convolutional neural network tasks in parallel multiplies the resource demands, especially for bandwidth.
$$x_{j}^{l}=f\left( {\sum\limits_{{i \in {M_j}}} {x_{i}^{l - 1} * k_{{ij}}^{l}} +b_{j}^{l}} \right)$$
4
In a conventional "CPU + FPGA" acceleration architecture, the CPU must participate in the entire computational process of the convolutional neural network, i.e., both computation and data access proceed under CPU control. Although logic control is the CPU's strength, convolutional neural network inference is highly repetitive and follows a fixed procedure; it is therefore not difficult to implement the control flow in hardware, and the hardware is faster. In the architecture presented in this paper, the CPU is responsible only for receiving tasks, configuring the task sequence, and writing the task inputs to off-chip storage: it performs one write operation per task, and the rest of the work is left to the control module. The CPU thus does not need to remain occupied and can enter a low-power state or perform other tasks after the write completes. For the same control logic, the hardware implementation consumes less power than the CPU.
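The following sketch illustrates the single-write handoff; the descriptor fields are hypothetical (ours, not the paper's), chosen only to show what one task write might carry.

```python
from dataclasses import dataclass

@dataclass
class TaskDescriptor:
    """Hypothetical task record: the CPU performs a single write of one
    descriptor per task, after which the FPGA control module drives the
    whole computation without further CPU involvement."""
    model_id: int      # which network model to run
    input_addr: int    # off-chip (DDR) address of the task input
    output_addr: int   # off-chip address where the result is written
    length: int        # input size in bytes

def submit(task_sequence: list, task: TaskDescriptor) -> None:
    task_sequence.append(task)  # the CPU's one write; it may now idle or switch tasks
```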
$$P{C_l}=\left( {F \times F \otimes FM_{l}^{2}} \right) \oplus F{M_l}$$
5
The size of the input feature map of the convolutional layer is \(M \times M\), and the size of the convolutional kernel is \(N \times N\), where \(M\) and \(N\) can be configured according to the needs of different convolutional layers, with the specific values provided by the task sequence module. The pooling and fully connected layers use similar configurable operation units. Because there is a data dependency between the layers of a convolutional neural network, i.e., the output of the nth layer serves as the input to the (n+1)th layer, different layers never compute simultaneously, so reusing the configurable operation units causes no conflict.
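A software analogue of such a configurable unit (a sketch under our assumptions, not the actual RTL): M and N arrive as runtime configuration, and the same unit is reused layer by layer because layer n+1 only starts after layer n finishes.

```python
import numpy as np

def configurable_conv(x, kernel, M, N):
    """One reusable operation unit: M (input size) and N (kernel size)
    are runtime configuration supplied by the task sequence module."""
    assert x.shape == (M, M) and kernel.shape == (N, N)
    out = np.empty((M - N + 1, M - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + N, j:j + N] * kernel)
    return out

# Layers run strictly in sequence, so reuse causes no conflict, e.g.:
# y1 = configurable_conv(x, k1, M=32, N=5)
# y2 = configurable_conv(y1, k2, M=28, N=3)
```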
FPGAs, as general-purpose logic devices, have no particular advantage over ASICs in storage bandwidth. FPGAs with high bandwidth are expensive, and a large number of off-chip data accesses increases power consumption. Therefore, when designing computing units, it is important to minimize off-chip storage accesses through data reuse and thereby reduce the bandwidth demands on off-chip storage [18]. Parallel computation between input feature maps has the highest bandwidth demand, and parallel computation between output feature maps places a high demand on on-chip storage; the parallel design inside the convolution window, however, has the highest data reuse rate and meets the design requirements of this study. This study therefore adopts the structure of parallel computation within the convolution window, i.e., the convolution operations of all corresponding points within the window are computed simultaneously. When all of the convolution windows on one input feature map have been computed, new convolution kernel data are loaded; when all of the kernels corresponding to one input feature map have been computed, a new input feature map is loaded, and so on until all input feature maps are processed, as shown in Fig. 2.
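The loop order just described can be summarized in a short sketch (our paraphrase of Fig. 2, not code from the paper):

```python
def conv_schedule(input_maps, kernels):
    """Loop order of Fig. 2: for each input feature map, all windows are
    computed for the current kernel in parallel; a new kernel is then
    loaded, and only after every kernel has been consumed is the next
    input feature map loaded."""
    for fmap in input_maps:        # outermost: load a new input feature map
        for kern in kernels:       # then: load new convolution kernel data
            yield fmap, kern       # all windows computed in parallel here
```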
With a multilayer autoencoder architecture, the same operations can be repeated in the encoding and decoding stages as needed, and stacking multiple hidden layers in the encoder and decoder yields a deep autoencoder. Deep autoencoders suffer from the vanishing and exploding gradient problems common to deep learning, and specific measures are required to mitigate them. Generative adversarial networks (GANs) and variational autoencoders (VAEs) are used in generative tasks more often than autoencoders. A GAN generates content from input noise; however, to generate specific content, the entire distribution must be searched, and it is difficult to select the particular noise that yields specific features.
The concept of generative models has a long history in machine learning research; they are commonly used to model data with conditional probability density functions. Generative models are probabilistic models with a joint probability distribution over observations and target values; they were not widely used until deep learning drove their application in various scenarios. Deep learning is a data-driven technique, and increasing the number of samples can bring significant performance improvements, but collecting and labeling the data comes at great cost [19]. Learning reusable feature representations from large unlabeled datasets has therefore become an important research direction. Deep learning algorithms involve optimization in many places, the most important being the neural network training process: even a single training run requires considerable computational resources and time, which makes practical deployment expensive, so research on optimization algorithms for deep neural networks and on lightweight network techniques is crucial. Optimization for deep model training differs from traditional optimization in that machine learning usually acts indirectly: in most problems, the focus is a performance measure P defined on the test set, which may be intractable, and P can therefore only be optimized indirectly.
3.2 Mathematical engineering modeling and optimization analysis of public data in multitasking systems
The DBN can capture the correlation of system state monitoring data in both the time and space dimensions, and the EM algorithm can solve the parameter estimation problem under incomplete state monitoring data. Applying the EM algorithm and the DBN model together to learn the system state mapping relationship is therefore worth investigating. However, the existing EM algorithm is only suitable for estimating time-varying parameters, and the time-domain consistency of the parameter estimates cannot be guaranteed when it is used for node CPT estimation [20]. On the other hand, when the available system state monitoring data are scarce, the state monitoring information at different moments is correlated, and simply ignoring this correlation fails to make full use of the valid information in the monitoring data. Therefore, under the framework in which the DBN serves as the system reliability model, this section proposes a customized EM algorithm that ensures the time-domain consistency of the subsystem and system node CPT estimates by dividing the DBN topology into multiple V-shaped basic structures and, following the idea of parameter modularity, integrating the monitoring information of the same V-shaped structure at different moments.
The DBN is first decomposed into multiple V-shaped structures. For any node \(X(t)\), the basic structure formed by the node and its parent nodes is called a V-shaped structure. If all nodes in the DBN contain state data, the V-shaped structure of node \(X(t)\) is independent of the other nodes in the DBN, and the unknown parameters in the CPT of node \(X(t)\) can be estimated from the state data within its V-shaped structure alone. However, because the system state monitoring data are incomplete, this independence assumption no longer holds. To solve this problem, the algorithm completes the monitoring data by taking the distribution expectation of nodes without state monitoring information as their monitoring information [21]. Notably, the CPT parameters of the variable nodes corresponding to the same object on different time slices should be identical. This method therefore integrates the state monitoring information embedded in the V-shaped structures of the same object at different time slices and then unifies the CPT parameters of the corresponding nodes, ensuring the time-domain consistency of the estimates. Using the three-component system shown in Fig. 3 as an example, the CPT parameters of nodes \(S_1(t)\) and \(S(t)\) in the system's DBN need to be estimated.
The unknown parameters are estimated with a customized EM algorithm under DBN parameter modularity. In the EM algorithm, the incomplete data in each time slice are completed using distribution expectations, and on this basis the unknown parameters in the CPTs of the subsystem and system nodes are estimated; the algorithm then refines these estimates through iterations of the E and M steps. Concretely, the incomplete state monitoring data are filled to complete monitoring data at each time slice using distribution expectation values. For each subsystem or system node whose CPT parameters are to be estimated, the joint state probability distribution of the node and its parent nodes is calculated using Eq. (6), which integrates the state monitoring information of the V-shaped structures of the corresponding nodes of the same object across time slices.
$$E\left[ {N\left( {{S_m}=k,pa({S_m})=j} \right)\left| D \right.} \right]=\sum\limits_{{n=1}}^{N} {\sum\limits_{{t=0}}^{T} {P\left( {{S_m}(t)=k,pa({S_m}(t))=j\left| {{D_n}} \right.} \right)} }$$
6
$$\theta _{{m,j,k}}=\frac{{E\left[ {N\left( {{S_m}=k,pa({S_m})=j} \right)\left| D \right.} \right]}}{{\sum\limits_{{k^{\prime}}} {E\left[ {N\left( {{S_m}=k^{\prime},pa({S_m})=j} \right)\left| D \right.} \right]} }}$$
7
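The E and M steps of Eqs. (6) and (7) can be illustrated with a minimal sketch for a single V-shaped structure (one child node and its parent). This is our simplified version, which assumes the parent is fully observed and pools counts over all time slices so the CPT stays identical across slices.

```python
import numpy as np

def em_cpt(parent_obs, child_obs, n_parent, n_child, iters=50):
    """EM for one V-shaped structure: missing child observations are
    None and contribute their expected distribution (E-step, Eq. 6);
    the CPT is then renormalized from the pooled counts (M-step, Eq. 7)."""
    theta = np.full((n_parent, n_child), 1.0 / n_child)   # initial CPT
    for _ in range(iters):
        counts = np.zeros((n_parent, n_child))
        for p, c in zip(parent_obs, child_obs):
            if c is None:
                counts[p] += theta[p]      # expected (soft) count
            else:
                counts[p, c] += 1.0        # observed (hard) count
        theta = counts / np.maximum(
            counts.sum(axis=1, keepdims=True), 1e-12)     # M-step
    return theta
```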
The system state monitoring data are obtained based on the simulation of the real parameters of each node’s CPT. It is worth noting that to verify the generality of the proposed method in terms of learning both deterministic and probabilistic system state mapping relations, the real values of the parameter vectors of the subsystem and system node CPTs in the three-component system are randomly generated by using the Monte Carlo simulation method, and both deterministic and probabilistic state mapping relations are covered.
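A sketch of this Monte Carlo generation step, under our reading of the setup: probabilistic CPT rows are Dirichlet draws, while deterministic rows place all probability mass on a single randomly chosen state.

```python
import numpy as np

def random_cpt(n_parent_states, n_child_states, deterministic, rng=None):
    """Randomly generate a ground-truth CPT covering both deterministic
    and probabilistic state mapping relations (assumed procedure)."""
    rng = rng or np.random.default_rng()
    if deterministic:
        cpt = np.zeros((n_parent_states, n_child_states))
        hit = rng.integers(0, n_child_states, size=n_parent_states)
        cpt[np.arange(n_parent_states), hit] = 1.0   # one certain outcome
        return cpt
    return rng.dirichlet(np.ones(n_child_states), size=n_parent_states)
```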
$$s(u,v)=\frac{{\left| {{V_u} \cap V_{u}^{\prime }} \right|}}{{\left| {{V_u} \cup V_{u}^{\prime }} \right|}}$$
8
$${p_{u,v}}=\frac{{\sum\nolimits_{{w \in W}} {s(u,v){r_{u,w}}} }}{{\sum\nolimits_{{w \in W}} {s(u,w){r_{v,w}}} }}$$
9
The original input to the modeling method is the field point cloud of the target area, so an online incremental point cloud acquisition subsystem is built first. The captured point cloud data are presented in the head-mounted display, overlaid on the physical world, so the user can directly perceive the acquisition process, which makes data acquisition more convenient and intuitive. In addition, the system does not capture data continuously in real time but leaves the initiative to the user: whenever users decide they need to capture data, they can capture the current frame with a simple gesture command, which avoids capturing useless data and wasting computing power [22]. Although HoloLens 2 provides a certain amount of computing power, processing large amounts of data on the device remains difficult and often manifests as heating and system lag.
The captured point cloud data are then transferred to the server side, where an iterative closest point (ICP)-based alignment algorithm registers the current point cloud to the historical point cloud, reducing the alignment error caused by pose estimation in the HoloLens. A density-based denoising algorithm then removes outlier points to obtain a cleaner field point cloud, as shown in Fig. 4. Finally, the processed point cloud data are transferred back to HoloLens 2 for presentation to the user.
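A server-side sketch of these two steps, assuming the Open3D library (the paper names the algorithms, not a library): ICP refines the HoloLens pose estimate, and a radius-based (density) filter drops isolated points.

```python
import open3d as o3d

def align_and_clean(current, history, voxel=0.01):
    """Register the current frame to the historical point cloud with
    point-to-point ICP, then remove low-density outlier points."""
    result = o3d.pipelines.registration.registration_icp(
        current, history, 5 * voxel,
        estimation_method=o3d.pipelines.registration.
        TransformationEstimationPointToPoint())
    current.transform(result.transformation)   # apply the refined pose
    cleaned, _ = current.remove_radius_outlier(nb_points=16,
                                               radius=5 * voxel)
    return cleaned
```

The distance thresholds and neighbor counts here are placeholder values; in practice they would be tuned to the sensor's point density.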
Techniques for extracting edges from 2D images are now relatively mature, but the edges extracted from 2D images usually contain many nongeometric edges, such as texture edges on object surfaces or edges specific to a particular viewpoint [23]. Since we want to extract only the geometric edges of the scene and ignore texture edges, we analyze the edge information directly on the 3D point cloud, specifically using an algorithm based on the angle and half-disk criteria to extract the boundaries of the field point cloud.
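A simplified sketch of the angle criterion (our illustration; the half-disk test is omitted): project a point's neighbors onto a local tangent plane; an interior point is surrounded on all sides, so a large angular gap between consecutive neighbors marks a boundary point.

```python
import numpy as np

def is_boundary(p, nbrs, gap_threshold=np.pi / 2):
    """Angle-criterion test for one point p with neighbor array nbrs
    of shape (k, 3); returns True if p lies on a boundary."""
    d = nbrs - p                                   # offsets to neighbors
    _, _, vt = np.linalg.svd(d, full_matrices=False)
    # vt[0], vt[1] span the local tangent plane
    angles = np.sort(np.arctan2(d @ vt[1], d @ vt[0]))
    gaps = np.diff(np.append(angles, angles[0] + 2 * np.pi))
    return gaps.max() > gap_threshold              # large gap => boundary
```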
$$\xi _{p}^{n}=\left( {{Q^n},{Q^{n+1}}} \right)_{\Gamma }^{2}$$
10
$${\left( {X_{h}^{{n+1}},{v_h}} \right)_{{\Gamma _h}}}+\frac{2}{3}\Delta t\beta {M_p}{\left( {{\nabla _{{\Gamma _h}}}W_{h}^{{n+1}},{\nabla _{{\Gamma _h}}}{v_h}} \right)_{{\Gamma _h}}}={\left( {g_{h}^{n},{v_h}} \right)_{{\Gamma _h}}}$$
11
For fluid-surfactant systems, the nonlinear coupling term poses a great obstacle to proving energy stability, so a proof of energy stability for the second-order discrete schemes remains open; in numerical experiments, however, both schemes have been verified to satisfy unconditional energy stability. Common geometric shapes such as circles and rectangles are frequently found in manufactured objects, and drawing these shapes can aid the modeling process. However, if a shape type had to be specified each time a stroke is drawn, the fluidity of modeling would be interrupted [24]. To mitigate this problem, we use a neural network to predict the shape type of each stroke, parameterize the stroke as the predicted type of geometry, and reconstruct the stroke accordingly.
Points are then sampled uniformly on all 3D contour lines, and these points are triangulated to create the mesh model. Finally, the user can optionally specify that one or both ends of the created mesh are left hollow. In the implementation, to enhance user perception, the mesh model is updated in real time as the trajectory strokes are drawn and displayed to the user in the head-mounted display.
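The uniform sampling step can be sketched as arc-length resampling of each contour polyline (our illustration of the step that precedes triangulation of adjacent contours):

```python
import numpy as np

def resample_contour(pts, n):
    """Uniformly sample n points along a 3D contour polyline (pts has
    shape (m, 3)) by cumulative arc length."""
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative length
    t = np.linspace(0.0, s[-1], n)                # n uniform stations
    return np.stack([np.interp(t, s, pts[:, k]) for k in range(3)],
                    axis=1)
```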