PCGC: a performance compact graph compiler based on multilevel fusion-splitting rules

Existing deep learning compilers cannot perform efficient hardware-aware graph fusion when both time and power consumption are considered. In addition, they optimize the computational graph of deep neural networks (DNNs) by performing static graph transformations based on greedy algorithms, considering only runtime performance and ignoring the cost of the tuning process. To solve these problems, this paper proposes PCGC, a DNN computational graph optimization compiler. Using performance feedback collected at runtime, PCGC applies a computational graph fusion and splitting strategy based on multilevel operator fusion-splitting rules. First, PCGC uses a rule-guided graph segmentation algorithm to recursively segment the computational graph into smaller subgraphs, enabling an efficient and fine-grained search. Then, PCGC uses a cost model that receives feedback from hardware performance measurements; the cost model and the operator fusion rules jointly guide the partial fusion and partitioning of the nodes and edges of the computational graph, generating optimal subgraphs flexibly for different hardware and narrowing the search space for partial fusion. Finally, the cost model is made to converge quickly to a preset loss value by manually adjusting its parameters. Compared with other advanced compilers, PCGC optimizes the overall power consumption on an embedded GPU by an average of 130.5% while the time consumption on each hardware platform is no worse than the average time consumption. On domain-specific architectures, PCGC optimizes power consumption by an average of 66.5%. On FPGA, PCGC optimizes power consumption by 66.1%. In a sense, PCGC can achieve high-speed inference in power-constrained scenarios, reducing the carbon emissions of edge computing.


Introduction
In the past decade, the rapid development of deep learning technology has placed a burden on its compilation and deployment on real devices. On the one hand, the continuous expansion of DNN scale leads to rapid growth in hardware computing power requirements; models often need to run on a variety of general-purpose and specialized hardware, such as different types of CPU, GPU, FPGA, DSA [1], etc. On the other hand, the application scenarios of deep learning are increasingly widespread and the performance requirements keep rising, resulting in the emergence of new network structures and a large number of new operators. Networks such as YOLO [2], BERT [3], and GPT [4] all consist of operators with different types, shapes, and connection relationships [5][6][7][8][9]. The diversity of hardware platforms and the increasing number of operators multiply the cost of manually developing an optimal operator implementation for each scenario.
Model compilation acceleration frameworks for deep learning can automatically generate optimized code for any hardware device, solving the problem of high labor costs. Among their techniques, computational graph fusion optimization is the most important [10]. The computational graph of a DNN is a directed graph comprising a set of nodes, each representing an operation, and a set of directed edges, each marking a relationship between nodes (data transfer and control dependencies). The speed of neural network training and inference can be improved by computational graph optimization [11][12][13][14][15].
Most deep learning compilers classify operators and formulate operator fusion rules through self-designed intermediate representations (IRs), but such static rules cannot adapt to specific hardware platforms and generate optimal subgraphs. TVM [16] uses Halide IR [17] to represent tensor operators, which can express computation and scheduling in a relatively flexible way. However, TVM also suffers from mismatches between the operators of the mapped computational graph and the target hardware. Although TVM claims to support mainstream deep learning frameworks and a large number of neural network models, few networks can currently be adapted by TVM to DSA and FPGA targets, for the following reasons: Relay quantization currently supports only a limited set of operations, and TVM's VTA [18] likewise supports only a limited set of operations.
Several algorithms have been developed for DNN optimization. MetaFlow [19] minimizes graph substitution using the maximum-flow algorithm; however, it does not optimize operator fusion or consider nonlinear operators. TASO [20] generates equivalent subgraphs and substitutions using graph substitution but does not consider double computation. Rammer [21] relies on manual scheduling templates and traverses the entire computational graph once, resulting in high compilation costs. APOLLO [22] uses a bottom-up multilayer optimization approach for operator parallelism but does not consider the underlying integration. IOS [23] proposes a dynamic parallel method for operator grouping but does not describe how its cost model works, is not closely tied to the hardware, and does not support other types of embedded hardware. DNN optimization compilers should consider power consumption in addition to inference time.

In this paper, information from the hardware runtime layer and the basic fusion rules of the operators are used to partition and fuse the computational subgraphs. In the operator layer, layers are fused in advance based on the multilevel fusion rules of the different operators, and the subgraphs to be evaluated are marked for dynamic tuning. By collecting the actual operating power and time consumption of the hardware, a cost evaluation model is established to decompose the feedback-guided subgraphs and balance the final operating performance. PCGC is compiled and adapted for the HUAWEI Atlas 200 DK, the Nvidia Jetson Nano GPU, and the FPGA ZCU102, realizing adaptation and optimization for multiple back ends.
To sum up, this paper has the following contributions.
• Through a multilevel rule partitioning strategy, PCGC performs computational graph fusion and partitioning earlier, reducing the search space and achieving a more lightweight search.
• PCGC uses a cost model to receive hardware performance feedback; the cost model and the operator fusion rules jointly guide the fusion and decomposition of the nodes and edges of the graph, generating optimal subgraphs flexibly for different hardware devices. The output of the cost model is a representation of the combined computational graph.
• PCGC combines hardware back-end code generation tools to further improve DNN inference performance. Compared with other advanced compilers, PCGC optimizes the overall power consumption on an embedded GPU by an average of 130.5% when the time consumption on each hardware platform is shorter than the average time consumption. On DSA, PCGC optimizes power consumption by an average of 66.5%. On FPGA, PCGC optimizes power consumption by 66.1%. In a sense, PCGC can achieve high-speed inference in specific power supply scenarios, reducing the carbon emissions of edge computing.
This paper is organized as follows: Sect. 2 summarizes the related work and the motivation of this paper. Section 3 introduces the overall design and core strategies of the method. Section 4 presents the experimental results. Section 5 concludes the paper.


Related work

Figure 1 shows most of the existing advanced DNN accelerated compilers. This paper investigates two kinds of methods: graph-level optimization and tensor operator optimization. MetaFlow minimizes graph substitution across subgraphs using the maximum-flow algorithm; however, it lacks the ability to optimize operator fusion across subgraphs, and its cost model is advantageous only for dense linear DNN operators. TASO primarily utilizes graph substitution and the operators available in its library to generate equivalent subgraphs and make substitutions, but it does not consider double computation. Rammer relies on manual scheduling templates, which makes the compilation cost high, and the whole computational graph is traversed once. APOLLO uses a bottom-up multilayer optimization approach in which parallelism between operators is achieved by combining uncorrelated results from the first two layers; however, it does not take the underlying integration into account, for example on open FPGA hardware architectures, resulting in missed optimization opportunities. IOS proposes a dynamic parallel method among operators based on a cost model, mainly adopting a recursive algorithm that logically groups graphs on the GPU; however, it does not describe how its cost model works, is not closely tied to the hardware, and does not support other types of embedded hardware. In contrast, this paper defines semi-regular cost optimization to realize efficient and energy-saving deployment in embedded inference scenarios, taking the power consumption of the hardware into account rather than focusing only on inference time.

Motivation
On the one hand, existing computational graph tuning methods cannot fully optimize hardware inference power consumption. On the other hand, to contain the optimization cost of DNNs, computational graph optimization must be combined with operator optimization rules so as to bring the maximum benefit while reducing the search space. Therefore, this paper starts from a computational graph fusion method that combines operator fusion rules with hardware performance feedback, reducing the search space and optimizing the power consumption and time of model inference by re-evaluating partial subgraphs with a hardware-aware cost model.

Overall of system architecture
PCGC is applied to embedded inference scenarios and supports inference acceleration for most DNNs. PCGC is compatible with mainstream deep learning frameworks and mainly targets DSA, FPGA, GPU, and other hardware.
The main process of PCGC is shown in Fig. 2. PCGC takes the ONNX export of the model as input and converts it to TVM's Relay IR for further processing. For the subgraph at this stage, operator fusion is first carried out according to the static operator fusion rules to form new composite subgraphs. The operators in each subgraph are then parallelized according to the hardware resources, and a second operator fusion is carried out to form larger aggregated subgraphs. On these subgraphs, cross-boundary optimizations such as constant propagation and common subexpression elimination are performed. Next, according to the records of the preceding fusion steps, a list of the original operators and the fusion operators is formed. The cost model is built on the target hardware platform, and the fusion effect is evaluated. If the evaluation shows that performance becomes worse after fusion, the operator is split. Finally, the generated subgraphs are used as the input of the operator layer to generate the corresponding hardware code.
The purpose of using Relay IR is to remain compatible with existing deep learning compilers so that this method can be widely applied in the field of deep learning compilation acceleration. Composite operators in the original model are expanded into fine-grained operators: one purpose is to further optimize the interior of the original operators; the other is to unify the operator types, reduce the number of operators provided by the framework, and facilitate the subsequent static fusion and cost model construction. The operator fusion and parallelization steps accelerate inference by reducing main-memory reads and maximizing the utilization of hardware resources. Performing operator fusion before parallelization reduces the amount of data handled by parallelization and allows parallelism between operators and within operators to be handled cooperatively. Finally, the cost model is used to evaluate each operator's power and time consumption, and fusion operators are segmented so that the graph optimization scheme can dynamically adapt to different hardware back ends. After fusion and parallelization, the cost model's evaluation of an operator's performance is closer to the result of its actual execution on hardware, and the predicted value is more accurate.

Relay IR and extended computational graph
PCGC accepts mainstream models such as PyTorch and TensorFlow as input and transforms them into a custom IR that describes the structure of the computational graph. The IR defines the calculation type of each operator, the number of inputs and outputs of the operator, and other necessary parameters of the calculation. For fusion operators, which are composed of several operators, the information on these constituent operators is recorded. The resulting semantic tree is shown in Fig. 3.
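The per-operator information the IR records can be sketched as a small data structure; the field names below are illustrative assumptions, not PCGC's actual identifiers.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OpNode:
    """One node of the extended computational graph (field names are illustrative)."""
    op_type: str                 # calculation type, e.g. "CONV", "RELU"
    num_inputs: int              # number of inputs of the operator
    num_outputs: int             # number of outputs of the operator
    attrs: dict = field(default_factory=dict)                 # other calculation parameters
    fused_from: List["OpNode"] = field(default_factory=list)  # constituents of a fusion op

    @property
    def is_fused(self) -> bool:
        return len(self.fused_from) > 0

# a CONV-RELU fusion operator records the operators it is composed of
conv = OpNode("CONV", 1, 1, {"kernel": (3, 3)})
relu = OpNode("RELU", 1, 1)
fused = OpNode("CONV-RELU", 1, 1, fused_from=[conv, relu])
```

Keeping the constituent operators inside each fusion node is what later allows the dynamic stage to re-split a fusion that the cost model finds unprofitable.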

Static fusion
In this paper, a static fusion strategy with multilevel fusion rules is proposed, which combines operator characteristics to achieve reasonable fusion within and between graph nodes. The overall process is as follows: according to the IR produced by the transformation, the computational graph is traversed in reverse order from the output, and operators that meet the fusion rules are fused. The process is iterated several times until no operator in the graph can be fused, and the fused computational graph is then reassembled.
The operators commonly used in deep learning are divided into the following six categories:
(1) Algebraic operators: element-wise calculation operations between two tensors, including RELU, ADD, MUL, etc.
(2) Broadcast operators: operators that handle tensors of different shapes by copying; if the shape of a tensor needs to be adjusted during the calculation of an algebraic operator, it is also counted as a broadcast operator. Examples include BN.
(3) Reduction operators: operations that reduce the number of elements in a tensor along a specified axis, including SUM, ARGMAX, etc.
(4) Complex operators: common compute-heavy operators in neural networks, such as CONV5 × 5 and FC.
(5) Infusible operators: operators that yield no profit after fusion with other operators, including constants.
(6) Two-phase operators: operators whose fusion benefit differs across hardware, so it cannot be determined in advance whether fusion is profitable; these include MAXPOOL, CONV1 × 1, and so on.
On the basis of the aforementioned categories, a set of static fusion rules has been established and supported by historical experiments; Fig. 4 displays a visual representation of the rules. (The same operator may be classified differently depending on its specific parameters; for example, CONV1 × 1 is a two-phase operator, while CONV5 × 5 is a complex operator. The specific classification depends on the actual situation.) A stable fusion rule means that the fusion improves performance on most hardware platforms; for example, the fusion CONV + RELU + BN conforms to stable fusion rule one. A prohibition rule means that in most cases the fusion degrades performance or has no impact on overall performance. Apart from the fusion types forbidden by the prohibition rules, operators can be fused as long as they conform to the basic fusion rules.
If an operator pairing conforms to a stable fusion rule, the fused operator will not be segmented later; otherwise, whether the fusion is beneficial is evaluated by the cost model in the operator segmentation stage, and if the fusion degrades performance, the fusion operator is disassembled again. The operator fusion completed at this stage cannot guarantee a positive effect on inference in every case, because the static rules cannot capture the characteristics of every hardware platform; such cases are corrected later by the cost model.
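The static stage above can be sketched as follows for a linear chain of operators. The category table and the set of stable pairs are illustrative assumptions standing in for the rules of Fig. 4; operators whose fusion is only "tentative" would later be re-checked by the cost model.

```python
# Illustrative category table and rule sets standing in for the rules of Fig. 4;
# the real tables in PCGC are larger and hardware-validated.
CATEGORY = {
    "RELU": "algebraic", "ADD": "algebraic", "MUL": "algebraic",
    "BN": "broadcast", "SUM": "reduction", "ARGMAX": "reduction",
    "CONV5x5": "complex", "FC": "complex",
    "CONST": "infusible",
    "MAXPOOL": "two-phase", "CONV1x1": "two-phase",
}
STABLE_PAIRS = {("complex", "algebraic"), ("complex", "broadcast")}  # assumed

def cat(name: str) -> str:
    # a fused operator inherits the category of its first constituent (assumption)
    return CATEGORY[name.split("+")[0]]

def fuse_decision(producer: str, consumer: str) -> str:
    """'stable' (never re-split), 'tentative' (cost model re-checks), or 'none'."""
    p, c = cat(producer), cat(consumer)
    if p == "infusible" or c == "infusible":
        return "none"                      # prohibition rule
    if (p, c) in STABLE_PAIRS:
        return "stable"
    return "tentative"

def static_fusion(ops):
    """Reverse-order sweeps over a linear chain, repeated until no fusion applies."""
    changed = True
    while changed:
        changed = False
        for i in range(len(ops) - 1, 0, -1):   # traverse from the output backwards
            if fuse_decision(ops[i - 1], ops[i]) != "none":
                ops[i - 1:i + 1] = ["+".join([ops[i - 1], ops[i]])]
                changed = True
                break
    return ops
```

For example, `static_fusion(["CONV5x5", "BN", "RELU"])` collapses the chain into one fused operator, while a constant producer blocks fusion entirely.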

Parallel processing
We learn from Rammer's idea [21] and adapt it into a parallel optimization method for nodes and edges under multilevel rules, combined with the characteristics of the operators. The underlying hardware resources are abstracted into virtual computing units, and the computations to be carried out are scheduled onto these units so that operators can execute in parallel. Through parallelism within each fusion operator and parallelism between fused operators, the hardware resources are fully utilized. The parallel structure is shown in Fig. 5.
Internal parallelism of operators: the parallel logic between operators is combed while merging them, and computations without data dependence inside the fused operator are executed in parallel where hardware resources permit. Through loop-optimization techniques, parallelism is not limited to the operators being fused; the computation within each operator is also accelerated.
Interoperator parallelism: After operator fusion, continue to do parallel operations on the fusion operator. When the fusion operator is parallel, the internal parallelism of the operator is readjusted at the same time. If the extra resources can achieve interoperator parallelism in the case of reducing the internal parallelism of operators, the interoperator parallelism strategy is given priority.
(1) Cross-boundary optimization: the computational graph is traversed using topological sorting. First, a node with in-degree 0 is selected and executed; the node and its outgoing edges are then deleted, and the next node with in-degree 0 is processed. During the traversal, constant folding, inline optimization, and common subexpression extraction are performed. (2) Constant folding: if all of the inputs on which a calculation depends are constants, the result is calculated directly, replacing the node in the graph. The calculation content already processed during the traversal is recorded; if an identical calculation is encountered again, it is directly replaced (common subexpression elimination).
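The traversal above can be sketched with Kahn's topological sort plus a folding step. The node encoding and the "every op is an addition" semantics are simplifying assumptions for illustration only.

```python
from collections import deque

def topo_fold(nodes, edges, const_vals):
    """
    Kahn-style topological traversal with constant folding.
    nodes: iterable of node names; edges: (src, dst) data-dependence pairs;
    const_vals: known constant values for leaf nodes.
    """
    indeg = {n: 0 for n in nodes}
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for s, d in edges:
        indeg[d] += 1
        preds[d].append(s)
        succs[s].append(d)

    folded = dict(const_vals)          # node -> constant value, grows during the walk
    order = []
    ready = deque(n for n in nodes if indeg[n] == 0)
    while ready:
        n = ready.popleft()
        order.append(n)
        # fold only when ALL inputs are already constant
        if preds[n] and all(p in folded for p in preds[n]):
            folded[n] = sum(folded[p] for p in preds[n])
        for m in succs[n]:             # delete the node's outgoing edges
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order, folded
```

A node whose inputs are all constants is replaced by its value during the same walk, so downstream nodes can fold transitively.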

Cost model
A dynamic, partially rule-guided graph optimization method related to hardware performance is proposed in this paper. The main function of the cost model in PCGC is to evaluate the time and power consumption of each operator on the target hardware platform without actual deployment, providing a benchmark for the dynamic fusion and segmentation of the computational graph. To achieve automatic optimization aware of the target hardware platform, a new, most appropriate cost model must be built for each different hardware. A cost model based on deep learning would require a large amount of data during construction, and obtaining those data takes a long time in practice; to speed up the optimization process, the amount of data must be reduced. Therefore, a semi-supervised regression model is used as the benchmark model. The process of building the cost model is shown in Fig. 6. First, the corresponding pretrained model is extracted according to the hardware architecture, such as CPU, GPU, FPGA, or DSA. By traversing the computational graph, PCGC extracts the features of all operators (each fusion operator and the operators that compose it), including the size of the input data, the type of the input data, the number of operators composing the fusion operator, and the type of operation. At the same time, PCGC dynamically adjusts the representation of the computational graph according to the information of the hardware to be deployed. The hardware information mainly includes the supported parallelism, the supported memory transactions (such as 8-byte or 32-byte reads and writes), shared-memory restrictions, register restrictions, etc. Then, PCGC selects 10% of the operators to deploy on the hardware, and the measured time and power consumption are used as label data. Following the semi-supervised algorithm, the labeled data are used to build the model and predict the unlabeled data.
We build the cost model based on Support Vector Regression in its primal form [24]. We pre-label a portion of the time- and energy-consumption data of the operators and specify the fusion rules. The cost model outputs the fusion decisions for the given operators. When the difference between the error function value and the squared error function value is greater than a given error loss, a larger penalty is applied to the squared error until the trained model performs up to the standard. Finally, PCGC adds high-confidence predictions to the labeled dataset, treats them as labeled data, and retrains the model. When more than a certain proportion of the dataset is labeled, training ends and a trained cost model is obtained.

The input of the algorithm is the operator list P to be tested and a parameter AC related to the model accuracy; the output is a trained cost model. Line 1 loads the pretrained model, and line 2 randomly shuffles the input operator list and takes the top entries as the first batch for obtaining runtime data. Lines 4-10 are the model-training loop. First, the actual inference time, power consumption, or other required indicators of the operators in list T are obtained on the hardware and added to the labeled dataset. A cost model is trained on the labeled data and then used to predict the remaining unlabeled data. The operators with the highest predicted performance are taken out and prepared for the next round of actual testing. When the proportion of labeled data in the overall dataset exceeds AC, construction of the cost model ends.
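Under the assumptions that roughly 10% of the operators are measured per round and that a trivial nearest-neighbour regressor stands in for the paper's SVR model, the loop of Algorithm 1 might look like this (`measure`, `fit`, and the `flops` feature are hypothetical stand-ins):

```python
import random

def measure(op):
    """Stand-in for deploying an operator on hardware; returns (time, power)."""
    return (op["flops"] * 0.01, op["flops"] * 0.005)

def fit(labeled):
    """Trivial nearest-neighbour regressor over one feature; the paper uses SVR here."""
    def predict(op):
        nearest = min(labeled, key=lambda o: abs(o["flops"] - op["flops"]))
        return nearest["cost"]
    return predict

def build_cost_model(P, AC=0.5, seed=0):
    """Sketch of Algorithm 1: self-training cost-model construction."""
    random.seed(seed)
    random.shuffle(P)                       # line 2: shuffle the operator list
    k = max(1, len(P) // 10)                # measure roughly 10% per round
    T, unlabeled = P[:k], P[k:]
    labeled = []
    while len(labeled) / len(P) <= AC:      # stop once the labeled share exceeds AC
        for op in T:                        # lines 4-10: measure on hardware, add labels
            op["cost"] = measure(op)
            labeled.append(op)
        if not unlabeled:
            break
        model = fit(labeled)                # train on labeled data
        ranked = sorted(unlabeled, key=lambda o: model(o)[0])
        T, unlabeled = ranked[:k], ranked[k:]   # best-predicted ops go to the next round
    return fit(labeled), labeled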

Dynamic recutting of subgraph
Based on the computational graph fusion method guided by the static and dynamic multilevel rules described earlier, this paper further proposes a dynamic partial graph optimization based on hardware performance feedback to realize the re-segmentation of partial subgraphs, with the aim of better meeting the graph optimization requirements of target hardware inference. The computational graph is traversed in reverse order; according to the records of the fusion operators, operators that do not conform to the stable fusion rules are re-segmented, and the operators before and after segmentation are recorded. The recorded content is mainly the operator features needed by the cost model.
After segmentation, the internal parallel logic of some fusion operators is disrupted and needs to be rearranged according to the rules of interoperator parallelism; this must also be considered when training the cost model. The time and power cost models are used to evaluate the operators before and after segmentation. If time and power consumption decrease after segmentation, the segmentation is adjusted according to the user's needs: if the user requires minimum time, the operator is segmented when time consumption decreases; if the user requires minimum power, it is segmented when power consumption decreases. To prevent excessive time or power consumption, an upper limit on power consumption is set, and if the limit is exceeded, the segmentation is readjusted. The specific process is shown in Fig. 7.
Compared with other computational graph optimization methods, this method provides optimal computational graphs for different hardware, solving the problem of low coupling between the graph layer and the hardware; it saves a large amount of manual optimization by using the computer to automatically generate candidate optimal subgraphs and evaluate their performance after actual deployment; and it comprehensively considers and optimizes both the power consumption and the time of the inference process. Fusing before segmentation combines dynamic optimization with static optimization, reducing the optimization cost and improving the optimization effect.
In order to elaborate the subgraph re-decomposition process in more detail, this paper presents Algorithm 2: Subgraph Dynamic Re-cutting.
The input of the algorithm is the current subgraph list G, the record R of the operator fusion process, the optimization target Target, and a threshold H specified by the user. Target is a string corresponding to the optimization objective, such as "time" or "power." The output of the algorithm is the subgraph sequence Gʹ after splitting. Lines 2-25 traverse the list G in turn, looking for possible optimizations. The function in line 4 searches R for the records corresponding to the subgraph, which guide the splitting process; each fusion point is also a splittable point. If there is no corresponding fusion record, the granularity of the subgraph is already the finest and no further splitting is possible, so it is added directly to Gʹ. Lines 11-17 traverse the records in turn and apply the candidate splits of subgraph g accordingly. Line 12 calls the cost model to evaluate the primary optimization target after splitting. When the value is less than the currently recorded optimum and, after applying this split to the overall model, the secondary objective does not exceed the threshold H, the optimal value and the corresponding split mode are updated. If the evaluation finds that performance is best without splitting, subgraph g is added directly to Gʹ. If there is an optimal split, g is split according to its pattern and the resulting subgraphs are added back to G, awaiting the next evaluation.
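A compact sketch of Algorithm 2, under the simplifying assumptions that subgraphs are plain identifiers and the fusion record maps each subgraph to its candidate splits:

```python
def dynamic_recut(G, R, cost, target="time", H=float("inf")):
    """
    Sketch of Algorithm 2: re-split fused subgraphs guided by the cost model.
    G      - list of subgraphs (here: plain names, an illustrative encoding)
    R      - fusion records: subgraph -> list of candidate splits
    cost   - cost model: subgraph -> {"time": t, "power": p}
    target - primary objective; the other metric must stay below threshold H
    """
    out = []
    work = list(G)
    other = "power" if target == "time" else "time"
    while work:
        g = work.pop(0)
        splits = R.get(g)
        if not splits:                  # finest granularity: nothing left to split
            out.append(g)
            continue
        best_val = cost(g)[target]
        best_split = None
        for s in splits:                # try each recorded split point
            val = sum(cost(p)[target] for p in s)
            sec = sum(cost(p)[other] for p in s)
            if val < best_val and sec <= H:
                best_val, best_split = val, s
        if best_split is None:
            out.append(g)               # the fused form is already the best
        else:
            work.extend(best_split)     # re-evaluate the pieces in later rounds
    return out
```

Pieces produced by a split are pushed back onto the work list, mirroring the algorithm's "added to G again, waiting for the next evaluation" step.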

Several implementation methods
Several implementation methods of PCGC are described below. The actual implementation is not limited to several implementation methods described in this paper. In order to better explain how this method is effectively optimized, first of all, some related technical concepts and principles are described.

Case study 1
Through this example, the principle by which this method reduces model inference time is briefly explained. A simple DNN structure is shown on the left side of Fig. 8, and the result of the optimization is shown on the right side. The original structure needs to store the data in memory after the CONV calculation and, in preparation for BN, move the data from memory into the on-chip cache; after computing BN, the data must be moved twice more to start the RELU layer. Following the principle of operator fusion, the optimized structure keeps the CONV output in the on-chip cache and performs the BN operation directly; after the BN calculation, the data remain in the cache, and the RELU operation is performed directly. Compared with the original structure, four data movements between memory and cache are saved, along with a large amount of inference time.

Case study 2
Through this example, the strong correlation between this method and the hardware is briefly explained. The left side of Fig. 9 shows the structure before the fusion of two operators in a neural network, the middle shows the two operators fused on hardware A, and the right side shows the result on hardware B. For different hardware A and B, this method constructs different cost models and outputs different computational graphs according to the operator running times evaluated by each cost model.
For hardware A, we build a cost model p. The cost model p predicts that the running time of CONV1 is t1, the running time of CONV2 is t2, and the running time of the composite operator after fusion is t3. Because t3 < t1 + t2, it is decided to fuse the operator. The output is the computational graph in the middle of Fig. 9.
For hardware B, we built a cost model q. The cost model q predicts that the running time of CONV1 is t1ʹ, the running time of CONV2 is t2ʹ, and the running time of the composite operator after fusion is t3ʹ. Because t3ʹ > t1ʹ + t2ʹ, it is decided to split the operator and output the computational graph shown on the right side of Fig. 9.
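In both cases the decision reduces to a single comparison of predicted costs, sketched below (the function name is illustrative):

```python
def fuse_or_split(t_a, t_b, t_fused):
    """Keep the fusion only if the cost model predicts it runs faster."""
    return "fuse" if t_fused < t_a + t_b else "split"

# hardware A: t3 < t1 + t2  -> "fuse";  hardware B: t3' > t1' + t2' -> "split"
```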
Although using the same set of operator fusion strategies for all hardware platforms has a small optimization cost, such strategies are obviously not suitable for every kind of hardware, and performance is lost on some of it. This method dynamically modifies the output computational graph for different hardware so that the power consumption of the model on the corresponding hardware is minimized, while the operator fusion and splitting rules also take time-based optimization into account.

Case study 3
This example briefly illustrates how this method performs operator fusion on complex models. The left of Fig. 10 shows part of the structure of GoogLeNet [25], and the right shows the structure after static operator fusion with this method. The computational graph is traversed in reverse order. The Concat operator takes input from an infusible operator, so it is not fused. CONV1 × 1 and CONV3 × 3 satisfy the basic fusion rules but not the stable fusion rules, so they are fused into the CONV3-CONV1 operator, and the benefit of this fusion will be reevaluated later. The CONV1 × 1 and CONV5 × 5 operators satisfy the basic fusion rules and are fused into the CONV5-CONV1 operator. One CONV1 × 1 and MaxPool3 × 3 pair satisfies the basic fusion rules but not the stable fusion rules, so it is fused into the MaxPool3-CONV1 operator; another CONV1 × 1 and MaxPool3 × 3 pair does not meet the basic fusion rules and is not fused. CONV3 × 3 and RELU satisfy the stable fusion rule and merge into the CONV3-RELU operator, which will not be segmented later. The fusion of the other operators is similar.

Case study 4
Through this example, the parallel processing of the computational graph by this method is briefly explained. Figure 11 marks the part of the computational graph that is parallelized after static operator fusion; the other parts only do parallel processing within each operator. There is data dependence between the operators in the blue box, so they are processed serially. There is no data dependence between the operators in the red box, so they can be processed in parallel. The degree of parallelism is determined by the abstract virtual computing units. If the hardware has four computing units that can perform convolution simultaneously, the four operators in the red box can run fully in parallel, and the final running time depends only on the operator with the longest running time. If the hardware has only one computing unit that can perform convolution, the four operators in the red box can only run serially, and the final running time is the sum of the four operators' times.
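The two limiting cases above fall out of a simple list-scheduling model of the virtual computing units; the greedy longest-processing-time heuristic below is an illustrative assumption, not PCGC's actual scheduler.

```python
def branch_runtime(times, units):
    """
    Runtime of independent branches on `units` identical virtual computing units,
    using greedy longest-processing-time list scheduling.
    With units >= len(times) this reduces to max(times); with units == 1, to sum(times).
    """
    loads = [0.0] * units
    for t in sorted(times, reverse=True):
        loads[loads.index(min(loads))] += t   # place each branch on the least-loaded unit
    return max(loads)
```

For four branches taking 4, 3, 2, and 1 time units, four units give a runtime of 4 (the longest branch), while a single unit gives 10 (the sum).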

Case study 5
Through this example, the operator segmentation process applied to a DNN structure is briefly explained. Operators that do not meet the stable fusion rules are reevaluated by the cost model, and the inference time or power consumption before and after fusion is predicted (selected according to actual needs); fused operators with higher power consumption are disassembled again. As shown in Fig. 12, after evaluation by the cost model, the inference time and power consumption of CONV1-CONV3 are found to decrease after fusion, so the fused state is kept. For the CONV1-CONV5 fusion operator, the evaluation finds that inference time and power consumption increase after fusion, so the two operators are disassembled and become unfused again.

Environment setup
PCGC supports the model files generated by most mainstream frameworks, such as PyTorch, TensorFlow, MXNet, Caffe, and MindSpore, and transforms them into a unified graph representation. This work builds on TVM: at the Relay optimization level, we perform further subgraph fusion combined with hardware information. We add two passes before all TIR passes to achieve operator fusion: a subgraph fusion pass guided by the cost model and a subgraph splitting pass. We use Python to describe the algorithm flow and C++ to implement the corresponding hardware execution engine. The time and power consumption of a subgraph are measured directly in the execution engine to guide the scheduling process. The execution engine is based on the backend libraries provided by TVM and also provides backend support for other DSAs. To execute multiple sets of operators concurrently, PCGC places different models into the corresponding CUDA and CANN streams as required. If computing resources are sufficient, the kernels in different CUDA and CANN streams execute in parallel. Throughout the experiments, we used cuDNN 7.6.5 [26], CUDA 11.4, Nvidia driver 470.74, CANN 5.1, TensorRT 7.13 [27], and TVM 0.8 as the base libraries.
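The position of the two new passes in the lowering pipeline can be sketched as follows. The pass names come from the text; the pipeline machinery is a pure-Python stand-in for illustration, not TVM's actual pass infrastructure:

```python
# Sketch: the two PCGC passes run at the Relay level, before all TIR passes.
# Each pass is modeled as a function that annotates a list-based "module".
def cost_model_fusion_pass(module):
    return module + ["subgraph-fusion"]     # guided by hardware feedback


def subgraph_splitting_pass(module):
    return module + ["subgraph-splitting"]  # undoes unprofitable fusions


def tir_lowering_passes(module):
    return module + ["tir-lowering"]        # stand-in for all TIR passes


def compile_module(module):
    pipeline = (cost_model_fusion_pass,     # new PCGC pass 1
                subgraph_splitting_pass,    # new PCGC pass 2
                tir_lowering_passes)        # existing TVM lowering
    for p in pipeline:
        module = p(module)
    return module


assert compile_module(["relay-ir"]) == [
    "relay-ir", "subgraph-fusion", "subgraph-splitting", "tir-lowering"]
```

In the real system these would be registered as TVM pass objects; the sketch only fixes their ordering relative to TIR lowering.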

Results
In the experiments, we benchmark several modern DNNs, including ResNet [28], YOLO, and others. Besides auxiliary operators such as Concat, we use the operator types appearing in these models as the basic scheduling units of the computational graph. Some models (e.g., ResNet) expose potential parallelization opportunities; for example, for ResNet-50 and ResNet-34 [29], we achieve a 10-15% acceleration by executing independent convolutions in parallel.
For each model, we conducted five experiments and report the average performance. Each experiment is tuned on an Nvidia Tesla V100 GPU, and inference is performed on a Xilinx ZCU102 FPGA, an Nvidia Jetson Nano, and a Huawei Atlas 200 DK. Subject to the inference latency, we optimize power consumption for each target.
We designed experiments to evaluate the average power consumption of running a model on a single image. In each experiment, we prepare 10,000 images for the edge-computing hardware to run inference on and collect the average operating power through the official API. At the same time, we record the time consumption of each group of experiments and compare the average per-image inference time with that of the most advanced model compilers.
There are eight baselines in this evaluation: XLA [30], TASO, MetaFlow, TVM (VTA), APOLLO, AKG [31], Xilinx DPU, and TensorRT. For a fair comparison, we only compare against the compilers available on each hardware platform. Figure 13 and Tables 1, 2, 3, 4, 5, 6, and 7 show the measured values of PCGC on the benchmark DNNs against the other baseline compilers. Compared with existing compilers, PCGC achieves a substantial optimization of power consumption across different hardware and time-consuming inference tasks on different models. For the FPGA platform, we choose the Xilinx ZYNQ ZCU102 for verification. We use Vivado HLS (v2022.1) for C code compilation and logic simulation, simulate and debug the algorithm function, and add hardware constraints and timing simulation to the simulation platform. Finally, the simulation platform generates the onboard bitstream, which is deployed to the Zynq UltraScale+ MPSoC ZCU102. For the GPU platform, we chose the low-power Nvidia edge-computing platform Jetson Nano. Here, we can compare more compilers, because most compilers only support GPUs. Running the DNN models, we count the inference time for 10,000 images of size 1024 × 1024 and the average power recorded by official tools. The measured data are shown in Tables 3 and 4; PCGC achieves more optimized inference than the existing compilers.
For DSA, we chose the Huawei Atlas 200 DK with the Ascend 310 chip for the experiment. The experimental process is similar to that of the embedded GPU: we count the inference time for 10,000 images of size 1024 × 1024 and the average power recorded by official tools. We obtain the time and power consumption of each model by analyzing the profiling data and record them. We find that PCGC performs better in terms of power optimization.
To make our experimental results clearer, we added an experiment running the graph fusion test for ResNet-50 on the Jetson Nano device. We measured fusion under different configurations. When our proposed fusion rules run together with the cost model, the acceleration and the reduction in energy consumption are very pronounced. We collected the fusion-time statistics of ResNet, used these data to guide the fusion rules within the cost model, and achieved optimized inference. The statistics are shown in Table 7.
We also explored larger models in the tuning experiments. We experimented with models from natural language processing and with convolutional neural networks for computer vision (classification, detection, etc.), compared against the Ansor [32] baseline; the experimental platform is an Nvidia V100 GPU. In the comparison experiments, we ran at least 512 trials per task. For models with a small number of tasks (such as BERT), we increased the number of trials to 20,000 to ensure convergence. Figure 13 shows the speedup relative to Ansor.
In the current results, neither Ansor nor PCGC uses Tensor Cores. Some baseline deep learning acceleration libraries, such as cuBLAS [33] and cuDNN, may achieve significant acceleration on some compute-intensive models (e.g., Transformers [34]) with the help of Tensor Cores.

Conclusion
In this paper, we proposed a new subgraph fusion and splitting optimization method that maximizes the inference performance of a model on a hardware platform through the partial optimization of multilevel rules and a cost model constrained by hardware performance metrics. We also considered the execution rules of different operators on each hardware platform, optimized the parallel and partial execution of the model's computational graph, and evaluated the computing time and power consumption of nearly ten benchmark networks on FPGA, GPU, and DSA using an image data set to assess the performance of the whole compiler system. Compared with other advanced compilers, PCGC optimized the overall power consumption on the Jetson Nano by an average of 130.5% when the time consumption on each hardware platform was not lower than the average time consumption. On the Atlas 200 DK, PCGC optimized power consumption by an average of 66.5%; on the ZCU102, by 66.1%. In a sense, PCGC can achieve high-speed inference in power-constrained scenarios, reducing the carbon emissions of edge computing. PCGC still needs to be enhanced with respect to time-consumption optimization in order to achieve the best energy consumption, and in future work we will also work to maintain the generalization ability and accuracy of the model on top of the existing work.
Author contributions DD and HL participated in the design of the study and performed the statistical analysis. DD and HJ conceived of the study and participated in its design and coordination and helped to draft the manuscript. DD and YS prepared figures and carried out experiments. All authors read and approved the final manuscript.