An Efficient Real-Time Embedded Application Mapping for NoC Based Multiprocessor System on Chip

The performance metrics and inter-core communication of a Network on Chip (NoC) architecture are significantly affected by the rapidly growing number of components integrated on a single chip. It is therefore crucial to map the cores effectively so that communication between them improves. Throughput and latency have the greatest impact on network performance in a NoC. This paper presents ERTEAM, an efficient mapping strategy for real-time embedded applications. The algorithm selects the mapping region on the basis of the minimum core average distance, which reduces the overall mapping area. The processing elements (PEs) are then mapped within the selected region according to the minimum communication energy. ERTEAM is evaluated on a set of embedded applications: latency is reduced by 12.3% and 8.4%, simulation time is reduced by an average of 19% and 9.6%, throughput is increased by 14.5% and 7.8%, and communication energy is reduced by 15.6% and 5.2% against Branch and Bound Based Mapping (BBPCR) and segmented brute-force mapping (SBMAP), respectively. The proposed ERTEAM is simulated and tested on the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit using the Xilinx Vivado 2020.2 software platform. The hardware implementation results show improved delay and area metrics.


Introduction
A new ecosystem has emerged for semiconductor devices in which complex tasks and features are integrated into a single package, referred to as a System on Chip (SoC). According to the International Technology Roadmap for Semiconductors (ITRS 2.0), this emerging ecosystem comprises heterogeneous implementations with electronic components linked to various application domains, including high-performance computing (HPC), IoT, Big Data, and Cloud Computing [1]. SoC designs have traditionally used bus-based structures, which do not scale well as an application's communication needs grow. The Network-on-Chip (NoC) connectivity approach was developed to resolve these limitations [2].
The core elements of a NoC are the network interfaces (NIs), the routers or switches, and the connectivity links. The cores in a NoC communicate with each other over the interconnection links using packet-based switching. On the intended NoC platform, the detailed design of a NoC architecture comprises task partitioning of the application, task management and scheduling, and application mapping. The required task schedule and the processing time are strongly associated with task partitioning and task allocation. The tasks are then assigned to the application cores for execution.
The integrative research issues of the NoC design paradigm can generally be divided into four categories. Among these, mapping the application cores onto the NoC platform is identified as one of the most critical and fundamental problems in NoC design [3,4]. Mapping designs are classified into two types, namely static mapping and dynamic mapping. Static mapping does not depend on the network's current state and uses static paths for transferring data. Dynamic mapping depends entirely on the network's current state, and data is transferred on the basis of the traffic at run time.
The primary responsibility of any mapping technique is to map the tasks to the cores available on the chosen platform. The application then executes its tasks as mapped and produces the required output. As the number of cores increases drastically, many mapping techniques have been proposed to provide reliable results. It is therefore essential to follow certain rules that take into account the critical shortcomings of present NoC methodologies. Following these rules, an efficient mapping technique entitled efficient real-time embedded application mapping (ERTEAM) is implemented. This mapping strategy is mainly intended for real-time embedded applications. It selects a mapping region through the minimum core average distance (CAD) and maps the PEs within the reduced mapping area.
The major contributions of the proposed ERTEAM algorithm are as follows:
• The mapping approach, which targets real-time embedded systems, reduces the mapping area by using the minimum core average distance (CAD) and maps the PEs inside that region.
• The mathematical calculations for CAD and for the communication energy are analysed and explained in detail using an example.
• The proposed algorithm is evaluated on a set of real-time embedded applications and improves their performance.
• The proposed method is tested on the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit using the Xilinx Vivado 2020.2 software, where it improves the delay and area metrics.
The remainder of this paper is organised as follows. Sect. 2 reviews the related work and Sect. 3 provides the model analysis of the NoC architecture. Sect. 4 explains the details of the proposed mapping strategy, and the experimental outcomes are presented in Sect. 5. Sect. 6 provides the hardware verification of the proposed algorithm. Sect. 7 concludes the paper and outlines the future scope.

Related Work
As core mapping plays a crucial role in NoC architectures, many research works have proposed efficient mapping strategies to improve performance metrics. Some of the recent mapping methodologies and their outcomes are listed below. Li et al. [5] implemented a thermal-aware runtime mapping algorithm that optimizes the overall performance of 3D NoCs. The available core regions are restored through the defragmentation algorithm introduced in this mapping. Li et al. [6] implemented a mechanism for mapping irregular IPs onto a regular 2D mesh topology for NoC architectures. The core principle is to break down each big IP into several smaller dummy IPs, each of which fits into a single tile, reducing energy consumption and avoiding congestion.
Liu et al. [7] proposed the TopoMap algorithm for SMART NoC architectures to improve performance. The topology of the architecture is selected dynamically, based on the configuration, by the thermal-aware task mapping algorithm. Jiang et al. [8] developed a mapping strategy based on the BB algorithm to provide both core mapping and communication mapping. This scheme reduces the overall latency and energy of the hybrid NoC and optimizes the overall mapping. Bhanu et al. [9] proposed a technique to provide a fault-tolerant system and verified it through both simulations and FPGA validation. The application is first mapped onto a Torus topology, taking the fault-tolerance mechanism into account, through ILP and PSO. The technique thus provides a complete mathematical approach for replacing a faulty core with a spare core.
Khan et al. [10] implemented the BEMAP algorithm, in which real-time applications are mapped under bandwidth constraints. This mapping mechanism uses a modular systematic searching technique in which the system is divided into the smallest possible modules, and the mapping is performed on both Torus and Mesh topologies. It reduces the overall latency and energy consumption of 2D NoC architectures. Liu et al. [11] proposed the BBPCR algorithm to find the optimal mapping for an application. First, a highly accurate and flexible PCM model is developed that contains both energy and reliability parameters. BBPCR then uses this model to find the best mapping solution for an application. It significantly improves reliability and lowers energy consumption and latency. The SBMAP [12] mechanism considers bandwidth constraints in order to minimize energy consumption and computational complexity. It also uses the modular systematic searching technique: the system is divided into the smallest possible modules and the mapping is performed on them, resulting in high performance and less simulation time.

Background
A network core graph (NCG), $G = G(P, A)$, is a directed graph whose vertices represent the processing elements (PEs) available for task execution, $P = \{P_1, P_2, P_3, \dots, P_n\}$. Each directed arc $a_{ij} \in A$ carries the characteristic parameters and the required bandwidth between IP cores $P_i$ and $P_j$.
A NoC architecture graph (NAG), $A = A(C, D)$, is a topology graph whose nodes, $C = \{C_1, C_2, C_3, \dots, C_n\}$, represent the network cores. The directed arc set $D$ represents the communication distance, $\forall\, C_{ij} \in D$, where $C_{ij}$ denotes the distance between core $C_i$ and core $C_j$.
A NoC architecture mapping graph (NMG), $M = M(C, D)$, is a topology graph whose nodes, $C = \{C_1, C_2, C_3, \dots, C_n\}$, represent the mapped network cores. The directed arc set $D$ represents the communication distance, $\forall\, C_{ij} \in D$, where $C_{ij}$ denotes the distance between core $C_i$ and core $C_j$.

Core Average Distance (CAD)
CAD is the shortest average path length between any two cores in the network. For a NoC of size X × Y, the average distance between any two selected vertices is evaluated as shown in Eq. (1), and the resulting CAD value is used to determine the mapping region of the NoC [13,14].
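To illustrate how the CAD of a candidate region might be computed, the following Python sketch averages the hop (Manhattan) distance over all distinct pairs of cores in an X × Y mesh. The function name `core_average_distance` and the choice of averaging over distinct ordered pairs are assumptions made for illustration; the exact form of Eq. (1) may differ.

```python
from itertools import product


def core_average_distance(x_size, y_size):
    """Mean Manhattan distance over all distinct core pairs of a mesh.

    Assumption: cores sit on integer grid coordinates and the distance
    between two cores is the hop count |a0 - a1| + |b0 - b1|.
    """
    cores = list(product(range(x_size), range(y_size)))
    if len(cores) < 2:
        return 0.0  # a single core has no pairs to average over
    total = 0
    pairs = 0
    for a0, b0 in cores:
        for a1, b1 in cores:
            if (a0, b0) != (a1, b1):
                total += abs(a0 - a1) + abs(b0 - b1)
                pairs += 1
    return total / pairs


compact = core_average_distance(3, 3)  # 3 x 3 region
strip = core_average_distance(1, 9)    # 1 x 9 strip with the same core count
```

Under this definition the 3 × 3 region has a lower average distance than a 1 × 9 strip containing the same number of cores, which is why a compact region is preferred as the mapping area.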

Measurement of Communication Energy
Communication energy is considered equivalent to the distance between two tiles or nodes [15,16]. The distance between two vertices $V_i$ and $V_j$, where $V_i$ has coordinates $(a_0, b_0)$ and $V_j$ has coordinates $(a_1, b_1)$, is the sum of the absolute differences of their corresponding coordinates, i.e. $|a_0 - a_1| + |b_0 - b_1|$ [17].
The total communication energy (TCE) is then calculated as given in Eq. (2),
where $W(E_{ij})$ is the weighted communication energy between any two nodes in the network.
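The following Python sketch shows one way the total communication energy could be evaluated for a given placement, assuming TCE is the sum over all communicating pairs of the weight W(E_ij) multiplied by the Manhattan distance between the tiles of the two endpoints. The function names, the vertex labels and the weights are hypothetical and serve only as an illustration.

```python
def manhattan(tile_a, tile_b):
    """Hop distance between two tiles given as (a, b) coordinates."""
    return abs(tile_a[0] - tile_b[0]) + abs(tile_a[1] - tile_b[1])


def total_communication_energy(weights, placement):
    """Sum of weight x distance over all communicating vertex pairs.

    weights:   {(src, dst): W(E_ij)}, hypothetical communication weights
    placement: {vertex: (a, b)}, the tile assigned to each vertex
    """
    return sum(
        w * manhattan(placement[src], placement[dst])
        for (src, dst), w in weights.items()
    )


# Hypothetical three-vertex example on a 2 x 2 mesh.
weights = {("P1", "P2"): 40, ("P2", "P3"): 10, ("P1", "P3"): 5}
placement = {"P1": (0, 0), "P2": (0, 1), "P3": (1, 1)}
tce = total_communication_energy(weights, placement)  # 40*1 + 10*1 + 5*2
```

Placing heavily communicating vertices on adjacent tiles keeps the weighted sum small, which is the quantity the mapping step tries to minimize.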

Measurement of Performance
Throughput and latency are considered important metrics for performance improvement [18][19][20]. Since network congestion significantly affects latency, avoiding congestion at each node is an efficient way to minimize latency. At the same time, less congestion results in increased throughput. As a result, the bandwidth limitation, which is closely related to congestion, is treated as the performance limitation. The communication volume of each node is therefore managed through bandwidth restrictions, so that congestion is reduced and performance, in terms of both latency and throughput, is guaranteed [21].

Proposed ERTEAM Technique
Problem Definition
The efficient real-time embedded application mapping problem is defined as follows: given a network core graph (NCG), $G = G(P, A)$, and a NoC architecture graph (NAG), $A = A(C, D)$, find a mapping function $M(C, D)$ that maps an IP core $c_i \in C$ in the NCG to a PE in the NoC.
Let $a_{ij} \in A$ be mapped to some $P_{xy} \in C$; then $P_{xy} = \Omega(C_i) \in D$.
The communication energy between two mapped PEs is $CE(P_i, P_j) = W(e_{ij}) \times |P(i)P(j)|$ in terms of nodes, that is, $CE(P_i, P_j) = W(e_{ij}) \times C_{ij}$. Here $C_{ij}$ denotes the distance between core $C_i$, with parameters $(a_1, b_1)$, and core $C_j$, with parameters $(a_2, b_2)$, and $W(e_{ij})$ denotes the communication rate from $P_i$ to $P_j$.
The proposed core mapping algorithm is described in Algorithm 1. The network core graph (NCG) and the NoC architecture graph (NAG) are the inputs, and the NoC architecture mapping graph (NMG) is the output. First, the efficient mapping region is selected using the minimum core average distance (CAD), which reduces the mapping area. Then, the processing elements (PEs) in the NCG are mapped onto the efficient mapping region of the NoC according to the minimum communication energy. Figure 1 illustrates the procedure with an example. A simple network core graph is shown in Fig. 1a and a 5 × 5 NoC architecture graph is shown in Fig. 1b. As the NCG has 7 vertices, the efficient mapping region is selected based on CAD, here a region of size 3 × 3, as shown in Fig. 1c. Finally, the NCG vertices are mapped onto the 3 × 3 region according to the minimum communication energy of the network, as shown in Fig. 1d.
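The two stages described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the region is restricted to rectangles large enough to hold all vertices, ties in CAD are broken towards the more square region, the region is anchored at tile (0, 0), and an exhaustive search over placements stands in for the paper's placement step. All function names, vertex labels and weights are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product


def manhattan(t, u):
    """Hop distance between two tiles given as (a, b) coordinates."""
    return abs(t[0] - u[0]) + abs(t[1] - u[1])


def cad(width, height):
    """Exact mean hop distance over distinct tile pairs of a region."""
    tiles = list(product(range(width), range(height)))
    dists = [manhattan(t, u) for t in tiles for u in tiles if t != u]
    return Fraction(sum(dists), len(dists)) if dists else Fraction(0)


def select_region(num_vertices, mesh_w, mesh_h):
    """Stage 1: smallest-CAD rectangle that can hold every vertex."""
    candidates = [
        (w, h)
        for w in range(1, mesh_w + 1)
        for h in range(1, mesh_h + 1)
        if w * h >= num_vertices
    ]
    # Assumed tie-break: prefer the more square region.
    return min(candidates, key=lambda s: (cad(*s), abs(s[0] - s[1])))


def map_vertices(vertices, weights, region):
    """Stage 2: exhaustive search for the minimum-energy placement."""
    tiles = list(product(range(region[0]), range(region[1])))
    best_cost, best_placement = None, None
    for chosen in permutations(tiles, len(vertices)):
        placement = dict(zip(vertices, chosen))
        cost = sum(
            w * manhattan(placement[s], placement[d])
            for (s, d), w in weights.items()
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_placement = cost, placement
    return best_placement, best_cost


# Hypothetical four-vertex application on a 4 x 4 mesh.
vertices = ["P1", "P2", "P3", "P4"]
weights = {("P1", "P2"): 50, ("P2", "P3"): 30,
           ("P3", "P4"): 20, ("P1", "P4"): 10}
region = select_region(len(vertices), 4, 4)
placement, cost = map_vertices(vertices, weights, region)
```

With these assumptions, `select_region(7, 5, 5)` returns a 3 × 3 region, which matches the region size used in the Fig. 1 example. The exhaustive search is only practical for small graphs, since the number of placements grows factorially with the number of vertices.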

Experimental Results
In this section, we conduct a comprehensive set of experiments to evaluate the effectiveness of the ERTEAM algorithm in terms of mapping performance and communication energy. These metrics are compared with state-of-the-art approaches on a set of embedded applications. The application names and their numbers of cores are shown in Table 1 [22], and the simulation configuration parameters of ERTEAM are listed in Table 2. The best mapping pattern is found using a C++ program, the simulations are carried out on the Noxim simulator [23], and the time consumed is recorded.

Latency
Latency is the time taken by a packet's header flit to travel from a source to a destination in the network. Depending on network congestion, latency frequently includes the packet's waiting time between the source and the destination node, as given in Eq. (4), where K is the total number of packets reaching their destination cores and L_n is the clock cycle latency for the nth node. Table 3 reports the latency of the proposed ERTEAM algorithm (in cycles) compared to BBPCR [11] and SBMAP [12]. The graphical representation of the latency is depicted in Fig. 2.
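Assuming the reported latency is the arithmetic mean of the individual latencies, the symbols K and L_n defined above would combine as in the following sketch. This form is an assumption based on those definitions, and the exact form of Eq. (4) may differ.

```latex
\text{Latency}_{\text{avg}} = \frac{1}{K} \sum_{n=1}^{K} L_n
```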

Simulation Time
Simulation time, also known as execution time, is the overall time required by the system to execute the tasks during the mapping of the cores. A lower simulation time therefore indicates better system performance. Table 4 reports the simulation time of the proposed ERTEAM algorithm (in seconds) compared to BBPCR [11] and SBMAP [12]. The graphical representation of the simulation time is depicted in Fig. 3.

Throughput
Throughput is considered one of the important parameters of system performance. It represents the maximum amount of information that is transferred in a given amount of time. The mathematical formulation of throughput is given in Eq. (5),
where R_p is the total number of received packets, N is the total number of cores, and N_p is the number of clock cycles elapsed from the first generated packet to the last received packet. Table 5 reports the throughput of the proposed ERTEAM algorithm (in terms of cycles/packets) compared to BBPCR [11] and SBMAP [12], and the graphical representation of the throughput is depicted in Fig. 4.
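A form that is consistent with the three symbols defined above, and that is commonly used to report per-core throughput in NoC simulation, is sketched below. It is an assumption based on those definitions, and the exact form of Eq. (5) may differ.

```latex
\text{Throughput} = \frac{R_p}{N \times N_p}
```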

Communication Energy
Communication energy is defined in terms of the distance between any two nodes in the chosen network topology, where the distance is the sum of the absolute differences of their respective coordinates. Table 6 reports the communication energy of the proposed ERTEAM algorithm (in J) compared to BBPCR [11] and SBMAP [12]. The graphical representation of the communication energy is depicted in Fig. 5.
Table 7 summarises the evaluation of the metrics for the proposed ERTEAM algorithm against BBPCR [11] and SBMAP [12]. Latency is reduced by an average of 12.3% and 8.4% against BBPCR [11] and SBMAP [12], respectively, and the overall simulation time is reduced by 19% and 9.6%. Furthermore, the throughput of ERTEAM is improved by an average of 14.5% and 7.8%, and the communication energy is reduced by 15.6% and 5.2%, against BBPCR [11] and SBMAP [12], respectively.

Hardware Verification
The proposed ERTEAM algorithm is tested on the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit using the Xilinx Vivado 2020.2 software platform [24,25]. The delay and the area, in terms of LUTs, are evaluated for 4-bit, 8-bit, 16-bit, 32-bit and 64-bit sizes for the ERTEAM algorithm against BBPCR [11] and SBMAP [12]. The evaluation kit is based on a Zynq UltraScale+ XCZU7EV MPSoC, which pairs programmable logic with a processing system based on a quad-core Arm Cortex-A53 application processor and a dual-core Arm Cortex-R5 real-time processor. Each metric, along with the comparative results, is discussed below.

Delay
Delay is considered one of the most significant performance measures of a system. It is the overall time required for a communication or packet to travel from its source to its final destination. The data transfer rate is a probabilistic parameter. Table 8 reports the delay of the proposed ERTEAM algorithm compared to the BBPCR [11] and SBMAP [12] algorithms. Figure 6 provides the graphical representation of the delay of the ERTEAM algorithm for various bit sizes.

Area
Area is also one of the major metrics for performance improvement in NoC applications. Due to the exponential increase in the number of components on the chip, the area keeps increasing. The proposed ERTEAM algorithm addresses this drawback: its area (in terms of LUT utilization on the FPGA) is reduced by an average of 13% and 8% compared to the BBPCR [11] and SBMAP [12] algorithms, respectively. Table 9 reports the area (in terms of LUT utilization on the FPGA) of the proposed ERTEAM algorithm for various bit lengths (4-bit, 8-bit, 16-bit, 32-bit, 64-bit). Figure 7 provides the graphical representation of the LUT utilization of the ERTEAM algorithm for various bit sizes.

Conclusion
The proposed mapping strategy, ERTEAM, is applied to real-time embedded applications to improve network performance. It chooses the mapping region based on the minimum core average distance. Once the mapping area is determined, the PEs are placed in the arrangement that minimizes the communication energy between the cores. The proposed mapping technique reduces latency by an average of 12.3% and 8.4% against BBPCR and SBMAP, respectively, and reduces simulation time by 19% against BBPCR and 9.6% against SBMAP. In addition, the overall throughput is increased by an average of 14.5% and 7.8% compared to BBPCR and SBMAP. The communication energy of ERTEAM is reduced by 15.6% and 5.2% against BBPCR and SBMAP, respectively. The hardware verification is carried out on the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit using the Xilinx Vivado 2020.2 software platform, where the test results show a significant improvement in the delay and area metrics. In future, we would like to extend this work by implementing an efficient mapping technique based on machine learning (ML) concepts and by providing fault-aware core mapping using a spare core replacement methodology. This would improve the reliability and performance metrics of the NoC system.