High-Performance Application Mapping in Network-on-Chip based Multicore Systems

The allocation of resources and scheduling of tasks, specifically mapping, in multicore systems on-chip (MCSoC) poses significant challenges. Tasks have diverse resource requirements and interact with each other, while a network-on-chip (NoC)-based MCSoC consists of heterogeneous cores and communication networks. The heterogeneity of resources in MCSoC, along with the varying computational and communication demands of applications, makes mapping a complex optimization problem. We have mathematically modeled the mapping problem in NoC-based MCSoC using mixed-integer linear programming (MILP) with the objective of minimizing the execution time of applications. This model incorporates the computation and communication capacities and energy budget constraints of the MCSoC and the execution time requirements of applications. We further propose heuristics, including simulated annealing (SA) and a genetic algorithm (GA), that respect the capacity and budget constraints of MCSoC systems to accelerate applications and provide quick mapping solutions. Simulation results demonstrate that the performance of the SA and GA heuristics is very close (within 10%) to the optimal solutions from MILP across various applications for 2D-mesh NoCs with 16 to 100 cores. The energy consumption of the SA and GA heuristics is also very close to that of the optimal MILP solutions, with a few exceptions on the small-scale 16-core NoC. Additionally, SA outperforms GA in most cases across all applications.


Introduction
Machine learning-based computing, rather than algorithm-based computing, continues to accelerate. A study [1] suggests that the popularity of neural networks will double the computation demands on datacenters. This acceleration of demand puts pressure on building large-scale systems. Fortunately, because of advances in transistor technology, we see a huge increase in the number of cores per chip [2][3][4]. Network-on-chip (NoC) has become the standard communication medium for multi/many-core systems on-chip because of various properties, including scalability and parallelism [5][6][7]. For example, Cerebras has integrated 800K cores into a single NoC-based chip with a trillion transistors [2], which can deliver more performance than a cluster of traditional machines. We could even build a datacenter on a chip, as many-core on-chip systems have better power efficiency, interconnection, and latency than traditional local area network (LAN) based systems [8]. However, the fraction of time an application spends in communication increases with the number of compute nodes in the system; for example, communication time can exceed 50% of an application's execution time in large-scale systems [9]. Therefore, communication is the major bottleneck to achieving high parallelism in systems with many compute nodes, as communication costs much more time and energy than computation [9][10][11].
Given heterogeneous workload requests with diverse computation and communication (data dependency) demands, a challenging research problem is to find the optimal mapping of resource demands onto heterogeneous resources to achieve high-performing task-resource co-allocation. Our target is to achieve both computation and communication efficiency for heterogeneous workloads, without incurring overhead, while mapping onto heterogeneous processor tiles connected by heterogeneous routers and links. Mapping optimization in NoC-based MCSoC faces the new challenges of limited energy and capacity budgets. Billions of transistors are being integrated on a single chip. However, while transistors keep shrinking, current leakage poses greater challenges and power density consequently increases, causing the failure of Dennard scaling [12]. It is not feasible to activate all cores at their maximum speed simultaneously, leaving a large amount of silicon inactive (dark silicon) or operating at a lower frequency (dim silicon), the so-called dark/dim silicon [13]. Mapping solutions should consider the heterogeneous frequencies of computation and communication resources, including the inactive resources. Dark and dim silicon are realized by applying voltage-frequency scaling (including power-gating), called dynamic voltage-frequency scaling (DVFS). The search space for task-to-core mapping can grow quadratically or even exponentially with the number of core, router, and link resources in the MCSoC; for example, the number of communication paths between cores grows in this way as cores are added to a NoC-based MCSoC. The large number of task-to-node assignment options and data communication (routing) options significantly contributes to the size of the solution space and the time complexity required to obtain an optimal solution. The communication paths also depend on the topology of the interconnection network, e.g., mesh or ring. The higher time
complexity of obtaining a solution makes it harder to apply in a real-time-constrained system that must meet application deadlines. The mapping or resource allocation problem, like the bin packing problem, is NP-complete and/or NP-hard [14]: the time complexity grows exponentially with the number of tasks in the application and/or resources in the MCSoC, and no known algorithm solves the problem in polynomial time. Large problem instances cannot be computed with linear programming optimizers (e.g., IBM CPLEX [15]) for optimal solutions, since they fail to produce results in time, especially for run-time mapping decisions. Heuristics can produce solutions quickly without significant hardware overhead, whereas machine learning, a young field with many underexplored research opportunities [16], may require significant training and/or hardware overhead to provide efficient mapping. Heuristics can cover all possible cases of heterogeneous tasks (and traffic), their data dependencies, and heterogeneous resources in complex and large platforms. This motivates the use of heuristics for application-architecture mapping in NoC-based MCSoC.
The major contributions of this work are outlined below:
• Mixed-Integer Linear Programming Model for Mapping: A mathematical model using mixed-integer linear programming for mapping tasks onto NoC-based MCSoC has been proposed and implemented using IBM CPLEX [15] to obtain optimal solutions. The model aims to accelerate applications by reducing the execution times (sum of computation and communication times) of running tasks.
• Evaluation on Small- and Large-Scale Systems: The MILP model and the SA and GA algorithms were evaluated on both small-scale and large-scale systems using garnet [17], a cycle-accurate NoC model on the gem5 [18] platform. The embedded system benchmark suite (E3S) [19] is used for evaluating all the proposed approaches. An in-house simulator was used to generate mapping results for subsequent latency and performance evaluation on the gem5/garnet platform. Mapping of multiple applications is simulated, as every application in E3S contains multiple sub-applications.
Related work is presented in Sec. 2. A mathematical model of mapping tasks onto the NoC-based MCSoC architecture is formulated using MILP in Sec. 3. We then discuss the simulated annealing and genetic algorithm heuristics for mapping tasks onto NoCs in Secs. 4 and 5, respectively. Simulation results in Sec. 6 demonstrate the effectiveness of the proposed high-performance application mapping solutions, particularly the performance of the SA and GA heuristics compared to the optimal solutions from the MILP model.

Related Work
Mapping decisions are driven by various energy and performance goals, such as minimizing the execution time of applications, reducing latency in interconnection networks, and optimizing the energy consumption of the MCSoC. Depending on the application's needs, the mapping strategy maps the tasks onto the system and allocates/configures the system's resources. Various task mapping algorithms have been developed for multicore systems: some are graph-theoretic, some use mathematical programming, and others are heuristic algorithms (we do not discuss machine learning-enabled task mapping in this work). Variations of shortest-tree algorithms have been devoted to mapping highly correlated tasks of the task graph to nearby cores in order to shorten the communication delay [20][21][22][23][24]. For example, [23] maps closely communicating threads to neighboring cores to improve the locality of data accesses. Similarly, the multi-application mappings in [25,26] place the tasks of the same application together to reduce communication distance; for example, [25] finds an optimal region for each application using the maximal empty rectangle technique. Along with shortening the communication distance, the communication traffic is split across multiple paths to improve bandwidth efficiency during task mapping in [20]. The run-time mappings in [27][28][29] place communicating tasks in a close neighborhood to obtain contiguous mapping regions. The hill climbing technique in [29] maps the tasks of an application as close together as possible. The run-time mapping in [28] selects a square region for an incoming application to reduce congestion in the mapped neighborhood. Some work focuses on balancing the load distribution to reduce contention and hotspots [30][31][32][33]. Some work considers dark and/or dim silicon (voltage-frequency scaled) cores during task mapping to reduce the overall energy dissipation in the system [34,35]. Some work minimizes both energy hotspots or contention and
overall chip power [32,35]. Run-time mapping in dark silicon, presented in [34,36], aims to increase the power budget for activating more resources. Some work focuses on thermal management during task mapping in NoCs [37][38][39][40]. Some work employs fault-tolerant task mapping to enhance system reliability while reducing energy consumption [41]. Other work reduces energy consumption and accelerates applications at the same time, such as [42,43] and this paper. This paper is an extension of the paper published in [44], with several modifications, including the implementation of the MILP model using an LP solver and the addition of MILP results. In this work, we address the placement of computation- and communication-demanding tasks of multiple applications onto heterogeneous NoCs under computation/communication capacity, execution time, and energy budget constraints. The proposed mapping approaches accelerate the applications by reducing their execution time, which can indirectly reduce energy consumption. We modeled the task mapping problem using an MILP model and proposed simulated annealing and genetic algorithm heuristics for quick mapping solutions applicable in real-time systems. We implemented all the proposed models and heuristics and compared their results.

Linear Programming Modeling for Mapping
For the linear programming (LP) formulation, we assume a mesh architecture that employs shortest-path routing for data delivery. A 3 × 3 2D-mesh NoC-based multicore system is shown in Fig. 1, where each tile contains a core and a router. Cores perform the computation, and the routers connected to the cores are used for communication between tasks mapped to the cores. Each node (or, interchangeably, tile) i in the multicore system has a frequency denoted by f_i. Two adjacent nodes (routers) i and j are connected by a bidirectional link l_ij with a maximum communication capacity of W_ij.
Given a system that consists of computation and communication resources, this LP model considers a sequence of K applications {A_1, ..., A_K} for mapping. Each application contains a set of Z tasks {t_1, ..., t_Z}. The 2D-mesh NoC-based multicore system-on-chip, in which each tile comprises a router and a core, is shown in Fig. 1. A sample application task graph G with computation and communication demands is shown in Fig. 2. Let (X_k^t)_m be a binary variable indicating the mapping of task t of application k to node (core) m.

Computation and Execution Time Constraints
Assume E_k is the execution time of application k, (E_k^t)_m is the execution time of task t on node m, Z is the total number of tasks in application k, M is the total number of nodes in the system, (C_k^t)_m is the computation time of task t on node m, Q_m is the size (quantum) of the time slice given by node m to each task under the round-robin scheduling policy, and S_m is the time node m needs to switch execution from one task to another. (C_k^t)_m is calculated by dividing the computation demand (cmp)_k^t, the number of computational operations requested by task t of application k, by the processing capacity pc_m (operations per second) of node m:

(C_k^t)_m = (cmp)_k^t / pc_m.   (1)

Since the mapping needs to satisfy the computation time (quality-of-service) requirement E_k^q of every application, we arrive at the computation constraint for each application k:

E_k ≤ E_k^q.   (4)
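As a concrete illustration of Eq. (1), the sketch below computes the computation time of a task and, under a simple assumed accounting of the round-robin quantum Q_m and switching overhead S_m (one context switch per expired quantum; this accounting is our assumption, not the paper's), the resulting time including switching. All numeric values are hypothetical.

```python
import math

# Illustration of Eq. (1) plus an assumed round-robin overhead model
# (one context switch per expired quantum -- an assumption for illustration).

def computation_time(cmp_demand_ops, pc_ops_per_s):
    """(C_k^t)_m = (cmp)_k^t / pc_m."""
    return cmp_demand_ops / pc_ops_per_s

def time_with_switching(c_time, quantum_q, switch_s):
    """Add S_m overhead for each expired quantum Q_m (assumed accounting)."""
    switches = math.ceil(c_time / quantum_q)
    return c_time + switches * switch_s

# A 1-GOPS node executing a task of 2 million operations:
c = computation_time(2_000_000, 1_000_000_000)
print(c)                                      # 0.002 s
print(time_with_switching(c, 0.001, 0.0001))  # two quanta expire, so +0.0002 s
```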

Communication and Link Constraints
Mapping needs to consider the capacity of the communication links in the NoC and the traffic on those links, as the execution time of tasks depends on the communication time with tasks mapped to remote nodes in the MCSoC. Adjacent tasks T_k^s and T_k^t in the task graph communicate with a traffic load of (com)_k^{st} + (com)_k^{ts}. When they are mapped to two nodes, e.g., p and q, the traffic is delivered through a path P_h^{pq} in the multicore system, which consists of a set of NoC communication links. Such a path is determined by the underlying routing algorithm, the XY-routing algorithm in this work. Multiple traffic flows may simultaneously traverse a communication link l_mn; thus the total communication traffic load on a link is the sum, over all communicating task pairs (s, t) of every application k mapped to nodes p and q whose path P_h^{pq} contains l_mn, of the pairwise traffic:

ω_mn = Σ ((com)_k^{st} + (com)_k^{ts}), for all flows routed through l_mn.   (5)

Since each communication link l_mn has a maximum bandwidth capacity of W_mn, we arrive at the communication constraint:

ω_mn ≤ W_mn, for every link l_mn.   (6)

The communication time on link l_mn is determined by dividing the traffic load ω_mn by the corresponding bandwidth W_mn:

ω_mn / W_mn.   (7)
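The per-link traffic load of Eq. (5) can be accumulated by walking the XY-routing path between the two mapped nodes. The sketch below is a minimal illustration for a 2D mesh; the row-major node numbering and the 4 × 4 example are our assumptions.

```python
# Sketch of the XY-routing path used to accumulate per-link traffic (Eq. 5):
# route along the X dimension first, then along Y, in a `width`-wide 2D mesh.

def xy_path(src, dst, width):
    """Return the sequence of node ids visited from src to dst under XY routing."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    path = [src]
    x, y = sx, sy
    while x != dx:                      # X dimension first
        x += 1 if dx > x else -1
        path.append(y * width + x)
    while y != dy:                      # then Y dimension
        y += 1 if dy > y else -1
        path.append(y * width + x)
    return path

def add_link_load(load, src, dst, traffic, width):
    """Add `traffic` to every link l_mn on the XY path (Eq. 5 accumulation)."""
    p = xy_path(src, dst, width)
    for m, n in zip(p, p[1:]):
        load[(m, n)] = load.get((m, n), 0) + traffic

# Example: in a 4x4 mesh, traffic from node 0 to node 10 crosses links
# (0,1), (1,2), (2,6), (6,10).
loads = {}
add_link_load(loads, 0, 10, 5, 4)
print(loads)
```

Each accumulated ω_mn can then be checked against W_mn (Eq. 6) and converted to a communication time via ω_mn / W_mn (Eq. 7).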

Energy Budget Constraint
The energy dissipation P_energy is determined by multiplying the computation and communication loads by the energy dissipated per load unit, where e_cmp and e_com represent the energy coefficients of the computation and communication loads, respectively:

P_energy = e_cmp · Σ_k Σ_t (cmp)_k^t + e_com · Σ_(l_mn) ω_mn.   (8)

In 22 nm technology, e_cmp and e_com are approximately 45 picojoules (pJ) per computational operation (assuming 8 bits per operation for simplicity) and 65 pJ per communicated bit, respectively [45].
Since the mapping needs to satisfy the energy budget P_budget, we arrive at the energy constraint:

P_energy ≤ P_budget.   (9)
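A minimal sketch of the energy model and budget check (Eqs. 8 and 9), using the 22 nm coefficients quoted above; the operation and bit counts in the example are hypothetical.

```python
# Energy model sketch with the 22 nm coefficients quoted above:
# ~45 pJ per computational operation and ~65 pJ per communicated bit.

E_CMP = 45e-12   # J per operation
E_COM = 65e-12   # J per bit

def mapping_energy(total_ops, total_bits):
    """P_energy = e_cmp * computation load + e_com * communication load (Eq. 8)."""
    return E_CMP * total_ops + E_COM * total_bits

def within_budget(total_ops, total_bits, budget_j=50e-3):
    """Energy constraint (Eq. 9); 50 mJ matches the E3S budget used later."""
    return mapping_energy(total_ops, total_bits) <= budget_j

# 100M operations and 80M communicated bits:
e = mapping_energy(100e6, 80e6)
print(e)                            # ~0.0097 J
print(within_budget(100e6, 80e6))   # within the 50 mJ budget
```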

Objective Function
The objective is to minimize the total execution time (sum of computation and communication times) of the tasks to accelerate the applications:

minimize Σ_{k=1}^{K} E_k,   (10)

subject to the constraints given in Eqs. (4), (6), and (9). This objective function, along with the communication, computation, and energy constraints, forms an MILP model, as it contains both binary and continuous variables.
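For a tiny toy instance, the behavior of this objective can be checked by exhaustive enumeration, which is only feasible at this scale and stands in here for the MILP solver. The demands, capacities, and the simplified serialization of co-located tasks below are illustrative assumptions, not the paper's full model or the E3S values.

```python
from itertools import product

# Toy instance: 3 tasks, 2 nodes (all numbers are illustrative assumptions).
cmp_demand = [2e6, 4e6, 1e6]          # operations per task
pc = [1e9, 2e9]                        # node capacities (ops/s)
com = {(0, 1): 1e6}                    # bits exchanged between tasks 0 and 1
bandwidth = 1e9                        # bits/s on the single inter-node link

def exec_time(assign):
    """Simplified objective: tasks sharing a node run back-to-back, and
    communication is only paid when the endpoints sit on different nodes."""
    t_cmp = [0.0, 0.0]
    for task, node in enumerate(assign):
        t_cmp[node] += cmp_demand[task] / pc[node]
    t_com = sum(bits / bandwidth
                for (s, t), bits in com.items() if assign[s] != assign[t])
    return max(t_cmp) + t_com

# Enumerate every task-to-node assignment and keep the fastest one.
best = min(product(range(2), repeat=3), key=exec_time)
print(best, exec_time(best))
```

Here the optimum co-locates the two communicating tasks on the faster node, exactly the kind of trade-off the MILP objective captures at scale.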
We notice the long computation time required for an LP solver to discover an optimal solution. For example, for the mapping of an application with 28 tasks onto a 36-core NoC, it takes hours for LP solvers, such as IBM CPLEX [15] and the mixed-integer LP solver (intlinprog) in Matlab [46], to find the optimal solution. When the number of applications and/or the NoC size increases, an LP solver cannot deliver a solution within a reasonable time period for use in real-time systems. Simulated annealing (SA) and genetic algorithm (GA) heuristics can produce near-optimal solutions much more quickly than LP solvers. As in the LP modeling, a mesh NoC-based MCSoC is used to evaluate the proposed SA- and GA-based mappings. The XY-routing algorithm is used to find the distance between two communicating tasks mapped to different cores in a 2D-mesh NoC; in a 2D mesh, XY routing yields shortest paths, much like Dijkstra's algorithm [14]. Although a 2D-mesh NoC is used for the MILP model and the heuristics in this work, all these approaches can be adapted to other NoC topologies, such as ring and torus, by changing or modifying the routing algorithm.

Simulated Annealing for Mapping
Simulated annealing (SA) [47] has proven to be quite effective at approximating global optima for many NP-hard or NP-complete combinatorial problems, such as mapping. Due to the high time complexity of the mapping problem, an SA technique has been proposed to quickly solve the application-architecture mapping problem in MCSoC. The proposed SA can address the heterogeneity of applications and systems and can handle the mapping of a large number of applications and tasks in large-scale systems. The pseudocode of the proposed SA for task mapping onto NoC-based multicore systems is shown in Algorithm 1. The technique starts with an initial temperature T = 10000, an α value of 0.95, the cooling function T = T · α, an ϵ value of 0.5, and an initial mapping solution. The values of T, α, and ϵ are determined empirically (based on the best performance). The initial mapping is evaluated first. The heuristic executes until a predetermined stopping criterion is met: either the temperature drops below 10^−20 or the mapping remains the same for 250 iterations. At each iteration, the current mapping is perturbed to produce a new mapping, and the new mapping is compared to the current one. The new mapping solution needs to meet the constraints in Eqs.
(4), (6), and (9) before the candidate is considered for acceptance. If the new mapping solution has a better value of the execution time (i.e., the fitness function being optimized) of the mapped application(s), then the new mapping becomes the current mapping solution. When the new mapping is worse than the current mapping, the solution is accepted if ϵ is smaller than e^(−|score_current − score_new|/T), where score_new and score_current are the new and current values returned by the evaluation function. Accepting worse solutions helps explore a broader search space, potentially escaping local optima and discovering better solutions that may lie further away in the search space. This strategy of accepting both better and worse solutions is similar to the exploration and exploitation strategies in reinforcement learning (RL). Exploitation in SA involves accepting better solutions with high probability, while exploration is achieved by accepting worse solutions with a certain probability to avoid local optima. The exploration-exploitation trade-off in SA is controlled by the cooling schedule (e^(−|score_current − score_new|/T)). At higher temperatures, SA is more explorative (worse solutions are more likely to be accepted). The probability of accepting worse solutions gradually reduces over time (as the temperature decreases). As the optimization process progresses and the temperature parameter decreases (during the annealing process), SA becomes more likely to accept only better solutions, aiming to converge toward an optimal or near-optimal solution. This feature of SA mimics a machine learning or reinforcement learning agent that becomes smarter with more training.
The evaluation function calculates a score for a mapping using the computation score and communication score.Communication scores are calculated by considering the communication times between the communicating tasks in NoC.Computation scores are calculated by considering the computation times of the tasks at the corresponding mapped nodes.The return from the evaluation function (in SA) is the inverse of the summed computation and communication times (execution time).So, a higher score (lower execution time) means better performance.
Algorithm 1 Simulated Annealing Heuristic for Task-to-Node Assignment in MCSoC
1: Input: Application Task Graph, Topology Graph
2: Output: Task Placement on Multicore Systems On-Chip
3: Step 0: select the values of the initial temperature (T), ϵ, and α parameters;
4: Step 1: select a random mapping (assignment of tasks onto nodes);
5: Step 2: evaluate the current mapping and return its score, score_current;
6: Step 3: generate a new mapping, evaluate it, and return its score, score_new;
7: Step 4: accept the new mapping as the current mapping (a) if the new mapping is better (score_new ≥ score_current) or (b) if ϵ is smaller than e^(−|score_current − score_new|/T);
8: Step 5: reduce the temperature according to the formula T = T · α;
9: Step 6: stop if the stopping criterion is met (T ≤ 10^−20 or the mapping remains the same for 250 iterations); otherwise go to Step 3.
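The steps of Algorithm 1 can be sketched as follows. The acceptance rule, cooling schedule, and parameter values mirror the description above, while the toy evaluation and neighbor functions are illustrative stand-ins for the paper's execution-time-based score and mapping perturbation.

```python
import math
import random

def simulated_annealing(evaluate, neighbor, initial,
                        T=10000.0, alpha=0.95, eps=0.5,
                        T_min=1e-20, patience=250):
    """SA skeleton following Algorithm 1; `evaluate` returns a score where
    higher is better (the paper uses the inverse of the execution time)."""
    current, score_cur = initial, evaluate(initial)
    best, score_best = current, score_cur
    unchanged = 0
    while T > T_min and unchanged < patience:
        cand = neighbor(current)
        score_new = evaluate(cand)
        # Accept better mappings, or worse ones with probability e^(-|diff|/T).
        if score_new >= score_cur or eps < math.exp(-abs(score_cur - score_new) / T):
            current, score_cur = cand, score_new
            unchanged = 0
        else:
            unchanged += 1
        if score_cur > score_best:
            best, score_best = current, score_cur
        T *= alpha                      # cooling schedule T = T * alpha
    return best

# Toy usage: map 4 tasks onto 4 nodes, preferring task i on node i
# (the cost function and instance are illustrative assumptions).
random.seed(0)
def evaluate(m): return -sum(abs(t - n) for t, n in enumerate(m))
def neighbor(m):
    m = list(m); i, j = random.sample(range(len(m)), 2)
    m[i], m[j] = m[j], m[i]             # swap two tasks' node assignments
    return m
print(simulated_annealing(evaluate, neighbor, [3, 2, 1, 0]))
```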

Genetic Algorithm for Mapping
Genetic algorithms (GAs) follow evolutionary processes [48][49][50] to search large solution spaces. The GA approach of combining and evolving previous solutions can produce efficient solutions that may not be obvious and would be difficult to arrive at intuitively. In a classical GA, a solution is encoded as a chromosome. In the proposed GA for mapping, a chromosome represents a mapping of tasks onto nodes, and a population consists of multiple chromosomes, i.e., multiple mappings of (multiple) tasks onto nodes. The algorithm begins by encoding a set of possible mapping solutions as chromosomes. The entire set of mapping solutions (chromosomes) is called a population (a list of mapping solutions). Like SA, the proposed GA for mapping is an iterative improvement algorithm: the population of mapping solutions is iteratively updated toward better solutions until a stopping criterion is met. The iteration starts with a population of randomly generated individuals (mappings). The population in each iteration is called a generation.
In each generation, the performance score (inverse of execution time) of every individual in the population is evaluated. As in simulated annealing, the performance score is calculated from computation and communication scores. Mappings with lower execution times are stochastically selected from the current population, and each selected mapping is modified (recombined and possibly randomly mutated) to form a new mapping. The new candidate solutions are then used in the next iteration of the algorithm. A GA typically terminates either when a maximum number of generations has been produced or when a satisfactory execution time has been reached for the population; in this paper, we terminate when the maximum number of generations is reached. As in the proposed SA algorithm, a new mapping solution in the proposed GA must also meet the constraints in Eqs. (4), (6), and (9) before the candidate is considered for acceptance. The pseudocode of the proposed GA for mapping onto multicore systems is shown in Algorithm 2.
Algorithm 2 Genetic Algorithm Heuristic for Task-to-Node Assignment in MCSoC
1: Input: Application Task Graph, Topology Graph
2: Output: Task Placement on Multicore Systems On-Chip
3: Step 0: select the values of the crossover probability, mutation rate, swap probability, stop generation count, and initial population size parameters;
4: Step 1: generate the initial number of random mappings (assignments of tasks onto nodes);
5: Step 2: evaluate the initial mappings and add their performance scores score_current to a list;
6: Step 3:
7: while (current generation count ≤ stop generation count) do:
8:   Step 4: perform selection using linear ranking selection and roulette-wheel sampling;
9:   Step 5: perform crossover using uniform crossover;
10:  Step 6: perform mutation to generate a mapping;
11:  Step 7: evaluate the mapping and add the new score score_new to the list;
12: end while
13: Step 8: output the best mapping solution from the list.
The four phases of the genetic algorithm for a mapping solution, namely selection, crossover, mutation, and evaluation, are described below. One cycle of selection, crossover, mutation, and evaluation is called a generation.

Evaluation
The evaluation function calculates a score for a mapping from its computation and communication times. A higher performance score returned by the evaluation indicates a lower execution time. The evaluation function can also return invalid scores: a score of 0 indicates an invalid mapping that violates the computation capacity, communication capacity, or energy budget constraints.

Selection
The selection phase helps choose the best-suited mappings and discard the worse (higher execution time) mapping solutions. The main idea is to increase the chances of better mapping solutions being preserved into the next generation. The selection phase consists of two sub-phases: selection and sampling [51]. The linear ranking selection algorithm is used as the selection algorithm [52]. It sorts the mappings in descending order of performance score, so the mapping with the highest score is at the beginning of the sorted list and the one with the lowest score at the end. The selection algorithm then assigns a selection probability to each mapping in the current population based on its ranking in the sorted list; the selection probability is a linear function of the mapping's rank. The selection probability of the i-th mapping, Prob_i, is:

Prob_i = (1 / pop_size) · (n_max − (n_max − n_min) · i / (pop_size − 1)),   (11)

where pop_size is the total number of mappings in the population, n_max is a constant representing the maximum probability (for the fittest mapping), n_min is a constant representing the minimum probability (for the least fit mapping), and i is the rank of the mapping in the sorted list, ranging from 0 to pop_size − 1. n_max and n_min are determined empirically and control the balance of exploration and exploitation in the GA. A higher n_max increases the selection pressure toward the best mappings, making them more likely to be selected; a smaller n_max reduces this pressure, allowing more exploration. Similarly, a higher n_min gives the worst mappings a better chance of being selected, promoting exploration, while a smaller n_min reduces their chances, allowing more exploitation. In our experiments, n_max and n_min are set to 1.1 and 0.9, respectively, to produce the best-performing solutions. The selection algorithm favors better mappings by assigning them higher probabilities. However, most solutions, including worse ones, retain some probability of selection for the next generation. This helps avoid premature convergence to local optima, leading to a good mapping solution.
The sampling algorithm then produces the mating pool by selecting individual mappings from the current mapping solutions. The roulette wheel selection algorithm is used for sampling. The algorithm spins a "pseudo" roulette wheel by randomly selecting a position. Each slice of the roulette wheel is a candidate mapping solution, and the size of a slice is proportional to the performance score returned by the evaluation function. Because better mapping solutions can occupy larger slices, they have a higher probability of being selected as winners of the roulette wheel spin. The chosen mapping corresponds to the slice where the wheel stops. The spinning process is repeated until the mating pool is filled with the desired number of selected mappings.
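The two sub-phases, linear ranking (Eq. 11) and roulette-wheel sampling, can be sketched as follows; the population size of 4 in the example is arbitrary, and the n_max = 1.1, n_min = 0.9 defaults match the values used in the experiments.

```python
import random

def ranking_probs(pop_size, n_max=1.1, n_min=0.9):
    """Linear ranking selection probabilities (Eq. 11); rank 0 is the fittest."""
    return [(n_max - (n_max - n_min) * i / (pop_size - 1)) / pop_size
            for i in range(pop_size)]

def roulette(items, probs, rng=random):
    """Spin the 'pseudo' roulette wheel once: slice sizes follow `probs`."""
    r = rng.random() * sum(probs)
    acc = 0.0
    for item, p in zip(items, probs):
        acc += p
        if r <= acc:
            return item
    return items[-1]   # guard against floating-point rounding at the wheel's end

probs = ranking_probs(4)
print(probs)        # fittest mapping gets 1.1/4 = 0.275, least fit 0.9/4 = 0.225
print(sum(probs))   # the probabilities sum to 1
```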

Crossover
A set of chromosome pairs is formed by randomly selecting mappings (chromosomes) from the current mappings. For each pair in this set, a random number r between 0 and 1 is generated; if r is less than or equal to the crossover rate, the pair is chosen for crossover. For each chosen pair, every task (gene) is exchanged between the two parents with a certain probability called the swapping probability. This method of crossover is called uniform crossover [53]. The crossover rate and swapping probability are empirically determined to produce the best-performing solutions and are set to 0.7 and 0.5, respectively, in this work. A higher crossover rate performs more crossovers and introduces more diversity into the mapping solutions, which helps in reaching a better mapping solution and in converging. Uniform crossover introduces a higher level of exploration in the search space than traditional crossover methods, such as single-point or multi-point crossover, as it can create new mappings with gene combinations not present in either parent. This exploration property can be beneficial for maintaining genetic diversity in the population and increasing the chances of discovering better solutions during optimization.
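A minimal sketch of the uniform crossover described above, with the stated crossover rate (0.7) and swapping probability (0.5); the parent mappings in the example are arbitrary.

```python
import random

def uniform_crossover(parent_a, parent_b, crossover_rate=0.7, swap_prob=0.5,
                      rng=random):
    """Uniform crossover: with probability `crossover_rate` the pair is crossed,
    and each task (gene) is then exchanged with probability `swap_prob`."""
    a, b = list(parent_a), list(parent_b)
    if rng.random() <= crossover_rate:
        for g in range(len(a)):
            if rng.random() <= swap_prob:
                a[g], b[g] = b[g], a[g]   # exchange this gene between parents
    return a, b

random.seed(0)
child_a, child_b = uniform_crossover([0, 0, 1, 1], [2, 2, 3, 3])
print(child_a, child_b)
```

Note the invariant: at every gene position, the two children together hold exactly the two parent values for that position, which is what distinguishes uniform crossover from mutation.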

Mutation
Mutation is used to introduce diversity into the population (mappings) to prevent the GA from converging to local optima. Mutation swaps two randomly chosen tasks (assigned to cores) within a single mapping to create a new mapping. If a random number r is less than or equal to the mutation rate of 0.05, the mapping is chosen for mutation; two tasks (genes) are then selected randomly within the chosen mapping, and their values are swapped. The mutation rate is kept small, as a higher mutation rate (e.g., 0.5) could turn the proposed GA into a random search, reducing the quality of the mapping solution. The GA thus uses exploitation and exploration strategies, similar to reinforcement learning (RL), to iteratively improve solutions over generations: exploration is introduced by crossover and mutation, which diversify the population, while exploitation comes from selection, which favors the fittest mappings.
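A minimal sketch of the swap mutation described above; the example mapping and the forced mutation rate used for the demonstration are arbitrary (the experiments use 0.05).

```python
import random

def swap_mutation(mapping, mutation_rate=0.05, rng=random):
    """Swap two randomly chosen tasks' node assignments within one mapping."""
    mapping = list(mapping)
    if rng.random() <= mutation_rate:
        i, j = rng.sample(range(len(mapping)), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]
    return mapping

random.seed(3)
print(swap_mutation([0, 1, 2, 3], mutation_rate=1.0))  # rate forced to 1.0 for demo
```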

Simulation and Results
We implemented the linear programming (LP) optimization model described in Sec. 3 using the commercial LP solver IBM CPLEX [15], which searches for the optimal mapping solution. We integrated the CPLEX solver into Matlab and invoked it through the Matlab optimization toolbox [46]. We implemented both the proposed SA and GA for mapping evaluation using our in-house simulator, which includes the mathematical constraints of Sec. 3 (Eqs. (4), (6), and (9)). The mapping solutions were then fed to gem5/garnet to evaluate the latency and throughput performance of the NoCs. We simulated multicore systems ranging from 16 to 100 cores, where the cores are interconnected by a mesh-based NoC. For instance, a 10 × 10 2D-mesh NoC is used for the 100-core system, where each core has a router for communication with the other cores and memory in the MCSoC. Both large-scale and small-scale NoCs have been simulated to show the robustness of the mapping algorithms. The performance score of a mapping solution is calculated as the inverse of the sum of the computation and communication times of the tasks mapped onto the system.
Real-system experiments were conducted on the gem5 [18] and Garnet [17] platforms to evaluate the performance of the NoCs in terms of end-to-end latency and network throughput. The mapping solutions from the linear programming solver and from the in-house simulator for the SA and GA heuristics are fed to the Garnet [17] on-chip networks in the gem5 platform [18] for latency and throughput evaluation in real systems. The gem5/garnet configurations for the 64-core 2D-mesh NoC are shown in the configuration table [54]. A 64-core NoC and a 16-core NoC are simulated in gem5, as gem5 only allows square mesh NoCs with power-of-two core counts. Each core has an associated router for communication with the other cores and memory in garnet. The injection rate is set to 0.3, ensuring a high packet injection rate to test the robustness of the mapping solutions, when comparing the MILP, SA, and GA solutions. Furthermore, the latency and throughput performance of all the solutions is evaluated at injection rates from 0.05 to 0.3 to demonstrate the robustness of the algorithms.
The proposed MILP model and the SA and GA algorithms were evaluated using the E3S benchmark suite [19]. E3S comprises applications in the Consumer (C), Auto-Industry (A), Networking (N), Telecom (T), and Office-Automation (O) domains. The Consumer application has two sub-applications with 12 tasks; Auto-Industry has four sub-applications with 24 tasks; Networking has four sub-applications with 13 tasks; Telecom has eight sub-applications with 28 tasks; and Office-Automation is a single application with 5 tasks. The energy budget constraint is set to 50 mJ (millijoules) for all applications in the E3S benchmarks.
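For reference, the benchmark composition described above can be captured as a small lookup structure; the field names are illustrative bookkeeping, not part of the E3S suite itself.

```python
# E3S benchmark composition as described in the text (illustrative only).
E3S_APPS = {
    "consumer":          {"sub_apps": 2, "tasks": 12},
    "auto_industry":     {"sub_apps": 4, "tasks": 24},
    "networking":        {"sub_apps": 4, "tasks": 13},
    "telecom":           {"sub_apps": 8, "tasks": 28},
    "office_automation": {"sub_apps": 1, "tasks": 5},
}
ENERGY_BUDGET_MJ = 50  # per-application energy budget constraint (mJ)

total_tasks = sum(a["tasks"] for a in E3S_APPS.values())  # 82 tasks overall
```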
The energy consumption of the proposed MILP model and the SA and GA algorithms was evaluated on a 64-core and a 16-core 2D-mesh NoC for all applications, as shown in Figs. 3a and 3b. Simulation results for the E3S benchmarks on the 64-core 2D-mesh NoC suggest that the energy dissipation of the SA and GA solutions is within 5-15% of the optimal or near-optimal MILP solutions. This shows the suitability of SA and GA for high-performance mapping in NoC-based multi/many-core systems. However, on the 16-core NoC, the energy dissipation of the MILP solutions is 50-90% better than that of the SA and GA solutions. These results indicate that SA and GA perform very close to the MILP solutions when enough resources are available, but can struggle when computation and communication resources are limited, as in small-scale 16-core NoCs. The results also suggest that the energy consumption of the GA is, on average, only 2-6% higher than that of the SA for both the 64-core and 16-core NoCs. The faster execution of the applications also contributes indirectly to the lower energy consumption in all the proposed approaches.
The execution time of all applications on NoCs with 64 cores and 16 cores is shown in Figs. 4a and 4b. Execution time for all the algorithms is in the range of milliseconds for all applications, demonstrating the efficiency of the algorithms. Execution time decreases significantly (by 100-500%) as NoC size increases from 16 to 64 cores for all applications, with very few exceptions. This shows the scalability of the MILP model and of the SA and GA algorithms, as they properly exploit the increased computation and communication resources that come with larger NoCs. The execution time of all applications for NoC sizes ranging from 4 × 4 to 10 × 10 is also reported to show the scalability of the SA and GA algorithms; results for MILP are not included, as MILP takes a very long time to generate solutions for 100 cores. We again observe that execution time decreases with increasing NoC size for all applications, with a few exceptions. The SA and GA algorithms map communicating tasks as close together as possible to reduce communication time, despite the possibility of increased communication distance between mapped tasks as NoC size grows. The performance of the SA and GA algorithms differs across applications: SA performs better for the consumer, networking, and auto-industry applications, while GA performs better for the telecom and office-automation applications. For the consumer application, as shown in Fig. 5a, the execution time from SA is 10-50% (on average 38%) better than that from GA. For the auto-industry application, as shown in Fig.
5b, the execution time from the SA algorithm is 19% better (on average) than that from the GA, with one exception: the GA performs significantly better than the SA for the auto-industry application on a 49-core NoC. For the networking application, as shown in Fig. 6a, the execution time of the SA algorithm is 25-90% better than that of the GA, with the advantage of SA growing to 90% on larger NoCs. For the telecom application, as shown in Fig. 6b, the execution time of the SA algorithm is 4% worse than that of the GA. For the office-automation application, as shown in Fig. 7, the execution time of the GA algorithm is, on average, 5% better than that of the SA. Along with its better performance, the runtime of the SA is lower (better) than that of the GA; the higher time complexity of the GA mainly results from the multiple stages (selection, crossover, and mutation) involved in generating each solution. We also evaluated the proposed SA and GA at different temperatures and crossover rates, respectively, on a 6 × 6 NoC. An initial temperature of T = 10000 is chosen for the SA, as it provides the best solution; as shown in Fig. 8a, the quality of the SA solutions peaks at T = 10000. The telecom application is chosen here because it has the largest number of tasks among the E3S benchmarks. As shown in Fig. 8b, a crossover rate of 0.7 provides a good mapping solution. Although a crossover rate of 0.4 provides the best mapping, we select 0.7 because a higher crossover rate provides more diversity in the mapping population, leading to better mapping and convergence.
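The GA variation step discussed above can be sketched on task-to-core assignment vectors. The crossover rate of 0.7 follows the setting chosen in the text; the single-point crossover operator, the mutation rate, and all names here are illustrative assumptions, not the paper's exact implementation.

```python
import random

# An individual maps task i to core individual[i]; several tasks may
# share a core. Crossover rate 0.7 follows the text; mutation rate is
# an illustrative assumption.
CROSSOVER_RATE = 0.7
MUTATION_RATE = 0.05

def crossover(parent_a, parent_b, rng):
    """Single-point crossover, applied with probability CROSSOVER_RATE."""
    if rng.random() >= CROSSOVER_RATE:
        return parent_a[:], parent_b[:]          # parents pass unchanged
    point = rng.randrange(1, len(parent_a))      # cut point in [1, n-1]
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(individual, num_cores, rng):
    """Reassign each task to a random core with probability MUTATION_RATE."""
    return [rng.randrange(num_cores) if rng.random() < MUTATION_RATE else c
            for c in individual]
```

Each child inherits every gene (task-to-core assignment) from one of its two parents, while mutation occasionally moves a task to a fresh core to keep the population diverse.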
The latency and throughput of the proposed MILP model and of the SA and GA algorithms were evaluated using a 64-core and a 16-core 2D-mesh NoC on the Garnet/gem5 platform. Simulation results for the E3S benchmarks on the 64-core 2D-mesh NoC suggest that the latency and throughput of the SA and GA solutions are within 5-15% of the optimal/near-optimal MILP solutions, as shown in Figs. 9a and 9b. This shows the suitability of SA and GA for high-performance mapping in NoC-based multi/many-core systems. The results also suggest that the latency and throughput of the GA are only 2-6% better (on average) than those of the SA; the GA performs slightly better for the auto-industry, office-automation, and telecom applications. On the 16-core 2D-mesh NoC, the latency and throughput of MILP, SA, and GA are very similar, as shown in Figs. 10a and 10b. This implies that performance does not change significantly in a small-scale NoC, as the models and algorithms can cope with the resource limitations of small-scale systems such as a 16-core NoC.
The proposed SA and GA algorithms avoid local optima by accepting worse solutions and by performing mutation/crossover, respectively, allowing them to approach the globally optimal mapping; this is why the SA and GA solutions are very close to the MILP solution. The major advantage of the proposed SA and GA for mapping is their ability to quickly produce a near-optimal mapping solution, which makes them suitable for real-time systems, unlike the solutions generated by LP solvers using the MILP model designed in Sec. 3. An LP solver explores the solution space exhaustively to obtain the optimal solution, which can take a long time and may not produce a solution within a finite time, especially for large-scale NoC-based MCSoC systems.
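The worse-solution acceptance rule mentioned above can be sketched with the standard Metropolis criterion. The initial temperature T = 10000 matches the setting used in the evaluation; the geometric cooling factor is an illustrative assumption.

```python
import math
import random

T0 = 10000.0     # initial temperature, as chosen in the evaluation
COOLING = 0.95   # geometric cooling factor (illustrative assumption)

def accept(delta_cost, temperature, rng):
    """Metropolis rule: always accept improvements; accept a worse
    mapping with probability exp(-delta_cost / temperature), which
    lets the search escape local optima while the temperature is high."""
    if delta_cost <= 0:          # new mapping is no worse
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

def cool(temperature):
    """One step of the assumed geometric cooling schedule."""
    return temperature * COOLING
```

As the temperature decays, the acceptance probability for worse mappings shrinks toward zero, so the search gradually shifts from exploration to refinement.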
The throughput of all algorithms for all applications increases with the injection rate of packets or flits into the NoC, as shown in Figs. 11b, 12b, and 13b for the 64-core NoC: as more packets are injected, more packets are delivered to their destinations. This shows the robustness of the mapping solutions. However, as more flits are injected into the NoC, they also suffer more delay due to resource limitations such as router buffer and link capacity; this is why latency increases with the injection rate, as shown in Figs. 11a, 12a, and 13a for the 64-core NoC.

Conclusion
The application mapping problem has been formulated as a mixed-integer linear programming (MILP) model that considers the computation and communication capacity and energy budget constraints of the NoC-based MCSoC, as well as the execution time requirements and data dependencies between tasks in applications. The MILP model aims to accelerate applications by reducing the computation and communication times of running tasks. Simulated annealing (SA) and genetic algorithm (GA) heuristics have been proposed to optimize the application-architecture mapping and accelerate applications, considering the capacity and budget constraints of the NoC-based MCSoC systems. Simulation results showed that both SA and GA perform very close to the optimal solutions generated by the MILP model, and that SA generally outperforms GA. The energy consumption of SA and GA is very close to the optimal MILP solution on a 64-core NoC, while the energy differences are larger on a 16-core NoC. As the NoC grows from 16 to 100 cores, the proposed MILP model and the SA and GA algorithms accelerate the applications, as demonstrated in the throughput results; this shows the scalability of the proposed MILP model and heuristics.

Fig. 3: Total Energy Comparison for MILP, SA and GA on different NoC sizes under E3S benchmarks.

Fig. 4: Execution Time Comparison for MILP, SA and GA on different NoC sizes under E3S benchmarks.

Fig. 5: Execution time comparison at different NoC sizes for different applications under E3S benchmark.

Fig. 6: Execution time comparison at different NoC sizes for different applications under E3S benchmark.

Fig. 7: Execution time comparison at different NoC sizes for Office Automation application under E3S benchmark.

Fig. 8: Execution time of SA and GA for Telecom application for different parameter settings.

Fig. 11: Latency and Throughput comparison for MILP for different injection rates under E3S benchmarks.

Fig. 12: Latency and Throughput comparison for SA for different injection rates under E3S benchmarks.

Fig. 13: Latency and Throughput comparison for GA for different injection rates under E3S benchmarks.
The model includes the computation and communication capacity and energy budget constraints of the MCSoC, and also incorporates the computation and communication demands of the applications. Mapping tasks of multiple applications is addressed in the MILP model.
• Simulated Annealing and Genetic Algorithms for Mapping: SA and GA heuristics are proposed to map tasks onto the NoC-based MCSoC with the goal of reducing the execution times of running tasks to accelerate applications. The energy and performance constraints of the proposed MILP model are integrated into SA and GA for mapping in energy- and computation/communication-capacity-constrained NoC-based MCSoC systems.
• Evaluation on Real Benchmarks and Platform:

Table 1: gem5 and Garnet configuration for the 64-core NoC.