Network accelerator for parallel discrete event simulations

The simulation time of parallel and distributed discrete event simulations is the heartbeat of as-fast-as-possible execution schemes. Agreeing upon a global simulation time and distributing it to the simulation processes can be improved by exploiting or redesigning the network hardware. In this paper, we present such an approach that offloads simulation time calculations to network switches in order to speed up the steps in which time advance requests are made and time advance grants are awaited. By reducing the waiting time for time advancement, we aim to improve the overall performance of parallel simulations. The measurements from the FPGA-based hardware setup and the results from our network simulations show that overall performance can be improved when time management calculations are offloaded to the network switches. Additionally, the transient message problem is solved within the network by not allowing the time control messages to bypass the time-dependent events. The network acceleration of region-based event distribution is also studied, and offloading the region matching tasks to the network switches is found to be feasible for reducing the costs of node-based calculations, especially for fast-moving regions. In this study, we consider the High Level Architecture (HLA) as the simulation infrastructure and the fat tree topology for high-performance networking.



Introduction
The High Level Architecture (HLA) is an IEEE standard for parallel and distributed simulation systems with a published framework and rules [1] and federate interface specifications [2]. Time Management (TM) and Data Distribution Management (DDM) services are defined under the federate interface specifications. The TM service is responsible for message ordering in the time domain. The DDM service is responsible for implementing effective data exchange algorithms between federates, including matching, message filtering, and multicast group planning. The TM and DDM services are the dominating factors for the network traffic and overall performance of a simulation.
Initially designed for integrating different simulators, the HLA found an alternative use in parallelizing a single simulation. For instance, a production line planning simulation that verifies required product quantities can be sped up by the HLA [3]. On the integration side, co-simulating the analog, digital, and RF (radio frequency) components of a System on Chip (SoC) IC (integrated circuit) is possible since the HLA framework has well-defined interfaces [4] and allows for the simple integration and testing of a variety of simulators and models. Additionally, hybrid co-simulation of MATLAB Simulink continuous-time designs and discrete event designs is also possible [5]. Simulations that are impossible to run in a single simulation environment can be easily integrated with the HLA. A smart grid application for cyber-physical energy system simulation is based on the HLA to co-simulate different simulation environments [6]. A continuous time-domain power system application and a discrete event application of internet and communication technologies (ICT) are co-simulated together in a real-time environment on the HLA. An alternative would be to co-simulate the power system and the ICT using system-in-the-loop (SITL). Although SITL provides real-time hardware-in-the-loop (HIL) tests, the HLA supports different technologies, models, and algorithms to improve user co-simulation capabilities [7]. Whether in a parallel simulation or a co-simulation application, simulation processes interact with each other by exchanging events. Those events are usually time-stamped and time-ordered for precise replication of the real events of the simulated systems. Parallel and distributed discrete event simulations should keep track of simulation time for correct inter-process interactions. In this paper, we propose a hardware acceleration solution that improves some crucial TM and DDM services of the HLA.

Related work
The latency of HLA simulations can be improved by detecting and fixing network load imbalance [8] [9] [10] [11]. The HLA services, such as TM and DDM, can be improved by defining multiple federation servers and grouping federates under those servers [12]. Collective communication operations can also be used to implement the TM service's calculation of the greatest available logical time (GALT). This approach first calls the barrier collective communication operation to synchronize all nodes, and then the reduce-all collective communication operation to complete the GALT operation [13]. Field programmable gate arrays (FPGAs) have been employed to enhance collective communication operations [14,15]. With such an approach, the speed of the reduce-all operation increased by almost ten times, cutting the calculation durations for Wilson fermions and the Schrödinger function by 40%. It is also possible to implement collective communication functions on an ASIC (Application-Specific Integrated Circuit), as in the Mellanox SwitchIB-2 Ethernet switch. The barrier and reduce-all collective communication operations are accelerated by factors of 7.1 and 10, respectively. On a fat tree network, tests were run with 5120 nodes and a 32 KB message size [16,17].
Distributed interactive simulation (DIS) uses a broadcast collective communication algorithm to distribute data, but broadcast is not appropriate for large-scale distributed simulations. The HLA uses the DDM service instead of broadcast to utilize network resources more efficiently. The DDM service has five main approaches to match and filter publish and subscribe requests: class-based, region-based, grid-based, hybrid, and sort-based [18]. The class-based approach defines object groups as a class. A federate can subscribe to object class attribute values; in that case, the federate receives all subscribed attribute values of the objects of that class. A parallel implementation method can improve the latency of the class-based approach by running on shared-memory multiprocessor machines [19]. The region-based approach mandates region definitions on a routing space (RS) for subscribing and publishing federates. If a subscribing federate's region overlaps with a publishing federate's region, a match is found, and the subscribing federate receives attribute updates from the object associated with the publishing federate. The region-based approach is essentially a brute-force approach due to its intensive computation.
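The region-based matching described above reduces to pairwise intersection tests over axis-aligned regions. The sketch below illustrates the brute-force pass over every subscription; the function names and the (xmin, ymin, xmax, ymax) tuple layout are illustrative assumptions, not the HLA API.

```python
def regions_overlap(a, b):
    """True if two axis-aligned regions (xmin, ymin, xmax, ymax) intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def match_subscribers(publish_region, subscriptions):
    """Brute-force pass over every subscription region -- the computation
    cost the region-based approach is criticized for."""
    return [fed for fed, region in subscriptions.items()
            if regions_overlap(publish_region, region)]

subs = {"fed1": (0, 0, 10, 10), "fed2": (20, 20, 30, 30)}
print(match_subscribers((5, 5, 15, 15), subs))  # ['fed1']
```

Every published update is checked against every subscription, which is why the cost grows with the product of publisher and subscriber region counts.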
The grid-based approach divides the RS into a grid of cells. Subscribing and publishing federate regions are mapped to the cells. If the subscribing and publishing federates share at least one cell, a match is found. Grid-based implementation methods are named constant, dynamic, and optimized dynamic. When the constant grid-based approach is used, the cell size is defined at the beginning of the simulation; with the dynamic method, the cell size is updated depending on the current publish and subscribe regions. Also, a multicast group belonging to a cell is created after a match is found, which decreases the number of irrelevant messages. If the publishing or subscribing federates know about the updates, they can also filter the irrelevant messages with the optimized dynamic method [20]. Optimum cell size selection also has a significant performance effect on the grid-based approach, and it has been a topic of interest in the literature [21,22]. The optimum cell size can also be a dynamic value updated during the simulation [23].
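The constant grid-based matching above can be sketched as follows: each region is mapped to the set of fixed-size cells it touches, and a match is declared when the publisher and subscriber share a cell. The cell size and tuple layout are illustrative assumptions; note the coarse match can report overlaps that the exact region test would reject.

```python
def cells_of(region, cell_size):
    """Set of (cx, cy) grid cells that a region (xmin, ymin, xmax, ymax) touches."""
    x0, y0, x1, y1 = region
    return {(cx, cy)
            for cx in range(int(x0 // cell_size), int(x1 // cell_size) + 1)
            for cy in range(int(y0 // cell_size), int(y1 // cell_size) + 1)}

def grid_match(pub_region, sub_region, cell_size=10):
    """Constant grid-based match: shared cell => coarse match."""
    return bool(cells_of(pub_region, cell_size) & cells_of(sub_region, cell_size))

# These regions do not actually overlap, but they share grid cell (0, 0)
# when cell_size=10 -- a false positive the hybrid approach later refines.
print(grid_match((0, 0, 5, 5), (8, 8, 12, 12)))  # True
```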
The region-based and grid-based approaches are combined in a hybrid approach. The hybrid approach aims to decrease the matching cost of the region-based approach and the list update and message costs of the grid-based approach. It first uses the constant grid-based approach to find coarse overlaps, and then uses the region-based approach to find fine overlaps [24].

The sort-based approach stores regions in a list and searches the list to find region overlaps. Region overlaps are also stored in a matrix [25]. The sort-based approach suppresses many useless checks by sorting, and it can be optimized in many ways [26]. Raczy's sorting-based algorithm has low latency when the ratio of the maximum range size to the upper bound of each dimension is high for any region on any dimension; for lower ratios, a sorting-based algorithm developed for dynamic large spatial environments has better latency [27]. There are also other, less common approaches, such as the agent-based approach, which sends mobile agents to the publishers to follow updates and inform the subscriber. Whenever a publisher updates a related object attribute, the agents catch the message and filter it; as a result, the subscriber receives only relevant messages [28]. Due to the high computation cost of the matching or list update algorithms, offloading those processes to another system with high processing capacity is also preferred. The computation cost of the region-based and grid-based approaches can be improved by offloading those processes to CUDA (Compute Unified Device Architecture)-based GPUs (Graphics Processing Units) [29] or network processors [30].
The performance of TM and DDM can also be enhanced simultaneously by creating an algorithm that allows for contention-free communication over a torus network [31]. By altering communication patterns and reducing the communication step size, the communication cost is reduced.
Similar to the GALT algorithm of HLA's TM service, the global virtual time (GVT) algorithm computes the global minimum time. For asynchronous simulations, after receiving a time advance request (TAR) message, a transient message is revealed if the received message's time is less than the recently advanced current simulation time. The simulation time is then rolled back, and after the message has been processed, the simulation time is advanced again. GVT algorithm performance has been improved for asynchronous communication [32].
Although the torus and mesh topologies offer scalability without routing congestion and low communication cost [33], message packets cannot be delivered within a few communication steps as they can in the fat tree topology. A three-tiered fat tree architecture with 48-port Ethernet switches allows 27,648 nodes to be connected to the network [34]. The network can be scaled up simply by increasing the number of ports of the Ethernet switches.
In Sect. 3, the GALT and data distribution algorithms are explained. The specifics of the Ethernet switch we designed, called the Intelligent Switch (IS), are also given, and the hardware block added to the conventional switch (CS) to make it intelligent is detailed. Section 4 presents the computation and communication (CC) cost functions for the IS and CS in the context of HLA's TM and DDM services. A Parallel Discrete Event Simulator (PDES) is also used to verify the calculated CC costs, and the performance of the IS is benchmarked against that of a CS under different test conditions. In Sect. 5, the conclusions are presented.

Offloading HLA's TM and DDM services to network switches
In this study, we modify Ethernet switches to run fat tree-based distributed algorithms for managing the simulation time and the event distribution of the simulation. Each switch, depending on its location in the fat tree topology, plays a role defined by the distributed algorithms presented in this section. The first algorithm is the basis for managing the global simulation time. In such synchronous simulations, events with time stamps smaller than the global simulation time cannot be generated, in order not to violate the time ordering of the simulation events. The GALT is the globally agreed limit for the time stamps of new events generated by the participating simulation processes (i.e., federates in HLA). The HLA uses publish-subscribe region information to distribute events to the subscribers that wait for those events. In the second algorithm, region-based event distribution is implemented on the Ethernet switches used for building the fat tree topology.
The GALT calculation algorithm is initiated whenever a time-regulating federate requests a time advancement by providing a new logical time plus a lookahead value. The logical time is the federate's current point on the High Level Architecture (HLA) time axis. Before granting this new logical time, it is compared against the times provided by the other federates, and if it is found to be acceptable, the new logical time is granted to the requesting federate. A GALT calculation example is presented in Fig. 1, where the GALT value is equal to the minimum of the T_i + L_i (0 < i < 7) values. In this example, only the third and the fifth federates get time advance grants; the others must wait until the GALT value reaches their requested times.
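The GALT rule of Fig. 1 can be sketched in a few lines: each time-regulating federate i requests an advance to logical time T_i with lookahead L_i, GALT is the minimum of T_i + L_i over all federates, and a request is granted only when its requested time does not exceed that minimum. The dictionary layout below is an illustrative assumption.

```python
def galt(requests):
    """requests: {federate: (T, lookahead)} -> global GALT value,
    i.e. min over all federates of T + lookahead."""
    return min(t + la for t, la in requests.values())

def granted(requests):
    """Federates whose requested time does not exceed the GALT value."""
    g = galt(requests)
    return [f for f, (t, _la) in requests.items() if t <= g]

reqs = {"f1": (50, 5), "f2": (40, 2), "f3": (30, 10), "f4": (60, 1)}
print(galt(reqs), granted(reqs))  # 40 ['f2', 'f3']
```

As in the example of Fig. 1, only some federates receive grants; the rest remain blocked until a later GALT computation reaches their requested times.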
The fat tree topology is an enhanced tree topology that provides rich connectivity at the root nodes. The switches used at the root are the core switches, the intermediate nodes are the aggregate switches, and the leaf nodes are the edge switches (Fig. 2). The existence of multiple core switches provides alternative paths for distributing the network traffic load (Fig. 3).

GALT and region-based event distribution algorithms
Core switches act as decision points in our distributed GALT calculation algorithm (Algorithm 1). The aggregate and the edge switches basically calculate partial GALT values for their subtrees, and if there is a change in a partial GALT value, it is sent to every uplink to cover every possible network path upwards. This approach ensures that no transient event message exists, since the event messages are followed by the messages containing partial GALT values (Fig. 4a). In the opposite direction, new GALT values arrive at both the aggregate and the edge switches from the uplinks. Before passing a newly calculated GALT value to the downlinks, each switch waits for the same value to be received from all its uplinks (Fig. 4b). This specific condition is set to handle multiple ongoing GALT calculations, and it also ensures that event messages using alternative network paths cannot overtake the GALT messages they are supposed to follow.
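The per-switch behavior above can be sketched as two event handlers: a downlink update recomputes the subtree's partial GALT and reports it upward only on change, while a downward GALT value is released to the downlinks only after all uplinks have delivered the same value. This class is a behavioral sketch of Algorithm 1, not the FPGA implementation; all names are illustrative.

```python
class Switch:
    """Sketch of one aggregate/edge switch in the distributed GALT algorithm."""

    def __init__(self, n_uplinks):
        self.n_uplinks = n_uplinks
        self.downlink_values = {}   # downlink id -> latest T + lookahead seen
        self.pending = {}           # candidate GALT -> set of uplinks that sent it

    def on_downlink_update(self, link, value):
        """Return the new partial GALT to forward on every uplink,
        or None if the subtree minimum did not change."""
        old = min(self.downlink_values.values(), default=None)
        self.downlink_values[link] = value
        new = min(self.downlink_values.values())
        return new if new != old else None

    def on_uplink_galt(self, uplink, value):
        """Return the GALT value to release downward once every uplink
        has delivered it; None while still waiting (so event messages on
        alternative paths cannot be overtaken)."""
        self.pending.setdefault(value, set()).add(uplink)
        if len(self.pending[value]) == self.n_uplinks:
            del self.pending[value]
            return value
        return None
```

A switch with two uplinks, for example, holds a GALT of 42 after the first uplink delivers it and releases it only when the second uplink agrees.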
The GALT calculation and the distribution of simulation events are implemented in coordination with each other to eliminate the transient message (event) problem. Since each Ethernet switch preserves the incoming order of network packets, when the GALT calculation result reaches a simulation process (federate), it is guaranteed that there is no transient simulation event left to be delivered. The region-based data distribution algorithm also relies on this mechanism but requires additional machinery for handling publish and subscribe regions. The subscriber declares its interest in receiving some simulation events by sending a subscribe object class attributes (SOCA) message to the run-time infrastructure (RTI) and, additionally, provides subscription region information along with the SOCA message. This region information is used for filtering out the events that are not relevant to the subscriber's interests. The expected benefit of this HLA service is to eliminate the distribution of events to irrelevant federates, so there is no need to discard those events at the destination federates.
The conventional approach is to calculate the region matchings locally at the publisher side. After the preparation of multicast vectors, the events can be multicast to the subscribing federates. Frequent changes of the publication/subscription regions make those computations too frequent and costly. A communication cost is also involved in broadcasting the subscription region changes to the publishers, and frequent changes in particular increase the number of costly broadcast messages. By offloading this task to the network switches, we completely eliminate the communication costs for informing the nodes and the computational costs for recalculating region intersections. The subscription messages are broadcast over every possible path to leave traces of the subscriptions (Fig. 5a). The event distributions are achieved by matching the subscription and publish regions on the switches. Each link of such switches is attached to a list of class subscriptions for instant matching with published events (the black dots in Fig. 5b).


Intelligent Ethernet switch
As a proof of concept, the FPGA implementation of the Ethernet switch, called the IS, was realized on a Xilinx KC705 board. The additional latencies due to the GALT and data distribution functionalities on the switch were measured and used in the cost functions to predict the performance gains of simulation applications. The test pattern generator was implemented on a Xilinx ZC702 board. These boards are connected to each other via a special cable that preserves signal integrity for 16 Serial Gigabit Media-Independent Interface (SGMII) differential pairs. The test setup is shown in Fig. 6.
The block design of the Intelligent Switch is given in Fig. 7. First, the Xilinx "SGMII" intellectual property (IP) core converts incoming Ethernet packets to MII (Media-Independent Interface) format at the input of the inbound logic. Then the Xilinx "Tri-Mode MAC" IP converts the incoming packet to Advanced eXtensible Interface Stream (AXIS) format in the RX MII block. The frame errors and overflows of the incoming packets are also checked in this block. The next block in the sequence parses the MAC (media access controller) address and the Time Management (TM) and Data Distribution Management (DDM) services-related data, and those data are sent to the following block to construct the control data. As the output of the inbound logic, the raw data is written to the Intelligent Switch memory, and the control data is sent to the control and TM/DDM Accelerator blocks in the forwarding core. The TM/DDM Accelerator block is responsible for maintaining the GALT values. Whenever a GALT value is changed by any port, a GALT calculation is initiated by the TM/DDM Accelerator. If the GALT value changes, the TM/DDM Accelerator informs the control block with control data, and the new GALT value is stored in the memory of the forwarding core. The TM/DDM Accelerator also receives control words for subscribe and publish messages. For a subscribe message, the subscriber's class/interaction id, attribute vector, and region coordinates are written to the forwarding core memory, and the control block is informed to broadcast the message. For a publish message, the packet is written to the forwarding core memory, and the result of the matching algorithm is also sent to the control block. Then, the control block either sends the packet to the matching port(s) or discards it. For each outgoing link, the forwarding core sends control and data to the outbound logic. The control data is extracted, and the packet is rebuilt using the MAC address and the TM and DDM service information. The AXIS format is converted to the MII format and then to the SGMII format at the output of the outbound logic. This process applies to all switches in the fat tree network.
The Intelligent Switch adds a 25-clock-cycle latency to an incoming packet for the GALT algorithm as a TM service. This value was measured on the KC705 board and also verified in a Hardware Description Language (HDL) simulator. The latency for a publish message (DDM service) was measured as 128 cycles on average in the HDL simulator; this value varies due to the emulated AXIS interface traffic. All those latency values are fed as parameters to the performance analysis and help us estimate the components of the cost functions. They were also used in a discrete event simulator for exploring the behavior of the GALT and DDM algorithms running on the switches of the fat tree network. The test simulation scenarios are also used as case studies for creating various network traffic patterns.

Analysis and simulations
The intelligent switches of a fat tree network collaborate with each other to achieve the GALT calculation and the distribution of region-based events. During this collaboration, each intelligent switch devotes small time intervals to the pieces of such distributed computations that do not exist in conventional switches. This makes an individual intelligent switch slower than an individual conventional switch. When the overall fat tree network is considered, however, the intelligent switches eliminate a substantial number of network packets and remove computations from the computational nodes, which suffer higher latencies due to task scheduling overheads. The cost functions defined in Sect. 4.1 are used for time modelling of both the GALT calculation and region-based event distribution under quiet network conditions. In Sects. 4.2 and 4.3, a problem-specific fat tree simulator and simulation results of the fat tree under different network traffic conditions are presented.

Modeling the cost of the distributed network calculations
The GALT calculation algorithm is basically a distributed reduction algorithm that finds the minimum of the values provided by the fat tree leaf nodes (computational nodes). Each switch joins the distributed algorithm to find the minimum value of its subtree; when a core switch finds a new minimum, this value is broadcast to every leaf node. For a three-layered fat tree, this means the minimum values travel over three links upwards and three links downwards, and during this travel exactly five switches are visited. The cost function (Eq. 1.1) models this behavior. The last two components of the cost function are the communication time spent on the links (six links) and the total computation time spent on the visited switches (five switches), respectively. The first component is the message preparation time on the sending computational node, and a further constant component is the processing time on the receiving computational node. The message length (L) and the link bandwidth are the other parameters for calculating the communication cost, and n_cp_galt is the average computation cost on each switch.
The cost functions of two other node-based GALT calculation algorithms are also presented to justify the benefits of our switch-based approach. The first is a broadcast-based approach in which each node sends its value to all the others (Eq. 1.2). This algorithm (BTAR, Broadcast-based Time Advance Request) can be considered for small network configurations, since each node receives messages from all the others through the single link that connects the node to its switch. As the configuration grows, the delay of receiving so many messages makes this approach impractical. The dominating factor in the cost function of this approach is the communication part, which increases linearly with the number of computational nodes, N = k^3/4, where k is the number of switch links used for constructing the fat tree. Upon receiving all messages, each node spends a constant amount of time to find the GALT value.
The second node-based GALT calculation algorithm is a tree-based algorithm that exploits the fat tree topology by grouping the computational nodes according to the switches they are connected to. In each switch group, one node receives the values from the others in the same group. With this approach, a single node holds the global minimum value after three stages, and then this value is broadcast to all the others. This node-based GALT calculation algorithm is named after the fat tree as FTAR (Fat tree-based Time Advance Request), and it offers a better communication cost than the BTAR algorithm for larger network configurations (Eq. 1.3).
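Since the closed-form cost functions (Eqs. 1.1-1.3) are not reproduced here, the sketch below models only the scaling behavior the text describes: BTAR's communication cost grows linearly with the node count N = k^3/4, while the switch-based method pays for a fixed six-link, five-switch round trip on a three-layer fat tree. All constants are assumed placeholders, not the paper's measured parameters.

```python
def nodes(k):
    """Computational node count of a fat tree built from k-port switches."""
    return k ** 3 // 4

def cost_btar(k, alpha=1.0, beta=0.01, msg_len=64):
    """BTAR: one link into each node serializes the N-1 incoming messages,
    so cost grows linearly with N (alpha = per-message overhead,
    beta = per-byte transfer time; placeholder values)."""
    n = nodes(k)
    return alpha + (n - 1) * (alpha + beta * msg_len)

def cost_switch_based(k, alpha=1.0, beta=0.01, msg_len=64, n_cp=0.5):
    """Switch-based GALT: constant six-link traversal plus computation on
    the five visited switches, independent of the node count."""
    return 2 * alpha + 6 * (alpha + beta * msg_len) + 5 * n_cp

for k in (4, 8, 16):
    print(k, nodes(k), cost_btar(k), cost_switch_based(k))
```

With any positive constants, BTAR's linear term dominates as k grows while the switch-based cost stays flat, which is the qualitative behavior Fig. 8 reports.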
The Intelligent and Conventional switches behave similarly when region-based event distribution is considered. The intelligent switch uses region information to match against subscription regions at each switch, while the conventional switch uses a pre-calculated multicast vector to decide how to replicate a packet and send it to the selected links of the switch. Both distribution methods are basically all-to-all multicast operations, and in the worst case they become all-to-all broadcasts. The dominating cost factor for the worst case is the transmission of messages from the edge switches to the computational nodes (Eq. 1.4).
The intelligent switch dynamically maintains the list of subscription regions and compares them with the publication regions of the incoming event messages. On the other hand, the conventional switch processes the multicast vector distributed along with the event messages. Those multicast vectors are prepared at the publishers' computational nodes using the declaration messages broadcast from the subscribers' computational nodes. The intelligent switch eliminates such broadcasts and vector calculations on the publishers' end.

Parallel discrete event simulator
A problem-specific parallel discrete event simulator was designed to explore the behavior of both the intelligent and conventional switches [35]. The switch-based GALT calculation is an embedded part of this simulator, while the node-based GALT calculations are implemented on top of it, as if an application were running on the hardware. The logical processes of this simulator are mapped directly onto Message Passing Interface (MPI) processes, and the simulation events are distributed through peer-to-peer MPI communication primitives. Each logical process is responsible for calculating its simulation time by using the time-related parameters measured from the FPGA-based experimental switch prototype. This purpose-built simulator helped us observe the behavior of each GALT calculation method and experiment with several different network traffic scenarios.

The comparison of the cost functions and simulation results
The switch-based (TAR), broadcast-based (BTAR), and fat tree-based (FTAR) GALT calculation methods are compared against each other for three different network sizes. Both the simulated and the calculated results are presented together for cross-verification, and the deviation between the simulated and the calculated values is not significant (Fig. 8). The switch-based and broadcast-based methods perform similarly for small network sizes (N = 16 computing nodes), but the broadcast-based method gets expensive for larger network sizes and does not scale well for very large networks. Both the switch-based and the fat tree-based methods scale well as the network sizes grow. The switch-based method is almost three times faster than the fat tree-based method for all sizes presented in Fig. 8.
The Intelligent switch can complete the region-based event distribution slightly faster than the node-based method using conventional switches, due to the high initial costs of calculating the region matchings before submitting events to the network in the conventional case (Fig. 9). This computational overhead arises every time the publication regions change; otherwise, both methods behave similarly. The intelligent switch can match the publish and subscribe regions on the fly efficiently by employing parallel comparators. The node-based method, on the other hand, can use the conventional switches only for multicasting the events. The multicast operation requires a precomputed multicast vector at the source of the multicast. For the calculation of such vectors, the subscribers must inform the publishers by broadcasting the subscription regions. The overhead associated with such broadcasts is not shown in Fig. 9. Frequent changes of subscription regions will trigger such broadcasts, and eventually the node-based method will perform far worse than the switch-based method employing the intelligent switch.

Test case scenarios
In this study, we consider two scenarios for evaluating the intelligent switch's performance against the conventional one. The first scenario is an adaptation from another study that simulates air traffic control [28]. The second scenario is a subproblem of the well-known n-body problem, called molecular dynamics, which is used for simulating molecules interacting in a three-dimensional space. The implementations of both problems have periods that start with calculations generating events, continue with simulation time advancement requests, and eventually, upon deciding on a global simulation time, end with an event consumption step. The global simulation time can be decided upon after calculating the GALT value; each simulation process (federate) first consumes all time-dependent events less than or equal to the requested simulation time advancement, and then the simulation time is set to the requested value. Time advance requests trigger the GALT calculation, and as a result of this calculation the simulation processes synchronize with each other without violating the simulation time. The scenarios in this section are used for generating message traffic for both event distribution and the GALT calculation.

Air traffic control scenario
In this scenario, there are airports and planes flying from a source to a destination by following a straight route [28]. There are two airports that monitor the air traffic through radars with defined ranges. When a plane is within the range of a radar, the airport can track that plane. To simulate the airports, the airport federates subscribe to the plane coordinate changes to receive the new coordinates at each simulation step. The radar at an airport can be modeled as an area of interest, which is a circular area whose center is the airport and whose radius is the range of the radar. Since rectangular shapes can be defined as publish-subscribe regions in HLA, a bounding box containing the radar range can be used as the subscription region for receiving coordinate change events of the planes (Fig. 10). Upon receiving such an event, the airport federate can make further calculations to decide whether the plane is really in range or not. If it is not, the received event is irrelevant and should be discarded. Similar to the airport federates, the federates simulating the planes dynamically use a small bounding box for each plane as its publish region at a given simulation time. Every time the new coordinate of a plane is to be published, the event containing this information is published with a region that places the plane at its center. The runtime infrastructure of HLA uses those publish and subscribe region intersections for distributing the coordinate change events to the airport federates.
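The subscription geometry of Fig. 10 can be sketched as follows: the circular radar range is wrapped in a square bounding box used as the HLA subscription region, and a delivered event is post-filtered with the exact circle test at the airport federate. Function names are illustrative.

```python
def radar_bbox(cx, cy, r):
    """Bounding box of a circular radar range, usable as an HLA
    rectangular subscription region."""
    return (cx - r, cy - r, cx + r, cy + r)

def in_bbox(box, x, y):
    """Coarse region-matching test performed by the RTI/switches."""
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def truly_in_range(cx, cy, r, x, y):
    """Exact filter the airport federate applies after delivery."""
    return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2

box = radar_bbox(0, 0, 10)
# A plane near the box corner matches the subscription region but is
# outside the circular radar range -- an irrelevant event to discard:
print(in_bbox(box, 9, 9), truly_in_range(0, 0, 10, 9, 9))  # True False
```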
The experiments employing the air traffic control scenario were conducted to compare the intelligent switch against the conventional switch. We tested three different sizes of fat tree network with three different message sizes (Table 1). As the message sizes and the network grow larger, the intelligent switch increasingly outperforms the conventional switch.

Molecular dynamics scenario
Molecular dynamics was selected as a representative scenario since it is widely used for simulating molecules interacting in a three-dimensional space. In this scenario, we investigate its behavior to evaluate the intelligent switch running simulation-specific tasks. The three-dimensional simulation space of the simulated molecules is mapped onto a two-dimensional space, and each cell of the three-dimensional space is assigned to a simulation federate that simulates the movement of the molecules within its assigned space (cell). Since this is a parallel and distributed simulation, each federate should exchange the molecules moving from one cell to a neighboring one, and at the end of each simulation time step the federates should request a simulation time advancement. We synthetically experimented with various exchange loads before advancing the simulation time at the end of each simulation step. The experiments were conducted for three different network sizes, as presented in Table 2. The performance of the simulation is better when the intelligent switch advances the time (GALT calculation) than when the time advance calculations are made on the computational nodes connected together with conventional switches. Significant performance improvements are observed especially when the event publication traffic is low. This shows that fine-grained simulation steps can be executed better when the time calculations are offloaded onto the intelligent switches. We can also observe that the intelligent switch approach scales well for larger network sizes.
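The cell decomposition and molecule exchange described above can be sketched as a per-step routine: each federate moves its molecules and identifies those that crossed a cell boundary and must migrate to a neighboring federate before the time advance request. The cell size, time step, and data layout are illustrative assumptions.

```python
def cell_of(pos, cell_size):
    """Grid cell index of a 3-D position."""
    return tuple(int(c // cell_size) for c in pos)

def step(molecules, cell_size=10.0, dt=1.0):
    """Advance molecules one time step and report required migrations.
    molecules: {mid: (pos, vel)} with pos/vel as (x, y, z) tuples.
    Returns (updated molecules, [(mid, destination cell), ...])."""
    moved, migrations = {}, []
    for mid, (pos, vel) in molecules.items():
        new_pos = tuple(p + v * dt for p, v in zip(pos, vel))
        moved[mid] = (new_pos, vel)
        if cell_of(new_pos, cell_size) != cell_of(pos, cell_size):
            migrations.append((mid, cell_of(new_pos, cell_size)))
    return moved, migrations

mols = {"m1": ((9.5, 1, 1), (1, 0, 0)), "m2": ((1, 1, 1), (0.5, 0, 0))}
moved, migs = step(mols)
print(migs)  # [('m1', (1, 0, 0))] -- m1 crossed into the neighboring cell
```

The migration list is what the federates exchange as events before each GALT-synchronized time advance, which is the traffic load varied synthetically in the experiments of Table 2.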

Conclusions
In this research, we developed a method for speeding up parallel and distributed simulations by offloading some simulation infrastructure components to the network switches used for constructing a fat tree network. The High Level Architecture (HLA) is taken as the simulation infrastructure, and the test scenarios are built upon it. We focused on two particular services that can be improved to increase the performance of simulations. The first service is the main part of the Time Management (TM) services, and it calculates the Greatest Available Logical Time (GALT) from the Time Advance Requests (TARs). Since events cannot be consumed by a subscriber federate until the GALT calculation is finished, the federates wait in an idle state. Offloading the GALT calculation to the network switches reduces these idle waits and increases the performance of parallel and distributed simulations. The case studies showed that significant performance improvements can be achieved in this way. The second service is from the Data Distribution Management (DDM) services, and it aims to reduce the distribution of irrelevant simulation events to the subscriber federates. This service requires both the subscriber and the publisher federates to provide a region (region of interest) in a multidimensional space, and the Runtime Infrastructure of HLA distributes the events to the subscribers whenever there is an intersection between publish and subscribe regions. By offloading the computation for matching the intersections to the network switches, the distribution of events can be sped up. We observed notable improvements in the overall performance when such computations were offloaded to the intelligent switches, since the computations made on nodes connected by conventional switches caused delays before events were submitted to the network.
As a side effect of offloading the GALT calculation and region-based event distribution to the network switches, it is guaranteed that there are no transient messages when a time advance request is granted. This can be considered an alternative approach to solving the transient message problem in parallel and distributed simulations.

Fig. 4 Switch logic for GALT algorithm

Fig. 5 Switch logic for the matching algorithms

Fig. 10 Air traffic control: plane routes and the radar ranges