GPU performance analysis for viscoacoustic wave equations using fast stencil computation from the symbolic specification

Seismic forward modeling is a computationally and data-intensive stage in the seismic processing workflow. By profiling the kernels of seismic forward modeling algorithms, it was observed that they need to access a wide variety of memory locations, in addition to the computational cost of performing floating-point operations for the numerical solution of wave equations. In this context, the Roofline model was used to analyze six representative computing kernels in seismic modeling on GPU environment to indicate bottlenecks in the performance and suggest improvements of these wave equation propagators. Based on this, six viscoacoustic equations were implemented using the Devito tool. Experimental data have shown that optimizations in increasing data reuse and decreasing off-chip memory traffic can significantly improve performance.


Introduction
Stencil computation is the essence of several applications in the high-performance computing area, such as computational fluid dynamics, image processing, and geophysical applications that involve solving wave equations. According to [1], stencil computations are often used in applications to solve partial differential equations in multidimensional grids. A typical 3D grid is iteratively traversed from a starting cell in stencil computation. Each grid cell is updated based on a set of coefficients and the values of its neighboring cells [2]. Regular and predictable memory access patterns characterize stencil computation; however, it has a low FLOP/byte ratio, precisely tuning the data layout and optimizing the memory accesses according to the different hardware architectures [3]. In the context of a wave equation solver, these are finite-difference (FD) schema time-marching kernels that update a data set in a grid at iteration n to get that of the next iteration.
The acoustic wave equation is the most common for seismic modeling. However, several other wave equations which better represent the subsurface can be used for seismic modeling. Viscoacoustic modeling provides an accurate image with the high resolution of exploratory targets due to the inclusion of characteristics such as amplitude attenuation and velocity dispersion in the propagation of seismic waves. In this work, the viscoacoustic equations based on the Maxwell, Kelvin-Voigt, and standard linear solid (SLS) rheological models are described, considering first-and second-order formulations. These equations are in the time-space domain, and the finite-difference method is suitable to solve them.
Our implemented FD kernel viscoacoustic equations were introduced into Devito [ [4][5][6], an open-source domain-specific language (DSL) tool for highly optimized FD kernel solutions. Devito is available as a Python language package and can be integrated with different packages, such as SymPy [7] and Numpy [8].
The methodology used includes using the Roofline model [9][10][11], which proposes to identify performance bottlenecks and be a guide to performance optimization. According to [11], the Roofline is a performance-oriented model centered around computational capabilities, memory bandwidth, and data locality. The data locality is commonly expressed as the arithmetic intensity (AI), the ratio of floating-point operations performed per data moved (FLOP/Byte).
The present work describes developing and implementing six viscoacoustic equations on the Devito tool. The primary purpose is to provide information about the bottlenecks and optimization strategies indicated by the computational analysis of the kernels of viscoacoustic equations. The computational aspects of viscoacoustic seismic modeling will be evaluated through the Roofline model. This paper proceeds as follows: Sect. 2 brings information on mathematical model to viscoacoustic wave equations. Section 3 details the implementations of our model over the parallel architectures. Section 4 presents our case studies and experimental results. Some conclusions and future work ideas close the paper.  The viscoacoustic equations based on the rheological models originate from the  stress-strain relation   where is the stress, is the deformation, and is the relaxation function. Moreover, and where v is the particle velocity and is the density. Equations (1), (2), and (3) are the main relations to derive the viscoacoustic equations based on Maxwell, Kelvin-Voigt, and SLS rheological models [12].

Equations based on the Maxwell model
The relaxation function for the Maxwell model is defined as where M U is the elasticity constant of the unrelaxed spring, H(t) is the Heaviside function and = ∕M U = 0 Q is the relaxation time, being the viscosity and Q the quality factor. Replacing (4) in (1), using (2) and (3), considering M U ≈ k as the dominant frequency, and adding the source term S, derived the first-order equation where is the bulk modulus, p = − is the pressure field, and 0 = 2 f 0 is the angular frequency, being f 0 the dominant frequency.
Taking the time derivative in (5), and replacing the second term in the first term, derived the second-order equation

Equations based on the Kelvin-Voigt model
The relaxation function for the Kelvin-Voigt model is defined as where M R is the elasticity constant of the relaxed spring, is the viscosity, and H(t) and (t) are the Heaviside and Dirac delta functions, respectively. Replacing (7) in (1), taking the time derivative, and using (2), we obtain Thus, considering = −p , the motion equation in (3), and adding the source term S, derived the first-order equation where , , are the bulk modulus, viscosity, and relaxation time, respectively. Taking the time derivative in (9), and replacing the second term in the first term, derived the second-order equation

Equations based on the SLS model
The relaxation function for the SLS model is defined as By taking the time derivative in (1), and replacing (11) in (1), is obtained the relationship between the pressure field and particle velocity (3) [13,14]: where = (x) the bulk modulus, p = − is the pressure field, H(t) is the Heaviside function, and S = S(x s , t) the source term at position x s . The = ∕ − 1 represents the magnitude of Q. The and are stress and strain relaxation parameters given by: Equation (12) is expensive to solve by numerical modeling because of the associated convolution operation. So, it is necessary the introduction of a memory variable r p [15], allowing us to derive the first-order equation For the second-order formulation, the derivative of time was taken in (12) and r p was introduced to remove the convolution term [16], resulting in

Methodology
This section presents the methodology with detailed explanation and useful insights.

Hardware and software setup
In the computational system was used the operational system Red Hat Enterprise Linux Server version 7.9 (Maipo), and all experiments performed in this work were run on a environment offering GPU node, where this node contains two 18-core Intel(R) Xeon(R) Gold 6240 CPU @2.60 GHz, 376 GB RAM, and 1 NVIDIA TESLA V100 SXM2 [17]. This architecture provides 80 streaming multiprocessors with 32 GB HBM2 memory. For the Roofline analysis, the nvprof and nsight systems tools [18,19] collect and visualize the data obtained from the execution of kernel. The tools collect a timeline of application-related activities on the CPU and GPU, including kernel execution and memory transfers.

Experimental setup
As input parameters, the three layers seismic model Fig. 1a, with grid spacing in x, y, and z, and border size of equal to 8 and 50 ms (m), respectively. We performed the experiment considering two different grid point models: the first with 100 3 points and the second with 500 3 points. The choice of the model is for computational purposes only. In this work, we do not intend to evaluate the geophysical side. We generated a seismic shot Fig. 1b with 4 s of record time.
In this experiment, the source is located in the center position of the model surface. The receivers are spread along a plane (x, y). Such receivers capture energy from various points, which improves the illumination of the target explored. In this context, the seismogram of a shot with three-dimensional modeling provides details by taking planes along the x or y axes and analyzing them separately.

Roofline implementation
For the implementation of the Roofline model, the performance analysis tool provides the data to feed the graph with information on AI and GFLOPs. First, it is necessary to perform the machine characterization to define the peak performance (GFLOP/s) and the bandwidth (GB/s), both attributes of the roofline graph. Based on that, it was used the Empirical Roofline Toolkit (ERT) [20], whose function is to characterize GPU-accelerated systems. According to the methodology adopted in [11], to calculate the AI of the kernels, as well as the performance in GFLOP/s, it is necessary to use the performance analysis tool to obtain three basic information: the kernel execution time, the number of floating-point operations performed and the number of bytes read and write at each memory level. From this: The Roofline plots that will be shown in the graphs are: the p kernel, p:r kernel, vx:vy:vz kernel and p sub-kernel. The experiment has as main objective to measure (16) AI = GFLOPs Bytes , GFLOP/s = GFLOPs Runtime .

Viscoacoustic wave equations implementation on Devito
This section presents the computational aspects of viscoacoustic seismic modeling and provides information about the bottlenecks and optimization strategies indicated by the computational analysis of the kernels of viscoacoustic equations.

Target application: seismic forward modeling workflow
Based on the seismic forward modeling application Fig. 2, it was had the seismic model definition and the acquisition geometry, which involves the source type and source peak frequency, as well as the source/receivers locations. Then the FD Kernel represents the function with the highest computational cost of the entire application. In the case of this application, for each viscoacoustic wave equation presented in Sect. 2 an FD Kernel has been implemented.

Case study: code generation of the first-order SLS viscoacoustic equation on Devito
In Algorithm 3, it is possible to visualize only a part of the automatically generated code, referring to the update of the particle velocity components of the first-order SLS viscoacoustic equation.

Optimizations
For the GPU flow, Devito presents two optimization levels (noop, advanced) for code generation. With noop, no performance optimizations are introduced. The advanced directive provides several flop-reducing and data locality optimization passes.
Cross-iteration redundancy elimination (CIRE) searches for redundancies across consecutive loop iterations. These are often induced by a mixture of nested, high-order, partial derivatives. The advanced mode has a code motion pass. In explicit PDE solvers, this is most commonly used to lift expensive time-invariant sub-expressions out of the inner loops. The advanced directive is used by default in Devito. However, the generated code still has optimization gaps and this work aimed to contribute to this aspect.
From Algorithm 3 analysis and the AI and performance in GFLOP/s information obtained, it is assumed that the FD Kernel presents a computationally-intensive process with a large number of memory accesses and that demands a high memory bandwidth. In this context, the FD Kernel operates in a DRAM bandwidth limit region. Thus, strategies that optimize memory access are analyzed to take advantage of all memory bandwidth and eliminate redundant memory accesses.
There is a feature available in the standard OpenACC directives [21], which is the addition of the tile clause to the acc loop directive. With the tile clause, it is possible to optimize the loop by operating smaller blocks to explore locality and data reuse [22]. Using the tile clause in Devito is optional and is disabled by default.
Tiling can improve parallel stencil applications in a lot of ways. A possible optimization, tiling partitions loop data and computations into tiles, thereby enabling the GPU to handle amounts of input data that exceed the capacity of its internal memory. Another possibility, tiling reorders loop nesting of the stencil, which can improve spatial and temporal locality within the tiles. And the most outstanding optimization would be, the partitioning combined with loop reordering potentially reduces the volume of communication between host memory and GPU, which reduces the memory bandwidth requirements for the application.

Experimental results
This section presents some results obtained with the usage of our proposed. Experimental results are shown and commented, with detailed explanation and useful insights.

First-order viscoacoustic equations
For each first-order viscoacoustic equation, we performed application profiling. Figure 3 represents an approximate average of the execution time of the application functions using the first-order viscoacoustic equations.
In the graphics of Figs. 4, 5, and 6 was noticed similarity, and the Roofline plots are located in a region where GFLOP/s performance is limited by memory bandwidth. From the tests performed, it was noted in Figs. 4b, 5b and 6b that the increase in the domain size results in the approximation between the Roofline plots and the memory bandwidth limit.
It is also important to note a better performance using the tile clause represented by OPTIMIZED KERNEL. It was understood that the optimization decreases the data movement between DRAM memory and device, which resulted in increased performance in GFLOP/s, Figs. 4b, 5b, and 6b.

Second-order viscoacoustic equation
For each second-order viscoacoustic equation, we performed application profiling. Figure 7 represents an approximate average of the execution time of the application functions using the second-order viscoacoustic equations.   In the graphics of Figs. 8, 9, and 10, it was noted that the Roofline plots are located in a region where GFLOP/s performance is limited by memory bandwidth. It is important to note that the performance increase with the use of the clause tile represented by the OPTIMIZED KERNEL was more significant for the secondorder viscoacoustic equations compared to first-order viscoacoustic equations.   Roofline plots for a 500×500×500 model Figure 11 shows the seismic forward modeling execution time results, using each of the viscoacoustic equations. The input parameters are the same previously used. A considerable time step was arbitrarily used to have a long propagation time for the long wave to increase the accuracy of the Roofline experiment.
With the data of Fig. 12 was observed that this optimization was due to the increase in global L1 memory read and store transactions in performance analysis tool, while there was a decrease in transactions of reading and storing DRAM memory in performance analysis tool. It is important to emphasize that L1 memory read and store transactions are faster when compared to DRAM memory.
Another approach is to analyze the cache hit ratio. The performance analysis tool provides information about the cache hit rate using the following metrics: • global_hit_rate: Hit rate for global load and store in unified L1 cache. • local_hit_rate: Hit rate for local loads and stores. • tex_cache_hit_rate: Unified cache hit rate. • l2_tex_hit_rate: Hit rate at L2 cache for all requests from texture cache.    In Fig. 13, it is possible to notice that the kernels included in the OPTIMIZED KERNEL present higher values for global_hit_rate and tex_cache_hit_rate. In addition, there was a reduction regarding l2_tex_hit_rate.

Conclusions
From the application profiling, it was observed that the Kelvin-Voigt viscoacoustic equation presented greater complexity due to a more significant amount of floatingpoint operations performed. In addition, the Kelvin-Voigt viscoacoustic equation showed a more substantial amount of memory read and storage operations. A more significant performance gain was noticed for the Kelvin-Voigt viscoacoustic equation, when compared to other viscoacoustic equations. This fact is explained based on the use of the tile clause, which provides an optimization aimed at reading and storing transactions from memory.
It was analyzed that the maximum GFLOP/s performance values obtained for the viscoacoustic wave equations FD kernels were approximately 1255 GFLOP/s. This fact shows that it is possible to increase the GFLOP/s performance since the maximum value achieved is far from the theoretical peak available in the GPU. For that, it is necessary to find more optimization alternatives. Based on the results obtained, it was understood that optimizations in decreasing data movement in memory are essential to increase performance.
Some related works [1,23] and [24] indicate strategies for optimizations that replace replicated global memory accesses with local memory accesses that substantially reduce the stencil computation execution time for any grid size. Other related works, such as [25] which aim at optimizations from the proper configuration of OpenACC directives for data management and [26] which investigates multi-GPU performance. Devito also supports distributed-memory parallelism via MPI, will also be prospected for future work the Roofline analysis of the viscoacoustic equations FD kernels execution in multi-GPU environments.