An efficient communication strategy for massively parallel computation in CFD

With the development of high-performance computers, it is necessary to develop efficient parallel algorithms in the field of computational fluid dynamics (CFD). In this study, a novel parallel communication strategy based on asynchronous and packaged communication is proposed. The strategy implements an aggregated communication process that requires only one communication per iteration step, significantly reducing the number of communications. The correctness and convergence of the novel strategy are demonstrated from both theoretical and experimental perspectives. Based on the real vehicle CHN-T model with 139 million cells, a detailed performance comparison and analysis of the novel strategy and the traditional strategy shows that the novel strategy has significant advantages in terms of scalability. Finally, strong and weak scalability tests are carried out for the CHN-T model. The strong scaling efficiency reaches 74% with 10.5 billion cells on 257,600 cores, and the weak scaling parallel efficiency reaches 89.8% with 10.5 billion cells on 179,200 cores. This work lays an important foundation for the fast design of aircraft and the development of cutting-edge numerical methods.


Introduction
Computational fluid dynamics (CFD) is a discipline that uses numerical methods and computers to study flow problems. Compared with traditional wind tunnel experiments, CFD techniques have numerous advantages such as low cost and fast prediction. Today, CFD plays a role in aerodynamic development in the aircraft industry, combustion studies, turbomachinery simulations, and various other applications [1,2].
CFD has been a great success, mainly due to the rapid development of computers in recent decades [3]. Based on modern clusters, the simulation of a typical case usually takes only a few hours. However, the quest for computational resources in CFD is never-ending, and as fidelity requirements increase and numerical methods evolve, higher demands are placed on computational resources. For example, when direct numerical simulation is used, flow simulations of even simple shapes typically require billions of grid cells and months of computational time [4]. According to predictions, supercomputers will not be able to fully satisfy the demands of high-fidelity CFD methods even by 2050 [5].
Fortunately, an E-class (exascale) supercomputer has been successfully developed, which is an important milestone in computer hardware and, to some extent, eases the current dilemma in the development of CFD methods [6]. However, applications develop much more slowly than computer hardware, and current CFD software has difficulty in efficiently utilizing hardware resources [7]. Supercomputers are divided into homogeneous and heterogeneous systems, and even on homogeneous supercomputer systems, CFD flow solvers with high scalability are scarce. Currently, the largest machines have O(1,000,000) cores, yet few applications can efficiently utilize more than O(1000) cores, so the development of efficient parallel methods is a pressing problem. Homogeneous systems are still the mainstream supercomputer systems, and the development of massively parallel algorithms for homogeneous supercomputers is an important research direction.
Many well-known CFD applications, such as OpenFOAM, FUN3D, SU2, and OVERFLOW, have carried out a large amount of massively parallel research work and achieved fruitful results. Mohammed A. Al Farhan [8] re-evaluated the MPI/OpenMP hybrid programming model based on PETSc-FUN3D and carried out a strong scalability study on KAUST's Cray XC40 system Shaheen II. For the ONERA M6 wing with 2,761,774 grid points, the performance was tested using 3,072 nodes with a total of 98,304 cores, with a parallel efficiency of 13.2% relative to 512 cores. Ahmet Duran et al. [9] studied the scalability of the icoFOAM solver of OpenFOAM for the two-dimensional problem of arterial blood flow. The simulated mesh contains 6.4 million cells, the maximum number of cores used for parallel computation is 16,384, and the parallel efficiency is 54.7% with 128 cores. Scalability tests of MPI/OpenMP hybrid parallel simulations based on SU2 were performed by Thomas D. Economon [10]. For the ONERA M6 model with 818,921 grid points, the parallel efficiency was 73% when 60 cores were adopted. Haoqiang Jin et al. [11] carried out similar work based on NASA's OVERFLOW and AKIE codes.
In addition, many in-house codes have conducted research on massively parallel computing [12-16]. For example, Feng He et al. designed a coarse-grained MPI/OpenMP hybrid parallel framework based on their in-house solver, which overlaps non-blocking MPI communication with OpenMP shared-memory computation to obtain super-linear acceleration. The parallel efficiency for NASA Rotor 35 can reach 153% when the core count reaches 3,456.
Thanks to the efforts of many CFD researchers, massively parallel computing technology in CFD has made great progress. However, some difficulties remain: grid sizes rarely exceed the order of 10 billion cells, core counts rarely exceed the order of 100,000, the computational models used are relatively simple, and the parallel efficiency is difficult to improve. Therefore, efficient large-scale simulation of complex geometries is still a problem that needs to be solved.
In this paper, we design a novel CFD parallel strategy based on the open-source CFD software PHengLEI [17,18] and the pure MPI parallel mode. The novel parallel strategy has good scalability and is suitable for large-scale parallel computation, and its correctness is also proved from a theoretical point of view. The strategy uses asynchronous communication and packaged communication, combining multiple communication processes into one and effectively reducing the number of communications. Three validation cases are carried out, and the results show that the parallel efficiency reaches 74% in the simulation of a real vehicle model with 10.5 billion cells on 257,600 cores, which is a significant advantage in scalability compared with conventional methods.
The rest of this paper is organized as follows. Section 2 describes the key techniques and implementation of the novel parallel strategy in detail. In Sect. 3, three numerical experiments are conducted to verify the correctness, convergence, and scalability of the novel parallel strategy and to illustrate its scalability advantages in the simulation of a real aircraft. Section 4 concludes the paper.

Introduction to PHengLEI
PHengLEI [18] is a hybrid, open-source CFD platform developed by the China Aerodynamics Research and Development Center (CARDC). PHengLEI was open-sourced in China in 2020, and the source code has been cloned over 1,000 times by users from universities, institutes, and companies. PHengLEI is playing a role in China's CFD industry.
Based on the C++ programming language, PHengLEI provides a powerful and flexible architecture and data structure. Full-speed-regime flow problems, including subsonic, transonic, supersonic, and hypersonic flows, can be accurately simulated with PHengLEI. A variety of computational models and numerical methods are incorporated in the software. The most attractive feature of PHengLEI is that it includes both a structured grid solver and an unstructured grid solver, so it supports structured and unstructured grids simultaneously. The two solvers can work independently on distinct problems, or they can work together on the same problem. This paper focuses on the unstructured grid solver and on laminar flow problems.

Governing equation and numerical solution
This section gives a brief overview of the governing equations and the discretization scheme. The governing equations for the flow problem are the Navier-Stokes equations. Based on the perfect gas model, the Navier-Stokes equations can be expressed in integral form as:

\frac{\partial}{\partial t} \int_{\Omega} Q \, d\Omega + \oint_{\partial\Omega} \left( F_c - F_v \right) dS = 0 \qquad (1)

where Q represents the conservative variables, F_c and F_v are the convective and viscous fluxes respectively, Ω is the control volume, and ∂Ω is the outer boundary of the control volume Ω.
The discretization method in this paper is the cell-centered finite volume method on unstructured grids. The solution process includes two parts: spatial discretization and temporal discretization. Spatial discretization mainly involves the convective and viscous fluxes, while the temporal discretization solves the system of equations formed after spatial discretization.
The spatial terms are integrated over the control volume and discretized using the Gauss-Green formula, and the temporal term is discretized with an implicit method, which gives the discrete form of the NS equations as:

\frac{V_i}{\Delta t} \left( Q_i^{n+1} - Q_i^{n} \right) + \sum_{j \in N(i)} \left( F_{c,ij}^{n+1} - F_{v,ij}^{n+1} \right) dS_{ij} = 0 \qquad (2)

where V_i is the volume of control volume i, Δt represents the time step, Q_i^{n+1} is the vector of conservative variables in cell i at step n+1, Q_i^n is that at step n, N(i) is the set of neighbor cells of cell i, j is one cell of that set, F_{c,ij}^{n+1} represents the convective flux on the common face of cells i and j at step n+1, F_{v,ij}^{n+1} is the viscous flux on that face at step n+1, and dS_{ij} is the area of the common face.
The terms marked with n are values at step n, and those marked with n+1 are values at step n+1. The values at step n are known, and the values at step n+1 are to be solved. There are several methods for solving Eq. (2), among which the Lower-Upper Symmetric Gauss-Seidel (LU-SGS) method is a relatively stable choice. After the linearization used by the LU-SGS method, Eq. (2) can be written as:

M_i^n \, \Delta Q_i^n + \sum_{j \in N(i)} M_{ij}^n \, \Delta Q_j^n = - \sum_{j \in N(i)} \left( F_{c,ij}^{n} - F_{v,ij}^{n} \right) dS_{ij} \qquad (3)

where ΔQ_i^n is the variation of the conservative variables in cell i at step n, ΔQ_j^n has the same meaning for cell j, M_i^n is the coefficient of the formed matrix at position (i, i), M_{ij}^n is the coefficient at position (i, j), F_{c,ij}^n represents the convective flux on the common face of cells i and j at step n, F_{v,ij}^n is the viscous flux on that face at step n, and dS_{ij} is the area of the common face.
ΔQ^n is the only unknown in Eq. (3); the other terms are derived from known variables. It must be clarified that F_c and F_v in Eq. (3) are discrete expressions, not the flux expressions of Eq. (1). This point is important in the following proof.
The inviscid flux is computed using the Roe scheme [19] with second-order accuracy, which consists of three steps. Taking the flux calculation in Fig. 1 as an example, the first step is to calculate the gradients in each cell, the second step is to construct the left and right face states with second-order accuracy, and the third step is to compute the face flux using the Roe scheme. When constructing the left and right face states, the Venkatakrishnan limiter [20] is usually used to ensure the stability of the solution. If the face involved in the inviscid flux calculation is an interface between two zones, some data need to be transferred during the parallel calculation. In PHengLEI, the quantities transferred during the inviscid flux calculation include the limiter coefficients and the gradients of the primitive variables ∇q.
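As an illustration of the second step, a minimal sketch of the limited linear reconstruction is given below. The type and function names are illustrative and not taken from the PHengLEI source; the scalar factor phi stands in for the Venkatakrishnan limiter value of the cell.

```cpp
// Illustrative sketch (not PHengLEI code): limited second-order reconstruction of
// the left face state from cell i. The right state is built analogously from the
// neighboring cell, and both states are then passed to the Roe flux routine.
struct Vec3 { double x, y, z; };

double ReconstructFaceState(double q_i,         // cell-centered primitive value
                            const Vec3& gradQ,  // gradient of q in cell i (step 1)
                            double phi,         // limiter factor of cell i, in [0, 1]
                            const Vec3& rIF)    // vector from cell center to face center
{
    // Step 2: limited linear extrapolation from the cell center to the face center.
    return q_i + phi * (gradQ.x * rIF.x + gradQ.y * rIF.y + gradQ.z * rIF.z);
}
```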
The viscous fluxes are calculated using a central difference scheme, which consists of two steps. The first step is to calculate the values at the face center using a weighted average of the values at the left and right cell centers. The second step is to calculate the viscous flux by substituting the face-center values into the viscous flux formula. If the face used in the viscous flux calculation lies on an interface created by partitioning, data transfer is required. In PHengLEI, the variables transferred for the viscous flux include the temperature T and the temperature gradient ∇T.
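A corresponding sketch of the first step is shown below; the inverse-distance weighting is one simple choice assumed for illustration and is not necessarily the weighting used in PHengLEI.

```cpp
// Illustrative sketch: face-center value from the two adjacent cell-center values,
// using distances dL and dR from the cell centers to the face center.
double FaceCenterValue(double qL, double qR, double dL, double dR)
{
    double wL = dR / (dL + dR);   // the closer cell receives the larger weight
    double wR = dL / (dL + dR);
    return wL * qL + wR * qR;     // step 1; step 2 inserts this value (e.g., T)
                                  // into the viscous flux formula
}
```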
The treatment of the time term is crucial for the convergence and robustness of the numerical solution. The LU-SGS method is popular because it is matrix-free and has excellent robustness. It consists of two steps: a forward sweep and a backward sweep.
Splitting the coefficient matrix of Eq. (3) as M^n = L + D + U, where L and U are respectively the strictly lower and upper triangular parts of M^n and D is its diagonal part, the two sweeps can be written as:

(D + L)\,\Delta Q^{*} = R^{n}, \qquad (D + U)\,\Delta Q^{n} = D\,\Delta Q^{*} \qquad (4)

where R^n is the right-hand side of Eq. (3), ΔQ^* is the intermediate result of the forward sweep, and ΔQ^n is the variable to be solved. ΔQ^n in the whole flow field can be obtained from Eq. (4). In parallel computation, ΔQ^n is required to be transferred between zones after the forward sweep is completed.
After solving for ΔQ^n, the conservative and primitive variables are updated using ΔQ^n, and the updated primitive variables q are communicated in the variable-update step. The whole iterative solving process can thus be summarized in six steps: gradient computation, limiter computation, inviscid flux evaluation, viscous flux evaluation, LU-SGS time marching, and variable update.
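Written out explicitly (this form is inferred from the definition of ΔQ above, not quoted from the original equations), the update step is:

Q_i^{n+1} = Q_i^{n} + \Delta Q_i^{n}, \qquad q_i^{n+1} = q\left(Q_i^{n+1}\right)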

A basic communication method in CFD
Communication methods in massively parallel computation have a great impact on parallel efficiency. This section describes the conventional communication method in CFD parallel computing. The main study in this paper is carried out in the pure MPI (message passing interface) mode, which is the most important parallel mode in massive parallelism. In simulations of practical engineering configurations, the number of grid cells usually reaches tens of millions, and the problem is usually solved with a domain decomposition strategy. First, the mesh is partitioned into several sub-zones with the METIS [21] tool. Second, each sub-zone is assigned to a process and the connectivity between the sub-zones is established. Finally, the CFD application performs the iterative calculation, which contains a large number of communication processes. In the following, we discuss these three steps in detail using the grid in Fig. 2 as an example. An initial grid with 16 cells is partitioned into four zones. The four zones are labeled 1-4, and each zone is assigned to one CPU core. It is easy to see that each zone owns 4 cells, so the load on each processor is balanced; the principle of ensuring load balance is the same for large-scale problems. Afterward, connection relationships and communication data structures between the four zones are established.
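For concreteness, the sketch below partitions the 16-cell example grid of Fig. 2 into four zones with the METIS C API. The graph construction and the call to METIS_PartGraphKway are standard METIS 5.x usage, but the example itself is illustrative and is not part of PHengLEI; error handling is omitted.

```cpp
// Partition the 4x4 example grid of Fig. 2 into 4 zones with METIS.
#include <metis.h>
#include <cstdio>
#include <vector>

int main()
{
    const int nx = 4, ny = 4;
    idx_t nvtxs = nx * ny, ncon = 1, nparts = 4, objval = 0;
    std::vector<idx_t> xadj(nvtxs + 1, 0), adjncy, part(nvtxs, 0);

    // Build the cell-adjacency graph (CSR format) of the structured 4x4 grid.
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) {
            int c = j * nx + i;
            if (i > 0)      adjncy.push_back(c - 1);
            if (i < nx - 1) adjncy.push_back(c + 1);
            if (j > 0)      adjncy.push_back(c - nx);
            if (j < ny - 1) adjncy.push_back(c + nx);
            xadj[c + 1] = static_cast<idx_t>(adjncy.size());
        }

    METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                        nullptr, nullptr, nullptr, &nparts,
                        nullptr, nullptr, nullptr, &objval, part.data());

    for (int c = 0; c < nvtxs; ++c)
        std::printf("cell %2d -> zone %d\n", c, static_cast<int>(part[c]));
    return 0;
}
```

In practice the cell-adjacency graph is extracted from the unstructured mesh connectivity, and vertex weights can be supplied when the per-cell cost is not uniform.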
The connection relationship includes information on neighboring zones and neighboring cells. The connection relationship for Zone 1 is shown in Fig. 3a. Zone 1 has two neighboring zones, Zone 2 and Zone 3. Cells 1 and 3 of Zone 1 neighbor cells 0 and 2 of Zone 2, respectively, and cells 2 and 3 of Zone 1 neighbor cells 0 and 1 of Zone 3, respectively. Zone 1 creates corresponding ghost cells for storing the data sent from its two neighboring zones, as shown by the blue arrows in Fig. 3b.
There are a large number of communication processes in a complete CFD solution; in this paper we only consider the communication in the iterative process. Figure 4 depicts the communication in one iteration step. In the calculation of the convective flux, the variables that need to be communicated include the gradients of the primitive variables (∇q) and the limiter coefficients. The variables to be communicated in the calculation of the viscous flux include the temperature T and the temperature gradient (∇T). The variables to be communicated during the time-marching process are the increments ΔQ. The variable-update process requires communication of the primitive variables (q). After the iteration is completed, the next iteration step repeats the same pattern. A single interface exchange of this kind is sketched below.
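The sketch assumes an Interface structure that mirrors the connection relationship of Fig. 3; the structure and the function name are illustrative, not PHengLEI's actual data structures. In the basic strategy, an exchange of this kind is issued for every communicated variable in every iteration step.

```cpp
// Illustrative per-variable exchange of one field across one zone interface.
#include <mpi.h>
#include <cstddef>
#include <vector>

struct Interface {
    int neighborRank;              // MPI rank owning the neighboring zone
    std::vector<int> sendCells;    // local cells whose values are sent
    std::vector<int> ghostCells;   // ghost cells filled with received values
};

void ExchangeField(std::vector<double>& field, const Interface& itf, MPI_Comm comm)
{
    std::vector<double> sendBuf(itf.sendCells.size()), recvBuf(itf.ghostCells.size());
    for (std::size_t n = 0; n < itf.sendCells.size(); ++n)
        sendBuf[n] = field[itf.sendCells[n]];

    MPI_Request reqs[2];
    MPI_Isend(sendBuf.data(), static_cast<int>(sendBuf.size()), MPI_DOUBLE,
              itf.neighborRank, 0, comm, &reqs[0]);
    MPI_Irecv(recvBuf.data(), static_cast<int>(recvBuf.size()), MPI_DOUBLE,
              itf.neighborRank, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    for (std::size_t n = 0; n < itf.ghostCells.size(); ++n)
        field[itf.ghostCells[n]] = recvBuf[n];   // update the ghost cells
}
```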

Asynchronous communication
In CFD parallel computation, synchronous or immediate communication is the most common choice, resulting in a large amount of communication in each iteration step. In the asynchronous strategy adopted here, the interface data are not exchanged at every stage of the iteration; instead, the ghost cells retain the values of the previous iteration step, and all exchanges are aggregated into a single communication performed after the iteration step is completed. The error introduced by using these slightly outdated interface values can be estimated term by term.

The difference between the temporal terms of the two communication methods is a negligible second-order quantity. The difference in the convective flux between the basic and asynchronous communication methods is likewise a second-order quantity, and the difference in the viscous flux between the two methods is also of second order. The temporal term usually requires only first-order accuracy, so the difference in the time term can be ignored. The spatial terms usually require second-order accuracy, so the difference in the spatial terms can lead to a difference in the results during the iteration. However, this difference is small and exists only at the interfaces, which limits its impact. Moreover, when the iterations converge, the flow field no longer changes with time and this difference vanishes, thus ensuring that the converged results are correct.

Algorithm 2 details the asynchronous communication process in the PHengLEI software. The 'RegisterInterField' function is used to register the variables that need to be communicated. 'InviscidFlux', 'ViscousFlux', 'LUSGS', and 'Update' are the four steps of an iteration. The function 'UploadInterfaceValue' collects the variables to be communicated on the zone interfaces, the function 'CommunicateInterfaceValue' transfers the data between zones using MPI, and the function 'DownloadInterfaceValue' updates the variables in the ghost (virtual) grid cells.
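The structure of Algorithm 2 can be summarized by the following outline. The call ordering and function names follow the description above, while the signatures are illustrative stand-ins (implementations omitted) rather than the actual PHengLEI interfaces; the placement of the registration call before the loop is assumed.

```cpp
// Outline of the aggregated asynchronous iteration (Algorithm 2); bodies omitted.
struct Zone;                                 // fields and interface data of one grid zone

void RegisterInterField(Zone& zone);         // register every variable to be communicated
void InviscidFlux(Zone& zone);               // uses ghost-cell data of the previous step
void ViscousFlux(Zone& zone);
void LUSGS(Zone& zone);
void Update(Zone& zone);
void UploadInterfaceValue(Zone& zone);       // pack all registered interface variables
void CommunicateInterfaceValue(Zone& zone);  // one MPI exchange per neighboring zone
void DownloadInterfaceValue(Zone& zone);     // unpack into the ghost (virtual) cells

void Iterate(Zone& zone, int nSteps)
{
    RegisterInterField(zone);                // done once before the iterations (assumed)
    for (int step = 0; step < nSteps; ++step) {
        InviscidFlux(zone);
        ViscousFlux(zone);
        LUSGS(zone);
        Update(zone);
        UploadInterfaceValue(zone);          // a single aggregated communication
        CommunicateInterfaceValue(zone);     // per iteration step
        DownloadInterfaceValue(zone);
    }
}
```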

Packaged communication
There are three steps in the packaged communication mode. First, each zone compresses the variables to be communicated into a DataContainer. Second, each zone sends the compressed DataContainer to the other zones and receives the DataContainers sent from the other zones (Fig. 6). Third, each zone reads the data in the received DataContainer and updates the data in the ghost cells. Packaged communication reduces multiple MPI send/receive processes to one. The purpose of packaged communication is to aggregate all communication processes, and the key technology lies in the design of the DataContainer. The DataContainer is compatible with all types of data, such as int, bool, double, float, etc. The design principle of the DataContainer is shown in Fig. 7.
The DataContainer class contains a member variable 'data' of vector type to store the variables to be communicated. Its member functions include 'write', 'read', 'send', and 'receive', which are used to compress, decompress, send, and receive data, respectively, during communication.
The "write" function compresses the variables to be communicated into the DataContainer. Before writing the data, the DataContainer first checks whether there is enough memory space left. If there is enough memory space left, the value is copied directly to the remaining memory, and if there is not enough memory space, a new memory is first opened and then written. The "write" process converts the data to char type and then copies it at the memory level, so it is suitable for storing any type of data. "read" is the corresponding function to "write", which is used to read data from the DataContainer and store it in a preprepared array. After the copy operation, the amount of data remaining in the DataContainer decreases until all data has been read. It is important to note that the order of data reading and writing is the same.
The 'send' function sends the DataContainer to the other zones by MPI, and the other zones receive it afterward. Correspondingly, the 'receive' function is used to receive the DataContainer from other processes; after receiving it, the zone allocates the memory space and reads all the data from the DataContainer.
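A minimal, self-contained sketch of such a container is given below. It illustrates the write/read/send/receive idea described above, assuming raw byte packing and plain MPI point-to-point calls; it is not the PHengLEI implementation.

```cpp
// Illustrative DataContainer-like class: values of any plain type are packed into
// a byte buffer, the buffer is exchanged with a single MPI message, and the
// receiver reads the values back in exactly the order they were written.
#include <mpi.h>
#include <cstddef>
#include <cstring>
#include <vector>

class DataContainer {
public:
    template <typename T>
    void write(const T* values, std::size_t count) {           // compress
        const char* bytes = reinterpret_cast<const char*>(values);
        data.insert(data.end(), bytes, bytes + count * sizeof(T));
    }

    template <typename T>
    void read(T* values, std::size_t count) {                  // decompress
        std::memcpy(values, data.data() + readPos, count * sizeof(T));
        readPos += count * sizeof(T);                           // read order == write order
    }

    void send(int destRank, int tag, MPI_Comm comm) const {
        MPI_Send(data.data(), static_cast<int>(data.size()), MPI_CHAR, destRank, tag, comm);
    }

    void receive(int srcRank, int tag, MPI_Comm comm) {
        MPI_Status status;
        int nbytes = 0;
        MPI_Probe(srcRank, tag, comm, &status);                 // find the incoming size
        MPI_Get_count(&status, MPI_CHAR, &nbytes);
        data.resize(static_cast<std::size_t>(nbytes));
        MPI_Recv(data.data(), nbytes, MPI_CHAR, srcRank, tag, comm, MPI_STATUS_IGNORE);
        readPos = 0;
    }

private:
    std::vector<char> data;    // packed variables of mixed types
    std::size_t readPos = 0;   // current read position
};
```

With such a container, one iteration step needs only one send/receive pair per pair of neighboring zones, regardless of how many variables are packed.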

Results and discussion
In this section, three numerical experiments are carried out to verify the correctness, convergence, and scalability of the novel method. The first case is an ONERA M6 wing with 295,000 cells, which is used to verify that the novel method has the same accuracy and convergence as the conventional method. The second is a real vehicle configuration with 139 million cells, which is used to compare the parallel efficiency of the novel method with that of the conventional method. The third case is the same vehicle with 10.5 billion cells, which is used to verify the scalability of the novel method in massively parallel computing. It is important to note that all cases in this paper use the pure MPI parallel mode.
The tests in this section were carried out on the Shanhe supercomputer at the National Supercomputing Center in Jinan. The Shanhe supercomputer has 5,400 computing nodes connected by an InfiniBand SDR network. Each computing node is equipped with two Intel Xeon Gold 6285R processors, each with 28 cores, and 192 GB of memory in total. The MPI library is MPICH 3.3.2, and the compiler is GCC 7.5.0.

Correctness and convergence verification
Both the correctness and the convergence verification require many iterations to obtain converged solutions. When the mesh is large, obtaining a converged solution takes a lot of time and computational resources, so this section uses a small mesh for the comparison and analysis. The ONERA M6 wing is used for the correctness verification; it is a widely studied validation model with experimental data available for comparison. The freestream conditions are a Mach number of 0.8395, a reference temperature of 255 K, a reference pressure of 315,979 Pa, and an angle of attack of 3.06 degrees. The computational grid (Fig. 8) with 295,000 cells is divided into 8 zones and is computed in parallel using both the novel and the basic parallel strategy. Ten thousand iteration steps are performed to ensure that the algorithm converges.

Figure 9 shows the pressure distribution curves at the z = 0.2 and z = 0.65 positions, respectively. The red triangles represent the results of the novel strategy, the green line represents the results of the conventional strategy, and the black circles represent the experimental data. The results of the novel strategy and the basic strategy are almost identical, and both methods agree well with the experimental data, which indicates that the novel strategy obtains the same accuracy as the basic method.

Figure 10 shows the residual convergence curves; the red line represents the novel method and the green line the conventional method. During the initial iterations the residuals of the two methods differ slightly, and when the computation converges, the residual convergence curves almost completely overlap. This indicates that the novel strategy and the conventional strategy have the same convergence speed and convergence level.

Figure 11 shows the convergence curves of the lift and drag coefficients. The red line represents the novel method and the green line the basic method; the two methods eventually obtain consistent converged solutions. The lift and drag results of both methods are consistent, which further shows the correctness of the novel method. The convergence speed of the lift and drag coefficients is also almost the same for the two methods, which indicates that the novel method does not slow down convergence.

Scalability comparison
This section focuses on the comparison and analysis of the novel and basic methods in terms of scalability. The CHN-T (CHiNa-Transport) model [22] is a standard single-aisle transport aircraft designed by the China Aerodynamics Research and Development Center (CARDC). The CHN-T model consists of wings, a fuselage, a horizontal tail, a vertical tail, a pylon, a nacelle, and other components. Figure 12 shows the simplified CHN-T model without the nacelle and pylon. The flow conditions in this experiment are a Mach number of 0.2, a Reynolds number per meter of 6.5 × 10^6, and a reference temperature of 288.15 K. The computational grid contains 0.139 billion cells; it is created by refining a coarse mesh of 17.3 million cells, an 8× increase in the number of cells.
In this case, 1, 7, 56, 560, 1,120, and 11,200 cores are used for the calculation. Figure 13 gives the parallel efficiency comparison curves; the x-axis is the number of computational cores in logarithmic coordinates, and the y-axis is the parallel efficiency. The red line is the parallel efficiency curve of the novel strategy, and the green line is that of the basic method. When a single process is used, there is no communication, and the computation time of both methods is the same. As the number of processes increases, the parallel efficiency of both methods decreases, but the basic method decreases much more severely. In particular, when the number of processes reaches 11,200, the parallel efficiency of the basic method drops to 25.4%, while the parallel efficiency of the novel strategy can still reach 81.5%. The parallel efficiency of the novel strategy is significantly higher than that of the basic method, which shows that the novel method has better scalability. Figure 14 gives the comparison of the speedup. The red and green lines are the speedup curves of the novel and the basic method, respectively, and the purple line is the ideal speedup. The novel method stays closer to the ideal speedup than the basic method as the number of processes increases, which indicates that the novel method has a better speedup.
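For reference, the speedup and strong-scaling parallel efficiency quoted here are assumed to follow the usual definitions relative to the single-process run:

S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N} = \frac{T(1)}{N \, T(N)}

where T(N) is the wall-clock time per iteration when N cores are used.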
When the number of processes increases, the decrease in parallel efficiency is mainly due to the dramatic increase in the number of communications. Figure 15 shows the total number of MPI sends/receives per communication and the average number of sends and receives per single process. When the number of partitions becomes larger, the number of MPI sends and receives per communication operation increases dramatically. Asynchronous and packaged communication reduces more than 20 communication processes to one. When the number of partitions is larger, the number of MPI sends/receives reduced by the novel method is substantial.
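A rough estimate consistent with Fig. 15: if the mesh is split into P partitions, each partition has on average k neighboring interfaces, and m separate variables are exchanged per iteration (here m > 20), the basic strategy issues roughly

N_{\text{basic}} \approx m \, k \, P

send/receive operations per iteration, whereas the packaged strategy issues only about N_{\text{packaged}} \approx k \, P, a reduction by the factor m.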
The packaging and unpackaging process does not add much cost. Table 1 shows the cost analysis of packaging and unpackaging for different partitioning cases. When the number of processes increases, the average packaging and unpackaging time of a single process decreases, but its percentage of the single-step computation time increases. The decrease in packaging and unpackaging time is mainly due to the decrease in the amount of interface data handled by each process. The novel strategy does not lead to an increase in network bandwidth requirements. When the number of processes increases, although the total number of communications increases dramatically, the average number of communications per process increases much less (Fig. 16), and the decrease in the average number of neighboring interfaces per process also reduces the communication volume per process. Taking 56 partitions as an example, computed on one node (56 cores), the average length of each communicated array is 28,211.6 double-precision values and each package contains 20 such arrays, so the data sent by the whole node amount to about 252.77 MB. The bandwidth of supercomputers is usually tens or even hundreds of GB/s, so packaged communication does not impose harsh bandwidth requirements.
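The 252.77 MB figure follows directly from the quantities quoted above:

28{,}211.6 \ \text{values} \times 8 \ \text{bytes} \times 20 \ \text{arrays} \approx 4.51 \ \text{MB per zone}, \qquad 4.51 \ \text{MB} \times 56 \ \text{zones} \approx 252.8 \ \text{MB per node}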

Hyper-massive parallel computing validation
This section carries out strong and weak scalability tests in hyper-massive parallel computing with the CHN-T model and the same computational conditions as in Sect. 3.2. Due to the limitation of computational resources, only the parallel efficiency is tested in this section, and no converged solution is obtained. The computational grid for the strong scalability test contains 10.5 billion cells and requires approximately 11 TB of disk space; this large-scale grid is obtained by refining the CHN-T coarse grid (Fig. 17). At such a scale, the computation takes much longer or even fails when the conventional method is used, so the performance curve of the conventional method is not given. From the parallel efficiency curve, it can be seen that the parallel efficiency is maintained at 74% when the number of cores reaches 257,600, indicating that the novel method has good scalability. Figure 18 shows the actual speedup based on the novel communication strategy compared with the ideal speedup. Even when the number of cores is extended to 257,600, a good acceleration effect is still obtained. Test results at such a scale are rare in the literature; to the best of our knowledge, no similar work has been reported for mainstream CFD software such as SU2, OpenFOAM, and FLUENT, which indicates that the communication strategy proposed in this paper is highly advanced.
The weak scaling efficiency is another important measure of scalability. The weak scaling test uses the same computational model and freestream conditions as the strong scaling test. The numbers of computing cores used in this test are 350, 2,800, 22,400, and 179,200, respectively. The average load of each core is 58,746 cells, and the total grid size ranges from about 20 million to 10.5 billion cells. The parallel efficiency curve for the weak scaling test, taking the computation time with 350 cores as the baseline, is shown in Fig. 19.
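The total problem sizes follow from the fixed per-core load:

350 \times 58{,}746 \approx 2.06 \times 10^{7} \ \text{cells}, \qquad 179{,}200 \times 58{,}746 \approx 1.05 \times 10^{10} \ \text{cells}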
As the number of cores increases, the weak scaling efficiency hardly decreases. In particular, when the number of cores reaches 179,200 and the grid size reaches 10.5 billion cells, the parallel efficiency can still be maintained at 89.8%. The test results of weak scaling parallel efficiency further demonstrate the high scalability of the novel method, which is suitable for conducting hyper-large scale parallel simulations of complex flow problems.

Conclusion
Massively parallel computing is a key technique in CFD simulations of complex flow problems. In this paper, we propose a novel parallel strategy that combines asynchronous communication and packaged communication. The novel strategy aggregates all communication processes and performs one communication after the completion of each iteration step, which significantly reduces the number of communications. The novel strategy obtains computational results consistent with conventional methods, and this is proved from a theoretical point of view. Moreover, the correctness and convergence are verified on the M6 wing case, showing that the novel strategy has the same computational accuracy and convergence as the conventional method. A detailed performance comparison and analysis is carried out for the real vehicle CHN-T model with 139 million cells, showing that the novel strategy has obvious advantages in scalability compared with the conventional strategy and that its additional cost is small. Finally, strong and weak scalability tests of hyper-massive parallel computation are carried out for the real aircraft, and good scalability is obtained in both. For the strong scaling test, the CHN-T model with 10.5 billion cells was used, and the strong scaling efficiency reaches 74% when the number of cores reaches 257,600. For the weak scaling test, the parallel efficiency reaches 89.8% with 10.5 billion cells and 179,200 cores.
The novel strategy enables large-scale parallel computation for the simulation of complex flow problems, which provides important support for the development of fast design and high-precision methods for air vehicles. However, due to the limitation of computational resources, only performance tests are conducted in this paper for the hyper-large-scale computations; converged results are not obtained, and other issues such as storage and visualization of the results need further research.

Data Availability Statement
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Conflict of interest None
Informed consent All authors agreed with the content, gave explicit consent to submit, and obtained consent from the responsible authorities at the institute where the work was carried out before submission.

Consent for publication
All authors agreed with the content for publication.