Parallel implementations of randomized vector algorithm for solving large systems of linear equations

The results of a parallel implementation of a randomized vector algorithm for solving systems of linear equations are presented in the paper. The solution is represented in the form of a Neumann series. The stochastic method computes this series by sampling only random columns, avoiding matrix-matrix and matrix-vector multiplications. We consider the case when the matrix is too large to fit in random-access memory (RAM). We use two approaches to solve this problem. In the first approach, the matrix is divided into parts that are distributed among MPI processes and stored in the available RAM of the cluster nodes. In the second approach, the entire matrix is stored on each node’s hard drive, loaded into RAM, and processed in parts. Independent Monte Carlo experiments for random column indices are distributed among MPI processes or OpenMP threads for both approaches to matrix storage. The efficiency of the parallel implementations is analyzed. Results are given for systems governed by dense matrices of size $10^4$ and $10^5$.


Introduction
Large systems of linear algebraic equations arise in many scientific and engineering fields, such as physics, chemistry, ecology, and financial market analysis. Some problems require solving very large dense linear systems, such as three-dimensional geophysical analysis of information obtained from seismological and gravity datasets [1], fluid flow modeling in biology [2], and electromagnetic applications [3].
Solving large linear systems requires a lot of time and memory, so fast, efficient methods are needed. There are various approaches to solving large-scale problems: factorization methods [4-7]; iterative, projection, and Krylov subspace methods [8, 9]; and Monte Carlo methods [10]. In many important applications, solutions of moderate accuracy are satisfactory, and Monte Carlo methods make it possible to obtain such solutions. The time needed to solve large linear systems with moderate accuracy by Monte Carlo methods can be significantly less than the time of high-precision calculations using traditional deterministic methods. Moreover, Monte Carlo algorithms are very efficient and scalable on parallel platforms [11].
The application of the Monte Carlo method to solving linear systems was originally proposed by von Neumann and Ulam [12]. Various techniques have been developed to improve this Monte Carlo algorithm: the sequential Monte Carlo method [13], the improved "Walk on Equations" method [14], and variance reduction methods [15, 16]. In [17, 18] the convergence of Markov chain based algorithms is analyzed. Parallel implementations of Monte Carlo methods for solving linear systems are given for cases without matrix distribution [19] and with the matrix distributed over different computational nodes [20, 21].
In [22], a new randomized vector algorithm for solving systems of linear equations is suggested. The method is based on a special representation of the matrix through a stochastic matrix. The idea of this stochastic method is a randomized vector representation of matrix iterations. The calculations in this algorithm are very simple and economical because it computes the matrix iterations by sampling random rows and columns instead of performing matrix-matrix and matrix-vector multiplications.
In this paper, we present parallel implementations of the randomized vector algorithm proposed in [22] for matrices that are too large to fit in the memory of a single computer node. The motivation of our research lies in the need for a fast and cheap solver for large systems of linear algebraic equations, in particular for solving differential and integral equations of mathematical physics. We consider two approaches to implementing the algorithm for large matrices. In the first approach, the matrix is divided into parts so that each sub-matrix fits in the memory of one node. In the second approach, the whole matrix is stored on the hard disk, loaded into random-access memory (RAM), and processed in parts.

The rest of the paper is structured as follows. Section 2 describes the randomized vector algorithm for solving a system of linear equations. Section 3 presents the sequential and parallel implementations of the algorithm, and the efficiency of the parallel implementations is shown for systems governed by dense matrices of size $10^4$ and $10^5$.
Random vector algorithm for solving a system of linear equations

Random vector estimators for stochastic matrix iterations
Let us consider a system of linear algebraic equations
$$x = Ax + b. \qquad (1)$$
In the case when the spectral radius of the matrix A is less than unity, $\rho(A) < 1$, the solution of the system (1) can be represented as a Neumann series [22]:
$$x = \sum_{k=0}^{\infty} A^k b. \qquad (2)$$
In [22] an algorithm for the randomized calculation of the matrix iterations $A^k b$ is proposed. Let $b \in \mathbb{R}^n$ be a nonnegative stochastic vector, and let A be a nonnegative $n \times n$ column stochastic matrix. A nonnegative vector is a stochastic vector if the sum of its elements is equal to unity, $\sum_{i=1}^{n} b_i = 1$. A matrix is a column stochastic matrix if all its columns are stochastic vectors: $a_{ij} \geq 0$ and $\sum_{i=1}^{n} a_{ij} = 1$ for $j = 1, 2, \ldots, n$. Hereinafter, we denote by $A_k$ the kth column of the matrix A.
According to [22], a random unbiased vector estimator for the product $Ab$ is given by the column $A_k$:
$$Ab = \mathbb{E}\{A_k \mid p_1 = b\}, \qquad (3)$$
where k is a random index chosen from $\{1, 2, \ldots, n\}$ according to the probability distribution $p_1 = b$. Here $\{A_k \mid p_1 = b\}$ means that the mathematical expectation is taken over random columns chosen from the distribution $p_1$. The random unbiased vector estimator for the second iteration $A^2 b$ is defined as follows:
$$A^2 b = \mathbb{E}\{A_j \mid p_2 = A_k,\ p_1 = b\}. \qquad (4)$$
This means that we first select a random column $A_k$ from the distribution $p_1 = b$ and then select a random column $A_j$ from the distribution $p_2 = A_k$.
The unbiased estimator for the third iteration is obtained by further sampling a random column $A_i$ from the distribution $p_3 = A_j$, etc. For the kth iteration $A^k b$ we obtain
$$A^k b = \mathbb{E}\{A_{j_k} \mid p_k = A_{j_{k-1}}, \ldots, p_1 = b\}, \qquad (5)$$
where each subsequent random column is selected from the distribution given by the previously selected column. The mathematical expectation is taken over the random independent Monte Carlo trajectories consisting of the random column indices.
Note that to calculate the realization of the random vector $A_{j_k}$ for the kth iteration, we calculate the realizations of the random vectors $A_{j_{k-1}}, \ldots, A_{j_1}$ for all previous iterations.
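To make the sampling scheme concrete, the following sketch implements one Monte Carlo trajectory of estimator (5) for a column stochastic matrix. It is an illustration, not the paper's code: the column-wise matrix storage and the helper names sample_from and run_trajectory are ours, and the O(n) sampling used here is replaced by Walker's O(1) method below.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draws a random index from the discrete distribution p (a stochastic
// vector). Illustrative O(n) helper; Walker's alias method (see below)
// does the same in O(1) after O(n) preprocessing.
std::size_t sample_from(const std::vector<double>& p, std::mt19937& gen) {
    std::discrete_distribution<std::size_t> d(p.begin(), p.end());
    return d(gen);
}

// One Monte Carlo trajectory of estimator (5): j1 is sampled from b, and
// each subsequent index from the distribution given by the previously
// selected column. cols[j] is the j-th column of A; sums[k] (length n)
// accumulates realizations of A^{k+1} b over trajectories.
void run_trajectory(const std::vector<std::vector<double>>& cols,
                    const std::vector<double>& b, int m,
                    std::vector<std::vector<double>>& sums,
                    std::mt19937& gen) {
    std::size_t j = sample_from(b, gen);           // j1 ~ p1 = b
    for (int k = 0; k < m; ++k) {
        const std::vector<double>& col = cols[j];  // realization A_{j_{k+1}}
        for (std::size_t i = 0; i < col.size(); ++i)
            sums[k][i] += col[i];
        j = sample_from(col, gen);                 // next index ~ A_{j_k}
    }
}
```

Averaging sums[k] over N independent trajectories then yields the estimator of $A^{k+1} b$.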
According to [22], in the case of an arbitrary vector b, the kth iteration $A^k b$ is estimated as follows:
$$A^k b = \mathbb{E}\left\{\frac{b_{j_1}}{p_0(j_1)}\, A_{j_k} \;\middle|\; p_k = A_{j_{k-1}}, \ldots, p_2 = A_{j_1},\ j_1 \sim p_0\right\}, \qquad (6)$$
where $p_0$ is an arbitrary probability distribution of indices $i = 1, 2, \ldots, n$ satisfying the condition $p_0(i) \neq 0$ if $b_i \neq 0$, $i = 1, \ldots, n$. Here $p_0(i)$ is the ith element of the vector $p_0$, and the index $j_1$ is a random index sampled from the distribution $p_0$.
A random index is sampled from the discrete distribution given by a stochastic vector p using Walker's alias method [23]. This method draws a sample in O(1) time but requires constructing alias tables, which takes O(n) time [24]. The algorithm includes a preprocessing step in which two tables are computed for the discrete distribution p: a probability table and an alias table.
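A compact sketch of Walker's method (in Vose's variant of the table construction), assuming the input p is a stochastic vector; the struct and member names are ours:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Alias table for O(1) sampling from a discrete distribution p,
// built in O(n) preprocessing time.
struct AliasTable {
    std::vector<double> prob;        // probability table
    std::vector<std::size_t> alias;  // alias table

    explicit AliasTable(const std::vector<double>& p)
        : prob(p.size()), alias(p.size()) {
        const std::size_t n = p.size();
        std::vector<double> scaled(n);
        std::vector<std::size_t> small, large;
        for (std::size_t i = 0; i < n; ++i) {
            scaled[i] = p[i] * n;
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        while (!small.empty() && !large.empty()) {
            std::size_t s = small.back(); small.pop_back();
            std::size_t l = large.back(); large.pop_back();
            prob[s]  = scaled[s];
            alias[s] = l;
            scaled[l] -= 1.0 - scaled[s];  // donate mass of l to cell s
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (std::size_t i : small) prob[i] = 1.0;  // numerical leftovers
        for (std::size_t i : large) prob[i] = 1.0;
    }

    // Draw an index in O(1): pick a cell uniformly, then flip a biased
    // coin between the cell itself and its alias.
    std::size_t sample(std::mt19937& gen) const {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::size_t i = static_cast<std::size_t>(u(gen) * prob.size());
        return (u(gen) < prob[i]) ? i : alias[i];
    }
};
```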

Random vector estimators for arbitrary positive matrix iterations
In the case of an arbitrary positive matrix or an irreducible nonnegative matrix, such a matrix can be transformed into a column stochastic matrix [25].
Let us consider a nonnegative matrix A which has a positive eigenvalue $\lambda$ and a positive eigenvector z of the transposed matrix $A^T$ ($A^T z = \lambda z$). Let Z be a diagonal matrix defined by $Z = \mathrm{diag}\{z_1, \ldots, z_n\}$, where $z_i$, $i = 1, \ldots, n$ are the positive components of the eigenvector z. Then the matrix $S = \frac{1}{\lambda} Z A Z^{-1}$ is a column stochastic matrix [22].
The maximum eigenvalue of the matrix $A^T$ calculated by the power method [26] is taken as $\lambda$. First, we choose an initial vector $r_0 \in \mathbb{R}^n$, which must be a non-zero vector. Then we compute the sequence $r_k$ given by
$$r_{k+1} = \frac{A^T r_k}{\| A^T r_k \|} \qquad (7)$$
until the convergence criterion is satisfied. If the convergence criterion is satisfied, the eigenvector corresponding to the maximum eigenvalue is obtained as $z = r_k$. The maximum eigenvalue is calculated using the Rayleigh quotient
$$\lambda = \frac{(A^T z, z)}{(z, z)}. \qquad (8)$$
The kth iteration of the original matrix A is expressed through the iteration of the stochastic matrix S as follows:
$$A^k = \lambda^k Z^{-1} S^k Z. \qquad (9)$$
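A sketch of this power iteration under the assumptions of the earlier sketches (dense row-major storage); the Euclidean norm and the stopping tolerance are illustrative choices, not necessarily the paper's:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Power method for the maximum eigenvalue of A^T and the corresponding
// eigenvector z. A is stored row-major, so column j of A is row j of A^T.
double power_method(const std::vector<std::vector<double>>& A,
                    std::vector<double>& z, double tol = 1e-10) {
    const std::size_t n = A.size();
    std::vector<double> r(n, 1.0 / std::sqrt(double(n)));  // r0 != 0
    std::vector<double> y(n);
    double lambda = 0.0, lambda_prev;
    do {
        lambda_prev = lambda;
        for (std::size_t j = 0; j < n; ++j) {      // y = A^T r
            y[j] = 0.0;
            for (std::size_t i = 0; i < n; ++i) y[j] += A[i][j] * r[i];
        }
        lambda = 0.0;                               // Rayleigh quotient (8),
        for (std::size_t j = 0; j < n; ++j)         // using ||r|| = 1
            lambda += y[j] * r[j];
        double norm = 0.0;                          // normalization step (7)
        for (double v : y) norm += v * v;
        norm = std::sqrt(norm);
        for (std::size_t j = 0; j < n; ++j) r[j] = y[j] / norm;
    } while (std::fabs(lambda - lambda_prev) > tol);
    z = r;
    return lambda;
}
```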

Solving a linear system
As mentioned above, our goal is to solve the linear system (1) of algebraic equations for the case where $\rho(A) < 1$. Here we assume that A is a positive or irreducible nonnegative matrix; then we transform the system (1) to a system with a column stochastic matrix, as specified in [22].
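This transformation can be written out in one line using the matrices Z and S from Sect. 2.2 (a reconstruction from the definition $S = \frac{1}{\lambda} Z A Z^{-1}$): multiplying (1) by Z gives
$$Zx = ZAx + Zb = \lambda \left(\tfrac{1}{\lambda} Z A Z^{-1}\right) Zx + Zb = \lambda S\,(Zx) + Zb,$$
so the vector $Zx$ satisfies a system with the column stochastic matrix S and the right-hand side $Zb$. Since $\rho(\lambda S) = \rho(A) < 1$, the Neumann series for the transformed system converges.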
Thus, we transform the matrix A into a column stochastic matrix S and compute the iterations $(\lambda S)^k \bar{b}$, $k = 1, 2, \ldots$; in this case, the random vector estimator (6) described in Sect. 2.1 is multiplied by the factor $\lambda^k$. The transformed system has the form
$$\bar{x} = \lambda S \bar{x} + \bar{b}, \qquad \bar{x} = Zx, \quad \bar{b} = Zb. \qquad (10)$$
By analogy with the solution (2) of the system (1), the solution of the system (10) can be represented as a Neumann series
$$\bar{x} = \sum_{k=0}^{\infty} (\lambda S)^k \bar{b}. \qquad (11)$$
Thus, the algorithm for calculating the solution of the system (1) is as follows.
The number of iterations m depends on the target accuracy $\varepsilon$ of the solution x. According to [22], $m \sim |\log(\varepsilon)|$.
Here the calculation of matrix iterations is given in general terms. The implementation of step 6 will be described below.

Sequential implementation of the algorithm
According to [22] and Sect. 2, the random vector algorithm computes the estimator of the matrix iteration $A^k b$ as the arithmetic mean of matrix columns whose indices are chosen at random from the distribution $p_k = A_{j_{k-1}}$, where $A_{j_{k-1}}$ is the column chosen at random in the previous iteration. On each Monte Carlo trajectory, we start with a random index $j_1$ chosen from the distribution b and contribute to the estimator of $Ab$; then we sample index $j_2$ from the distribution $A_{j_1}$ and contribute to the estimator of $A^2 b$; next we sample index $j_3$ from the distribution $A_{j_2}$ and contribute to the estimator of $A^3 b$, etc.
Thus, when executing the random vector algorithm, a sequence of random indices is generated for columns of the matrix A. The choice of columns depends on the probability distribution determined by the columns of the matrix A and the vector b . Therefore, to reduce the number of accesses to the matrix, we pre-generate a sequence of indices for each column. At the beginning of the algorithm, we generate q random indices for each column. These indices are then consumed by the algorithm to compute the matrix iterations. When the indices for any column run out, the next q indices are sampled from that column's distribution.
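A sketch of this bookkeeping, reusing the illustrative AliasTable above; the buffer layout and the names I, next, and take are ours:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Pre-generated index buffers: I[j] holds q random indices drawn from the
// distribution given by column j; next[j] counts how many were consumed.
// tables[j] is the AliasTable of column j (see the sketch above).
struct IndexBuffers {
    std::vector<std::vector<std::size_t>> I;
    std::vector<std::size_t> next;
    std::size_t q;

    IndexBuffers(std::size_t n, std::size_t q_) : I(n), next(n, 0), q(q_) {}

    // Returns the next pre-generated index for column j, refilling the
    // buffer with q fresh samples when it runs out.
    std::size_t take(std::size_t j, const std::vector<AliasTable>& tables,
                     std::mt19937& gen) {
        if (next[j] == I[j].size()) {          // buffer exhausted: refill
            I[j].resize(q);
            for (std::size_t t = 0; t < q; ++t)
                I[j][t] = tables[j].sample(gen);
            next[j] = 0;
        }
        return I[j][next[j]++];
    }
};
```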
As mentioned earlier, when the original matrix A is transformed into the stochastic matrix S, the solution of the system is calculated by formula (11). The matrix iteration $(\lambda S)^k \bar{b}$ is calculated as the random vector estimator (6) multiplied by the factor $\lambda^k$. Thus, the solution (11) is calculated as the arithmetic mean of the random columns, each multiplied by the coefficient corresponding to the iteration number. To reduce the number of matrix accesses, instead of summing the column elements themselves (step 6 of Algorithm 1), we calculate the sum of the coefficients for each column. After accumulating the coefficients over N Monte Carlo trajectories, we multiply the resulting arithmetic mean of the coefficients by the column elements.
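The following sketch contrasts with run_trajectory above: step k of a trajectory now adds only the scalar $\lambda^k$ to the coefficient of the sampled column, and the column elements are touched once, after all N trajectories. It assumes $\bar{b}$ is a stochastic vector (for an arbitrary right-hand side the importance weights of (6) would be needed) and, for brevity, samples directly instead of consuming pre-generated index arrays:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Coefficient accumulation: per trajectory, step k adds lambda^k to the
// coefficient of the sampled column (O(1) instead of O(n)); the columns
// enter the solution only once, at the end. S_cols[j] is column j of S.
std::vector<double> solve_by_coefficients(
        const std::vector<std::vector<double>>& S_cols,
        const std::vector<double>& b_bar, double lambda,
        int m, long N, std::mt19937& gen) {
    const std::size_t n = b_bar.size();
    std::vector<double> c(n, 0.0);               // summed coefficients
    for (long t = 0; t < N; ++t) {
        std::size_t j = sample_from(b_bar, gen); // j1 ~ b_bar
        double w = lambda;                       // factor lambda^k
        for (int k = 1; k <= m; ++k) {
            c[j] += w;
            j = sample_from(S_cols[j], gen);
            w *= lambda;
        }
    }
    std::vector<double> y(b_bar);                // k = 0 term of (11)
    for (std::size_t j = 0; j < n; ++j)          // one pass over columns
        for (std::size_t i = 0; i < n; ++i)
            y[i] += (c[j] / N) * S_cols[j][i];
    return y;                                    // y ~ Zx; x = Z^{-1} y
}
```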

Parallel implementation of the algorithm
The vector estimators of the matrix iterations $A^k b$ are calculated as the mathematical expectation over N independent Monte Carlo trajectories. These calculations can be efficiently performed in parallel. The traditional method for implementing Monte Carlo algorithms in parallel is to distribute the independent random trajectories among the available supercomputer cores: the number of trajectories N is divided by the number of available cores c, and each core executes N/c trajectories.
We consider the case when the matrix is too large to fit in the memory of one node and its elements cannot be calculated by any formula. The random vector algorithm that calculates the matrix iterations operates on matrix columns. Therefore, the matrix is divided into parts by columns.
We use two approaches to implement the algorithm for solving a linear system with a large matrix:
1. partitioning the matrix among the available nodes such that each part of the matrix is placed in the RAM of one node (hereinafter "RAM");
2. storing the whole matrix on the hard disk, loading it into RAM, and processing the matrix in parts (hereinafter "File").
The MPI standard is used for the parallel implementation of the algorithm.
In the first case, the matrix is divided into parts, which are distributed among $n_{mpi}$ MPI processes in the memory of the available nodes. The number of MPI processes per node is chosen so that the node's RAM is sufficient to accommodate the matrix parts of all the processes running on the node. Each MPI process transposes its part of the matrix.
A parallel version of the power method is implemented, which calculates the maximum eigenvalue and the corresponding eigenvector. Each MPI process transforms its part of the matrix into a column stochastic matrix and calculates the probability distribution $p_0$. It then computes probability tables and alias tables for Walker's method for its local columns. Using these tables, the MPI process generates $q \cdot n_{mpi}$ random indices for each local column, puts q indices into the local array I, and sends q indices to each of the other MPI processes. Thus, each MPI process receives an array of indices for all columns of the matrix. The next step is to compute the main loop over N trajectories (step 9 of Algorithm 2). Each MPI process calculates coefficient values c for $N/n_{mpi}$ independent trajectories, using its own array of random indices. When the indices for some column run out, a query is sent to the MPI process that stores the needed column. Having received such a request, the MPI process generates and sends an array of indices for the desired column using a separate thread. If there is no request, this thread is idle. The average value of the coefficients c is calculated by the reduction operation MPI_Allreduce over all MPI processes (step 10 of Algorithm 2). Next, each MPI process multiplies its local columns $S_i$ by the computed coefficient values c and sums them into the local vector y. One of the MPI processes calculates the total value of the vector y using the reduction operation MPI_Reduce and performs steps 13 and 14 of Algorithm 2.
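A minimal sketch of this reduction phase, assuming the coefficient sums of a process are kept in a full-length vector c and the process owns the column range [first_col, first_col + S_local.size()); all names are ours:

```cpp
#include <mpi.h>

#include <cstddef>
#include <vector>

// Reduction phase of the "RAM" implementation (illustrative sketch).
// c holds this process's coefficient sums over its N/n_mpi trajectories;
// S_local holds the locally stored columns of S.
void reduce_solution(std::vector<double>& c,  // length n
                     const std::vector<std::vector<double>>& S_local,
                     std::size_t first_col, std::size_t n, long N,
                     std::vector<double>& y_root) {
    // Average the coefficients over all trajectories of all processes.
    MPI_Allreduce(MPI_IN_PLACE, c.data(), int(n), MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    for (std::size_t j = 0; j < n; ++j) c[j] /= double(N);

    // Each process multiplies its local columns by their coefficients...
    std::vector<double> y_local(n, 0.0);
    for (std::size_t jl = 0; jl < S_local.size(); ++jl)
        for (std::size_t i = 0; i < n; ++i)
            y_local[i] += c[first_col + jl] * S_local[jl][i];

    // ...and rank 0 assembles the total vector y.
    y_root.assign(n, 0.0);
    MPI_Reduce(y_local.data(), y_root.data(), int(n), MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
}
```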
In the second case, the entire matrix is stored on the local hard drives of the nodes. Each MPI process simultaneously reads its local file and loads the matrix piece by piece into its RAM area. As in the first case, the number of MPI processes per node must be such that the RAM of the node is sufficient to store the matrix parts of all MPI processes running on the node. All operations with the matrix (transposition, finding the maximum eigenvalue, transformation to the column stochastic matrix, and calculation of random indices for columns) are performed in parts. That is, a part of the matrix is read from the file into RAM, the necessary operation is performed on it, a partial result is stored, and then the next part of the matrix is read from the file and stored in RAM in place of the first part, and so on. In this way, the array I of indices is computed locally for each column of the matrix. Note that here the random index is chosen according to the probability distribution using the rejection method [27] rather than Walker's alias method, to save RAM space so that as large a matrix part as possible can be stored. Each MPI process independently computes coefficient values c for $N/n_{mpi}$ trajectories using its local array I of random indices. When the indices for any column run out, the desired column is read from the file, and the next q indices are generated from the distribution for that column and stored in the array I.
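A sketch of the rejection step, assuming the column p is a stochastic vector and its maximum element pmax is known (e.g., recorded when the column part is read from the file); the names are ours:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Rejection method for sampling an index from the distribution given by a
// column p. No auxiliary tables are needed, which saves RAM compared to
// Walker's method; pmax = max_i p[i].
std::size_t sample_rejection(const std::vector<double>& p, double pmax,
                             std::mt19937& gen) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    const std::size_t n = p.size();
    for (;;) {
        std::size_t i = static_cast<std::size_t>(u(gen) * n);  // candidate
        if (u(gen) * pmax < p[i]) return i;  // accept with prob p[i]/pmax
    }
}
```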
In addition, two other parallel programs are implemented using the OpenMP API for both cases of matrix storage. The main loop over N trajectories (step 9 of Algorithm 2) is parallelized in these programs. Each thread generates its own array I of random column indices and computes $N/n_{th}$ independent Monte Carlo trajectories, where $n_{th}$ is the number of threads.
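A sketch of the OpenMP variant of the trajectory loop, using the helpers introduced above; in the real program each thread consumes its pre-generated index arrays, which we replace here by direct sampling for brevity:

```cpp
#include <omp.h>

#include <cstddef>
#include <random>
#include <vector>

// OpenMP version of the main loop over N trajectories. Each thread owns
// its random number generator and accumulates coefficients locally; the
// partial sums are merged under a critical section.
void trajectories_omp(const std::vector<std::vector<double>>& S_cols,
                      const std::vector<double>& b_bar, double lambda,
                      int m, long N, std::vector<double>& c) {
    const std::size_t n = b_bar.size();
    c.assign(n, 0.0);
    #pragma omp parallel
    {
        std::mt19937 gen(12345u + omp_get_thread_num());  // per-thread seed
        std::vector<double> c_loc(n, 0.0);
        #pragma omp for
        for (long t = 0; t < N; ++t) {
            std::size_t j = sample_from(b_bar, gen);  // see sketch above
            double w = lambda;
            for (int k = 1; k <= m; ++k) {
                c_loc[j] += w;
                j = sample_from(S_cols[j], gen);
                w *= lambda;
            }
        }
        #pragma omp critical
        for (std::size_t i = 0; i < n; ++i) c[i] += c_loc[i];
    }
}
```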

Efficiency of parallel implementations
Here, we analyze the efficiency of the parallel implementations of the algorithm for solving the linear system $Mx = b$ for M-matrices. An M-matrix is a real square non-singular matrix M such that $m_{ij} \leq 0$ for $i \neq j$ and M is monotone, i.e., $M^{-1} \geq 0$ [22]. Here the M-matrix is constructed as $M = uE - A$, where E is the identity matrix and $A \geq 0$ is a random matrix with $a_{ij} = 0$ for $i = j$ and $a_{ij} = \mathrm{rand}(0, 1)$ for $i \neq j$. The value u is equal to $u = \rho(A) + 1$, where $\rho(A)$ is the spectral radius of the matrix A. Thus, the system $Mx = b$ is transformed to the system (1) of the form $x = \frac{1}{u} A x + \frac{1}{u} b$, whose iteration matrix $\frac{1}{u} A$ has spectral radius less than unity.
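A sketch of how the test system can be assembled, reusing the power_method sketch from Sect. 2.2 (for a nonnegative matrix, $\rho(A) = \rho(A^T)$, so the dominant eigenvalue of $A^T$ serves as the spectral radius):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Test M-matrix from this section: M = uE - A with a_ij ~ U(0,1) off the
// diagonal, a_ii = 0, and u = rho(A) + 1. The system Mx = b is then
// rewritten as x = (1/u) A x + (1/u) b, i.e., system (1) with matrix A/u.
void make_test_system(std::size_t n, std::mt19937& gen,
                      std::vector<std::vector<double>>& A, double& u) {
    std::uniform_real_distribution<double> rnd(0.0, 1.0);
    A.assign(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            if (i != j) A[i][j] = rnd(gen);
    std::vector<double> z;
    u = power_method(A, z) + 1.0;  // u = rho(A) + 1 via the power method
    // The iteration matrix A/u has spectral radius rho(A)/(rho(A)+1) < 1,
    // so the Neumann series (2) converges for the transformed system.
}
```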
The "Broadwell" partition of the "NKS-1P" cluster of the Siberian Supercomputer Center of the Siberian Branch of the Russian Academy of Sciences [28] is used for the computations. Each "Broadwell" node contains two Intel Xeon E5-2697A v4 processors (2.6 GHz, 16 cores, 40 MB Intel Smart Cache) with hyper-threading enabled (32 cores, 64 threads per node). The RAM of a node is 128 GB.
In this paper, we mainly parallelize the loop over Monte Carlo trajectories (step 9 of Algorithm 2), because this loop takes most of the computation time. Therefore, we measure the computation time and analyze the speedup of the parallelization of this loop. We analyze the quality of the parallel programs using MPI and the OpenMP API separately. To evaluate the quality of a parallel implementation, we measure the speedup and efficiency of the parallel program. Speedup is calculated as $S(i) = T(1)/T(i)$, where $T(i)$ is the calculation time obtained when executing the program on i MPI processes or i threads. Efficiency is calculated as $Q(i) = S(i)/i$, where i is the number of MPI processes or threads.
The following system is taken for the tests: $A_{n \times n}$ is the random matrix described above, the size of the system is $n = 10^4$, the solution x of the system is taken as the unit vector, and the right-hand side is equal to $b = (uE - A)x$, where $u = \rho(A) + 1$. Computer experiments show that for such a system the number of iterations m and the number of trajectories N must be at least $m = 2n$ and $N = 10^5$ to obtain a solution with a relative error of about 0.03 using the random vector algorithm.
The size q of the array $I_i$ of random indices pre-generated for each column $i = 1, 2, \ldots, n$ is an implementation parameter. The larger it is, the less frequent are the data transfers in the "RAM" implementation and the memory accesses in the "File" implementation and, accordingly, the shorter is the computation time. Therefore, the array size q should be taken as large as possible, taking into account the memory constraints on the nodes. In the experiments below, we take $q = 10^4$.

Table 1 shows the calculation times for the "RAM" and "File" implementations using MPI and the OpenMP API. The first row of the table gives the number $n_{mpi}$ of MPI processes for the MPI programs and the number $n_{th}$ of threads for the OpenMP programs. The last two columns show the execution time when using two nodes (denoted as (2)), as opposed to the remaining columns, which show the execution time on a single node.

The "RAM" implementation is faster than "File" on a single node, as expected. The minimum computation time (9.8 seconds, highlighted in bold in Table 1) is achieved for "RAM" when using two nodes with 16 MPI processes per node. The execution time of "RAM" when using the OpenMP API is less than when using the MPI standard for the same number of cores. Presumably, this is because in the OpenMP program the matrix and the tables for Walker's method are stored in shared memory and each thread generates its local arrays of column indices without data transfer, unlike the MPI program, where each MPI process requests and receives the necessary data from the other MPI processes. This overhead, even within a single node, increases the running time of the program. For the "File" implementation, the execution time of the MPI program is shorter than that of the OpenMP program. Presumably, this can be explained by synchronization delays between threads when accessing shared data.

Hyper-threading is enabled on the "Broadwell" nodes, with 64 threads available on each node. However, the computation time for the "RAM" and "File" implementations increases when executed on a single node with 64 MPI processes or threads (column "64" in Table 1) compared to $n_{mpi} = 32$ and $n_{th} = 32$ per node (column "32" in Table 1). A particularly sharp increase in time is observed for "RAM" with 64 MPI processes. This can be explained by the fact that this implementation uses an additional thread for each MPI process to generate and send random column indices. Thus, 32 MPI processes use 64 hardware threads, which is the maximum number of MPI processes to execute "RAM" on a "Broadwell" node.

Figure 1 shows the calculation time, speedup, and efficiency of the two parallel implementations of the algorithm using MPI. The parallel programs are executed on the "Broadwell" nodes with different numbers of MPI processes. We use a maximum of 32 MPI processes per node; the programs with 64 MPI processes use two nodes. The calculation time of the "RAM" implementation is shorter than that of "File" (see the lines "RAM, MPI" and "File, MPI" in Table 1). However, the speedup and efficiency are higher for the "File" implementation. The rapid decrease in speedup and efficiency for "RAM" can be explained by the overhead of generating and transferring index arrays. In the current version of the program, a process that runs out of column indices sends a query and waits for a response. In the future we plan to implement generation and data transfers concurrently with the calculations.
Figure 2 shows the calculation time, speedup, and efficiency for the "RAM" and "File" parallel implementations using the OpenMP API. In this case the parallel programs are executed on a single node of the "Broadwell" partition with different numbers of threads, $n_{th} = 1, 2, \ldots, 64$. As in the first case, the calculation time of the "RAM" implementation is less than that of "File" (see the lines "RAM, OpenMP" and "File, OpenMP" in Table 1). The speedup and efficiency are similar for both parallel implementations. As noted above, executing the programs in 64 threads on a single node is inefficient: the computation time increases and the speedup drops dramatically.
Thus, the "RAM" implementation solves the considered system of size $10^4$ faster than the "File" implementation.
Next, we consider the system of size $n = 10^5$. In the "RAM" implementation, each MPI process stores in its memory a part of the source matrix, the tables for Walker's method, and an array of column indices. Thus, as many processes can be run on a node as the node's RAM can accommodate. It takes 190 GB of memory to store all the necessary data, which is more than the memory capacity of a "Broadwell" node. That is, this problem cannot be solved on a single node using the "RAM" implementation, so it is solved using multiple nodes. On the one hand, the more processes are used, the less computation time is needed. On the other hand, each process additionally stores its own array of indices, so the more processes are used, the more memory is required. Thus, for each system size, the maximum number of MPI processes whose data can be stored in the node memory should be determined. We found that the maximum number of MPI processes for the system of size $n = 10^5$ is 16. The calculation time for this problem is 330 seconds using 16 MPI processes on two nodes. The time to solve this problem with the "File" implementation using the same computing resources (16 MPI processes and two nodes) is 2905 seconds, that is, 8.8 times longer than "RAM".
Thus, the parallel program "RAM" using OpenMP threads is the fastest one we have implemented. However, it is limited by the memory of a single node. For a "Broadwell" node, the maximum system size that "RAM" can solve with OpenMP threads at the index array size $q = 10^4$ is $n = 56000$. The parallel program "RAM" using MPI is slightly slower than the program using OpenMP on the same resources, but it allows solving larger problems. For example, the maximum system size that "RAM" can solve using two "Broadwell" nodes with $q = 10^4$ is $n = 1.1 \cdot 10^5$. The "File" implementation is slower than "RAM" but can solve larger systems still. For example, the maximum system size that "File" can solve with the index array size $q = 10^4$ is $n = 3.4 \cdot 10^6$. However, it would take a very long time to solve this problem (presumably about a month). Therefore, we plan to speed up the program.

Conclusion
The purpose of this paper was to implement in parallel the randomized vector algorithm for solving large systems of linear equations. The stochastic algorithm calculates the solution by sampling the columns of the matrix according to the probability distributions given by these columns. We have used two approaches to implement the algorithm for solving a linear system with a large matrix. In the first approach ("RAM"), the matrix is divided into parts, which are distributed among MPI processes in the memory of the available nodes. In the second approach ("File"), the matrix is stored in a file on each node's hard drive, loaded into RAM, and processed in parts. MPI and the OpenMP API are used for the parallel implementations of both approaches. Independent Monte Carlo experiments for random column indices are distributed among MPI processes or OpenMP threads. Each MPI process and each thread pre-generates an array of random column indices for each column of the matrix and consumes these indices during the calculation. Thus, the number of processes and threads should be chosen so that all their data fit in the memory of the available nodes. We analyzed the efficiency of the parallel implementations for systems governed by dense matrices of size $10^4$ and $10^5$. The "RAM" implementation solves the systems in question faster than the "File" implementation; thus, "RAM" is preferable for solving large linear systems if the matrix can be partitioned to fit in the nodes' RAM. The developed implementations have limitations on the size of the problem due to the limited memory of a node. In the future, we plan to develop algorithms that overcome these restrictions. In addition, we plan to speed up the "RAM" implementation using a combination of MPI and OpenMP, as well as by overlapping data transfers with computations.