Efficient parallel kernel based on Cholesky decomposition to accelerate multichannel nonnegative matrix factorization

Multichannel source separation has been a popular topic, and recently proposed methods based on the local Gaussian model have provided promising results, despite their high computational cost when several sensors are used. This drawback limits the practical application of this approach in tasks such as sound field reconstruction or virtual reality. In this study, we present a numerical approach that reduces the complexity of multichannel nonnegative matrix factorization (MNMF) for audio source separation in scenarios with a large number of sensors, such as high-order ambisonics encoding. In particular, we propose a parallel driver to compute the multiplicative update rules in MNMF approaches. It is designed to run on both sequential and multicore computers, as well as on graphics processing units (GPUs). The solution reduces the computational cost of the multiplicative update rules by using the Cholesky decomposition and solving several triangular systems of equations. The driver was evaluated in different scenarios, with promising results in terms of execution times on both CPU and GPU. To the best of our knowledge, this proposal is the first to address the problem of reducing the computational cost of full-rank MNMF-based systems using parallel and high-performance techniques.


Introduction
Nonnegative matrix factorization (NMF) is a powerful tool for dimensionality reduction in audio processing. It decomposes a nonnegative data matrix (i.e., a matrix with nonnegative entries) into a product of two lower-rank matrices, which allows the data matrix to be approximated as a sum of rank-1 nonnegative matrices. This technique is popular because of its universality, as it can be applied to a wide range of audio sources (speech, music, environmental sounds), and because of its flexibility, since it can be adapted to various constraints such as harmonicity of the spectral patterns, smoothness of the activation coefficients, and pretrained spectral patterns.
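As a minimal illustration of this decomposition (an illustrative sketch, not code from this study), the following Python snippet factorizes a small nonnegative matrix with the classical Lee–Seung multiplicative updates for the Euclidean cost, which are not necessarily the update rules used later in this paper:

```python
import numpy as np

# Factorize a nonnegative F x T matrix V into W (F x K) and H (K x T).
rng = np.random.default_rng(0)
F, T, K = 20, 30, 4
V = rng.random((F, K)) @ rng.random((K, T))   # nonnegative, rank <= K

W = rng.random((F, K)) + 1e-3                 # random nonnegative init
H = rng.random((K, T)) + 1e-3
eps = 1e-12                                   # avoids division by zero
for _ in range(1000):
    H *= (W.T @ V) / (W.T @ W @ H + eps)      # update activation gains
    W *= (V @ H.T) / (W @ H @ H.T + eps)      # update basis functions

# V is representable exactly since rank(V) <= K, so the fit is close
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
assert err < 0.15 and (W >= 0).all() and (H >= 0).all()
```

Because the updates are multiplicative, nonnegativity of W and H is preserved at every iteration, which is the property that makes this scheme attractive for spectrogram modeling.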
However, standard NMF is not suitable for handling multichannel data. Today, multichannel audio is becoming increasingly prevalent, with technologies such as distributed microphone arrays and ambisonic microphones widely adopted for 360-degree videos and virtual reality (VR) experiences. Additionally, the object-based audio format is widely used and regarded as a standard VR format. It consists of pairs of object metadata and the corresponding waveform signals. Placing an audio object anywhere in space provides listeners with an immersive experience that can serve as the audio part of the six-degrees-of-freedom (6DoF) system proposed in MPEG-I [1]. Therefore, conversion from the high-order ambisonics (HOA) format to the object format is highly desirable, although it requires further processing of the captured signals.
To address the limitations of standard NMF in handling multichannel data, various extensions have been proposed, such as stacking the channels into a single matrix structure [2] or considering a parallel factor (PARAFAC) model [3]. However, these methods primarily focus on amplitude information and do not fully leverage the spatial information present in microphone array recordings. To address this issue, the beamspace data model proposed by Lee et al. [4] projects the input signal onto a set of steered directions and exposes the inherent phase-difference information present in these recordings. Another approach, proposed in [5], projects the signals from a uniform linear array onto the ray space (RS) domain, where spatial information is also encoded in the magnitude information. Other methods based on the RS domain incorporate spatial information using plenacoustic functions [6, 7]. An alternative way to model spatial information is to use signal representations based on the spherical harmonic (SH) domain [8, 9] or the spatial covariance matrix (SCM) [10].
Multichannel NMF (MNMF) modeling considers the complex-valued short-time Fourier transform (STFT) coefficients as realizations of zero-mean circular complex Gaussian random variables with structured variances (using NMF to model the signal power spectral density (PSD)) and covariances [11]. However, MNMF suffers from strong sensitivity to parameter initialization and from high computational costs. The sensitivity to initialization can be mitigated by constraining the mixing model to a weighted combination of fixed spatial kernels steering toward a subset of possible spatial directions [10, 12] or by initializing the source PSD using single-channel deep learning techniques [13, 14]. The high computational cost limits the practical use of this method for tasks involving a large number of sensors, such as dereverberation [15], sound field reconstruction [16, 17], or navigation for augmented/extended reality [18]. The main reason is the large number of matrix inversions required during the parameter estimation procedure [19].
To mitigate this drawback, several diagonalization-based methods have been proposed to provide computationally efficient solutions [9, 13, 17, 20–22]. However, these methods are limited to a certain array setup [22], retain only the diagonal values of the SCMs [9], or rely on the statistical independence between the sources to derive the spatial characteristics [13, 21].
However, to the best of our knowledge, the optimization of the original full-rank solution has not been studied in the literature. In this study, we evaluate the efficiency of the original full-rank MNMF in [23] and develop a novel approach based on the Cholesky decomposition that enables MNMF extensions for scenarios involving a large number of sensors. We focus specifically on the application of the MNMF model in the HOA domain for formats A and B, which are the most widely adopted sound field representations. We note that the Cholesky decomposition is a powerful tool for efficiently solving linear systems and inverting matrices. In signal processing, it has been used in a variety of applications, such as instantaneous fundamental frequency estimation [24], speech enhancement [25], and audio source separation [26, 27]. In addition, it has been used in combination with other techniques, such as NMF, to reduce the dimensionality of data [27–29]. However, this is the first study in which the Cholesky decomposition is specifically used to accelerate the parameter estimation procedure in MNMF-based models.
In particular, we propose a parallel multiarchitecture driver to compute the multiplicative update rules in MNMF. The proposed driver is designed for sequential and multicore computers as well as graphics processing units (GPUs). It can be used from numerical computing environments such as MATLAB or GNU Octave through MEX (MATLAB Executable) interfaces.
The remainder of this paper is organized as follows. In Sect. 2, we review some of the most common mixture model representations and present the problem formulation. The proposed approach is presented in Sect. 3. In Sect. 4, the evaluation setup is described and the proposed system is evaluated in different scenarios. Finally, Sect. 5 concludes the paper.

Background
In this section, we review some of the most common mixture model representations reported in the literature. The foundation of multichannel NMF is then introduced, starting from the local Gaussian model (LGM) formulation, to derive the update rules.

Microphone domain
The source separation problem involves estimating the contribution $s_{j,t} \in \mathbb{R}^I$ of each source $j = 1, \ldots, J$ in each microphone $i = 1, \ldots, I$ at each time instant $t = 1, \ldots, T$. In the absence of noise, the mixture can be written as follows:

$$y_t = \sum_{j=1}^{J} s_{j,t}, \qquad (1)$$

where $y_t = [y_{1,t}, \ldots, y_{I,t}]^{\mathsf{T}} \in \mathbb{R}^I$ are the microphone array signals. Under reverberant conditions and assuming the point source hypothesis, the source signal $\tilde{s}_{j,t}$ can be related to its contribution $s_{j,t}$ through the following:

$$s_{i,j,t} = \left(a_{ij} * \tilde{s}_j\right)_t,$$

where $*$ denotes the convolution product and $a_{ij}$ denotes the impulse response of the mixing filter between source $j$ and microphone $i$.
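This convolutive mixing model can be sketched numerically. The snippet below (an illustration with hypothetical random filters, not data from the paper) builds the spatial image of each source at each microphone by convolution and sums the images into the microphone signals:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, T, La = 3, 2, 256, 16           # mics, sources, samples, filter taps
s = rng.standard_normal((J, T))       # dry source signals (hypothetical)
a = rng.standard_normal((J, I, La))   # mixing filters a_ij (hypothetical)

# Spatial image of source j at mic i: convolution of a_ij with the source
images = np.zeros((J, I, T))
for j in range(J):
    for i in range(I):
        images[j, i] = np.convolve(s[j], a[j, i])[:T]

# Eq. (1): the noiseless mixture is the sum of the source images
y = images.sum(axis=0)
assert y.shape == (I, T)
```

Separation then amounts to recovering the `images` tensor (or the dry sources) from `y` alone, which is the task the LGM formulation below addresses statistically.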

HOA domain
In the HOA domain, any plane wave can be characterized by a sound signal $s_t$ and its direction of arrival $\Omega$. The unit vector $\Omega$ indicates the direction of arrival of the plane wave (origin of the sound source). This vector can be decomposed into spherical coordinates as $\Omega = (\cos\theta\cos\varphi,\; \sin\theta\cos\varphi,\; \sin\varphi)$, where $\theta$ is the azimuth and $\varphi$ is the elevation of the sound source. The SH component gains for a source in the direction $(\theta, \varphi)$ can be expressed as follows:

$$Z_{nm}(\theta, \varphi) = N_{n}^{|m|}\, P_{n}^{|m|}(\sin\varphi) \begin{cases} \cos(m\theta), & m \geq 0, \\ \sin(|m|\theta), & m < 0, \end{cases}$$

where the SH order and degree are denoted by $n$ and $m$, respectively, $P_{n}^{|m|}$ is the associated Legendre function of degree $n$, and $N_{n}^{|m|}$ is a normalization term. An ambisonics representation of maximum order $N$ has $i = 1, \ldots, I$ channels (SH signals), with $I = (N + 1)^2$. The SH components are generally ordered using the ACN ambisonics channel ordering¹ as a vector $z(\Omega) \in \mathbb{R}^I$ containing each $Z_{nm}(\theta, \varphi)$.
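The channel count and the ACN ordering follow simple closed forms (the ACN index of the component with order n and degree m is n² + n + m); a small sketch:

```python
def acn_index(n: int, m: int) -> int:
    """ACN channel index of the SH component with order n, degree m."""
    assert -n <= m <= n
    return n * n + n + m

def num_channels(N: int) -> int:
    """Number of SH channels for an ambisonics signal of maximum order N."""
    return (N + 1) ** 2

assert num_channels(1) == 4            # first-order B-format: W, Y, Z, X
assert num_channels(4) == 25           # fourth-order HOA
assert acn_index(0, 0) == 0
assert [acn_index(1, m) for m in (-1, 0, 1)] == [1, 2, 3]
```

The quadratic growth of the channel count with the ambisonics order is precisely what drives the computational burden addressed in Sect. 3.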
For a general set of multiple localized sources (multiple plane waves) with signals $s_{j,t}$ originating from directions $\Omega_j$, the anechoic ambisonics mixture $y_t \in \mathbb{R}^I$ can be expressed as follows:

$$y_t = \sum_{j=1}^{J} z(\Omega_j)\, s_{j, t - \tau_j},$$

where $\tau_j$ are the direct-path delays. As explained in [30], reverberant conditions can be modeled by considering an image-source model with $L$ images, modeling reflections and late reverberation (up to a desired limit). Thus, the mixture model can be expressed as a function of the propagation delay $\tau_{jl}$ and the propagation filter $h_{jl}(\tau)$, which model the absorption and attenuation corresponding to each $l$th image of each $j$th source. The resulting reverberant model is expressed as follows:

$$y_t = \sum_{j=1}^{J} \sum_{l=1}^{L} \sum_{\tau=0}^{\Omega-1} h_{jl}(\tau)\, z(\Omega_{jl})\, s_{j, t - \tau}, \qquad (2)$$

where $\Omega$ is the chosen maximum sample length for the ambisonics impulse responses. For causality, the first nonzero index of $h_{jl}(\tau)$ is set to $\tau = \tau_{jl}$.

Local Gaussian model
The model assumes that the $I$-channel STFT vector of the $j$th source can be modeled as a zero-mean multivariate complex Gaussian, as follows:

$$s_{j,ft} \sim \mathcal{N}_{\mathbb{C}}\!\left(0,\; R_{j,ft}\right),$$

where $s_{j,ft} \in \mathbb{C}^I$ denotes the spatial image of the $j$th source in the STFT domain, $R_{j,ft} \in \mathbb{C}^{I \times I}$ denotes the covariance matrix of the complex Gaussian distribution $\mathcal{N}_{\mathbb{C}}$, $f$ is the frequency bin index, and $t$ is the time frame index.
We represent the spatial image of a mixture of multiple sources, $y_{ft} \in \mathbb{C}^I$, as the sum of complex Gaussians, as follows:

$$y_{ft} = \sum_{j=1}^{J} s_{j,ft} \sim \mathcal{N}_{\mathbb{C}}\!\left(0,\; R_{ft}\right),$$

where $R_{j,ft} \in \mathbb{C}^{I \times I}$ denotes the SCM of the $j$th source. Under the assumption that the sources are mutually independent, the SCM of the mixture, $R_{ft}$, can be modeled as the sum of the SCMs of all sources, as follows:

$$R_{ft} = \sum_{j=1}^{J} R_{j,ft},$$

and the log-likelihood of the spatial images $y_{ft}$ given the model parameters $\theta$ can be expressed as follows:

$$\log p\!\left(\{y_{ft}\} \mid \theta\right) = -\sum_{f,t} \left[\operatorname{tr}\!\left(\hat{R}_{ft}\, R_{ft}(\theta)^{-1}\right) + \log\det \pi R_{ft}(\theta)\right],$$

where the SCM of the mixture is modeled by $R_{ft}(\theta)$ and the parameter $\theta$ is defined in the next section. The maximization of this likelihood can be interpreted as the minimization of the log-determinant divergence between the empirical SCM, $\hat{R}_{ft} = y_{ft} y_{ft}^{\mathsf{H}}$, and the estimated SCM, $R_{ft}(\theta) \in \mathbb{C}^{I \times I}$:

$$C(\theta) = \sum_{f,t} D_{LD}\!\left(\hat{R}_{ft} \mid R_{ft}(\theta)\right),$$

where $C(\theta)$ denotes the cost function to be minimized with respect to the model parameters $\theta$, and $D_{LD}$ denotes the log-determinant divergence.
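For reference, the log-determinant divergence between two Hermitian positive-definite matrices can be evaluated as in the illustrative helper below, assuming the usual definition D_LD(X | Y) = tr(XY⁻¹) − log det(XY⁻¹) − I. Note that the rank-1 empirical SCM y yᴴ is singular, so full-rank matrices are used here purely for illustration:

```python
import numpy as np

def log_det_divergence(R_hat, R):
    """D_LD(R_hat | R) = tr(R_hat R^-1) - log det(R_hat R^-1) - I,
    assuming both arguments are Hermitian positive definite."""
    I = R.shape[0]
    M = np.linalg.solve(R, R_hat)            # R^{-1} R_hat, no explicit inverse
    _, logdet = np.linalg.slogdet(M)         # det(M) > 0 for HPD arguments
    return np.trace(M).real - logdet - I

def random_hpd(rng, I):
    A = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
    return A @ A.conj().T + I * np.eye(I)    # Hermitian positive definite

rng = np.random.default_rng(2)
R = random_hpd(rng, 4)
R_hat = random_hpd(rng, 4)

# The divergence is nonnegative and vanishes iff the matrices coincide
assert abs(log_det_divergence(R, R)) < 1e-8
assert log_det_divergence(R_hat, R) > 0
```

Minimizing the cost above therefore drives the model SCM toward the observed second-order statistics at every time-frequency point.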

Multichannel NMF (MNMF)
In [23], the authors proposed an MNMF framework in which the mixture SCM $R_{ft}$ is assumed to be positive-definite Hermitian and is modeled as a superposition of time-invariant SCMs $G_{fk} \in \mathbb{C}^{I \times I}$, coupled with a scale value $\sigma_{ft,k}$ that represents the PSD and is modeled using a classical NMF structure:

$$\sigma_{ft,k} = w_{fk}\, h_{kt},$$

where $k$ denotes the NMF component index, and $w_{fk}$ and $h_{kt}$ represent the basis functions and their corresponding time-varying gains, respectively. The mixture SCM $R_{ft}$ can then be expressed as follows:

$$R_{ft} = \sum_{k} G_{fk}\, w_{fk}\, h_{kt},$$

where the model parameters $\theta = \{G_{fk}, w_{fk}, h_{kt}\}$ can be estimated by minimizing the cost function $C(\theta)$ using classical algorithms such as expectation-maximization (EM) [31] or majorization-minimization (MM) [23] to derive the update rules. In particular, for this model, the MM update rules are as follows:

$$w_{fk} \leftarrow w_{fk} \sqrt{\frac{\sum_{t} h_{kt}\, \operatorname{tr}\!\left(R_{ft}^{-1} \hat{R}_{ft} R_{ft}^{-1} G_{fk}\right)}{\sum_{t} h_{kt}\, \operatorname{tr}\!\left(R_{ft}^{-1} G_{fk}\right)}}, \qquad h_{kt} \leftarrow h_{kt} \sqrt{\frac{\sum_{f} w_{fk}\, \operatorname{tr}\!\left(R_{ft}^{-1} \hat{R}_{ft} R_{ft}^{-1} G_{fk}\right)}{\sum_{f} w_{fk}\, \operatorname{tr}\!\left(R_{ft}^{-1} G_{fk}\right)}}.$$

Then, an algebraic Riccati equation of the form $G_{fk} A G_{fk} = B$ must be solved to update $G_{fk}$. The details are omitted for brevity and can be reviewed in [23].
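A naive implementation of one of these updates makes the cost structure explicit: an I×I inverse and several matrix products at every time-frequency point. The sketch below (illustrative Python with random Hermitian positive-definite matrices standing in for the SCMs, following the MM-style update form shown above) computes the w_fk update for a single component k:

```python
import numpy as np

rng = np.random.default_rng(3)
I, F, T = 4, 8, 10                        # channels, freq bins, time frames

def hpd(rng, I):
    """Random Hermitian positive-definite matrix (stand-in for an SCM)."""
    A = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
    return A @ A.conj().T + I * np.eye(I)

R_model = np.array([[hpd(rng, I) for _ in range(T)] for _ in range(F)])
R_emp   = np.array([[hpd(rng, I) for _ in range(T)] for _ in range(F)])
G       = np.array([hpd(rng, I) for _ in range(F)])    # one component k
w, h = rng.random(F), rng.random(T)

# Naive w_fk update: explicit O(I^3) inverse at every (f, t) point
w_new = np.empty(F)
for f in range(F):
    num = den = 0.0
    for t in range(T):
        Rinv = np.linalg.inv(R_model[f, t])
        num += h[t] * np.trace(Rinv @ R_emp[f, t] @ Rinv @ G[f]).real
        den += h[t] * np.trace(Rinv @ G[f]).real
    w_new[f] = w[f] * np.sqrt(num / den)

assert np.all(w_new > 0)                  # multiplicative: sign preserved
```

The per-point inverse in this loop is exactly the operation the proposed driver replaces with Cholesky-based triangular solves in the next section.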
However, updating the model parameters entails a large computational cost, of the order of $O(I^3)$. This limits the use of the framework when the number of channels increases. For example, in the HOA domain, modeling a fourth-order representation requires at least $(4 + 1)^2 = 25$ microphones to avoid spatial aliasing, which renders the method infeasible in practical situations. As mentioned in the Introduction, several diagonalization-based methods have been proposed to provide computationally efficient solutions [9, 13, 20–22] at the cost of being limited to certain array setups [22], retaining only the SCM diagonal values [9], or relying on the statistical independence between the sources to derive the spatial characteristics [13, 21]. However, to the best of our knowledge, the optimization of the original full-rank solution has not been studied in the literature. In the next section, we present an approach for addressing this issue.

Proposed approach for reducing the complexity of MNMF
In this study, a parallel approach for updating the full-rank matrices of MNMF systems is proposed. In particular, the objective is to provide an efficient solution to the problem described in Sect. 2.3, that is, to reduce the computational cost of the multiplicative update rules in MNMF approaches when handling a large number of microphone signals, as in the case of HOA encoding.
As presented in Sect. 2.3, the multiplicative update rules involve, for each time-frequency point $(f, t)$, the computation of the inverse SCM $R_{ft}^{-1}$ and several matrix multiplications. These operations entail a high computational cost for two reasons: (1) they must be repeated as many times as the number of iterations required for the convergence of the method and for every time-frequency point of the input spectrogram, and (2) the matrices are large when handling multichannel recordings with a high number of microphones.
Consequently, a driver that enables the efficient computation of the multiplicative update rules is required to develop a feasible MNMF system for real scenarios. Therefore, we propose a parallel multiarchitecture driver designed to operate on multicore computers and GPUs. The software was developed in the C language using, as appropriate, OpenMP or the CUDA suite. The driver can also be used in numerical computing environments, such as MATLAB or GNU Octave, through MEX (MATLAB Executable) interfaces.
To achieve this goal, we propose using the Cholesky decomposition for Hermitian positive semi-definite matrices. Without loss of generality, the target operation to be accelerated can be expressed as follows:

$$z = \alpha\, \operatorname{tr}\!\left(A^{-1} B A^{-1} C\right) + \beta\, \operatorname{tr}\!\left(A^{-1} C\right),$$

where $\alpha, \beta \in \mathbb{R}$ and $A, B, C \in \mathbb{C}^{I \times I}$. Notably, $A$, $B$, and $C$ are Hermitian positive semi-definite matrices. The most expensive part of this operation is the computation of $A^{-1}$, because the number of channels $I$ is extremely high in HOA signals. Our proposed method reformulates the operation as the solution of several systems of linear equations. We therefore first define the following systems of linear equations:

$$A X = B, \qquad A Y = C,$$

which implies $X = A^{-1} B$ and $Y = A^{-1} C$.

To solve these systems efficiently, we propose the application of the Cholesky decomposition to the matrix $A$. If $A \in \mathbb{C}^{I \times I}$ is a positive semi-definite Hermitian matrix, the Cholesky decomposition factorizes it into an upper triangular matrix $U$ and its conjugate transpose, as follows:

$$A = U^{\mathsf{H}} U.$$

In our approach, the Cholesky decomposition is computed using the LAPACK implementation based on Level 3 BLAS calls. According to the LAPACK documentation, the total number of floating-point operations is approximately $\frac{4}{3} I^3$ for the complex flavors. Using the Cholesky decomposition of $A$, each system $A X = B$ can be solved through the following two triangular systems of linear equations:

$$U^{\mathsf{H}} Z = B, \qquad U X = Z,$$

and analogously for $A Y = C$. These triangular systems are also solved in our approach using the LAPACK implementation based on Level 3 BLAS calls. Altogether, these operations (one Cholesky decomposition and four triangular solves) require approximately $\frac{52}{3} I^3$ floating-point operations.
Following the solution of these systems, the target operation can be reformulated as follows:

$$z = \alpha\, \operatorname{tr}\!\left(X Y\right) + \beta\, \operatorname{tr}\!\left(Y\right).$$

Finally, the trace of the matrix product $XY$ is computed by considering that only the elements of its diagonal must be evaluated, instead of the entire matrix product. The pseudocode of the proposed approach is presented in detail in Algorithm 1. The code for the proposed parallel multiarchitecture driver is freely available online.²
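The steps above can be sketched as follows. The Python snippet below (illustrative only; the actual driver is written in C on top of LAPACK/BLAS) computes the trace term via a Cholesky factorization and two pairs of triangular solves, accumulates only the diagonal of XY, and checks the result against the explicit-inverse computation:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(4)
I = 6

def hpd(I):
    """Random Hermitian positive-definite matrix (illustrative stand-in)."""
    M = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
    return M @ M.conj().T + I * np.eye(I)

A, B, C = hpd(I), hpd(I), hpd(I)

# Cholesky A = U^H U with upper triangular U (~(4/3) I^3 complex flops)
U = cholesky(A, lower=False)

def a_inv_times(M):
    # Solve A X = M via two triangular solves: U^H Z = M, then U X = Z
    Z = solve_triangular(U, M, trans='C', lower=False)
    return solve_triangular(U, Z, lower=False)

X = a_inv_times(B)                        # X = A^{-1} B
Y = a_inv_times(C)                        # Y = A^{-1} C

# Only the diagonal of X Y is needed: tr(XY) = sum_ij X_ij * Y_ji
tr_fast = np.einsum('ij,ji->', X, Y)

# Reference computation with an explicit inverse
Ainv = np.linalg.inv(A)
tr_ref = np.trace(Ainv @ B @ Ainv @ C)
assert np.allclose(tr_fast, tr_ref)
```

Avoiding both the explicit inverse and the full product XY is what brings the operation count down to the figures quoted above, and the same structure maps directly onto the LAPACK `potrf`/`trsm` routines used by the driver.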

Evaluation and experimental results
In this section, the proposed system is evaluated in terms of execution time and speedup. We conducted several experiments to analyze the performance and reliability of the proposed method using a synthetic dataset. For this purpose, we generated several multichannel synthetic mixtures with different durations and numbers of channels. All mixtures were created with a sampling frequency of 44.1 kHz. The time-frequency representation of the mixtures was obtained using a 2048-point STFT with half overlap between adjacent frames. Finally, the number of channels considered varied from 64 to 4096.
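Under these settings, the dimensions of the resulting spectrograms follow directly; for instance (a simple back-of-the-envelope helper, not part of the evaluation code):

```python
# Number of STFT frames for a signal of a given duration with a
# 2048-point window and 50% overlap (hop = 1024 samples).
def num_frames(duration_s: float, fs: int = 44100,
               n_fft: int = 2048, hop: int = 1024) -> int:
    n = int(duration_s * fs)
    return 1 + max(0, n - n_fft) // hop

# One-sided spectrum of a 2048-point real STFT has 1025 frequency bins
n_bins = 2048 // 2 + 1
assert n_bins == 1025

# e.g., 10 s of audio at 44.1 kHz yields roughly 43 frames per second
frames_10s = num_frames(10.0)
assert frames_10s == 1 + (441000 - 2048) // 1024
```

Each of these frames contributes 1025 target operations per update, which explains why the number of simultaneous operations, rather than a single large one, dominates the workload.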
For the testbed, we focused on two different systems. First, we used a server with two Intel® Xeon® E5-2603 v3 processors, each with six cores. It operates at 1.60 GHz, with HyperThreading and Turbo Boost deactivated. This server had 1 TB of RAM and a TESLA P100-PCIe GPU with 16 GB of RAM. The theoretical peak performance for floating-point operations in double precision (TPPDP) of this system is 307.2 GFLOPS, according to the Intel documentation,³ and the TPPDP of its GPU is approximately 4.7 TFLOPS according to the NVIDIA documentation.⁴ Therefore, the theoretical speedup of the GPU with respect to the CPU was approximately 16. Next, the experiments were conducted on a server with two Intel® Xeon® Silver 4314 processors, each with 16 cores. This server operates at 2.40 GHz and has 128 GB of RAM and an RTX A6000-PCIe GPU with 48 GB of RAM. In this case, the TPPDP of the CPUs is approximately 2.5 TFLOPS and the TPPDP of the GPU is approximately 1.2 TFLOPS according to the NVIDIA documentation.⁵ Therefore, we expect the CPU to achieve a theoretical speedup of approximately two with respect to the GPU on this server. Both systems run CentOS Linux 7, Intel oneAPI (release 2022.2, May 2022), and CUDA/cuBLAS (release 11.6).

Results
The limits of the proposed approach on both testbeds are discussed in this section. In this experiment, we measured the capacity of the developed driver as a function of the size of the target matrices. The maximum number of operations analogous to the target operation in Sect. 3 that can be performed simultaneously is determined by the size of the GPU and CPU memories and, thus, by the size of the matrices considered. Table 1 summarizes the number of simultaneous operations that can run on the described testbeds depending on the number of channels of the input audio mixtures. The last column indicates the approximate number of audio frames processed in each case. The maximum problem size is limited by the different memory capacities of the two servers used in the experiments: one server had only 128 GB of CPU RAM, whereas the other had 16 GB of GPU RAM. Thus, both values were used to determine the maximum problem size and ensure a fair comparison between the two servers.
Different CPU and GPU schemes were considered in the development of the proposed driver. Regarding the CPU approaches, the first was based on the Intel oneAPI Math Kernel Library (oneMKL). In this approach, all the cores of the testbed were extensively exploited using the highly parallelized matrix product routines of the library. The second was a coarse-grained parallelism approach based on OpenMP directives, in which the problem was addressed by running multiple sequential matrix products in parallel.
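The two CPU strategies differ in where the parallelism lives: oneMKL parallelizes inside each matrix operation, whereas the OpenMP scheme runs many independent per-(f, t) problems concurrently, each solved with sequential code. A rough Python analogue of the latter strategy (illustrative only; the actual driver uses OpenMP in C, and `np.linalg.solve` stands in for the Cholesky path):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(5)
I, n_ops = 8, 64                          # channels, simultaneous operations

def hpd():
    M = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
    return M @ M.conj().T + I * np.eye(I)

# One (A, B, C) triple per time-frequency point (generated up front)
problems = [(hpd(), hpd(), hpd()) for _ in range(n_ops)]

def one_trace(args):
    """Sequential solve of a single target operation tr(A^-1 B A^-1 C)."""
    A, B, C = args
    X = np.linalg.solve(A, B)
    Y = np.linalg.solve(A, C)
    return np.einsum('ij,ji->', X, Y).real

# Coarse-grained parallelism: one worker per independent problem
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(one_trace, problems))

assert len(results) == n_ops and all(r > 0 for r in results)
```

For many small matrices, this coarse-grained scheme keeps every core busy without the synchronization overhead of parallelizing each individual product, which is consistent with the OMP results reported below.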
Regarding the GPU approaches, several techniques were evaluated. Techniques such as zero-copy or unified memory were not used (or did not provide satisfactory results) in our approach because memory allocation for the matrices is performed outside the proposed driver. The first approach considered was the classical scheme based on synchronous communication, in which a continuous flow of data is established to the GPU, which processes the data as it arrives. Next, we proposed an approach based on streams. Finally, we combined events and streams to exploit massive parallel programming on the GPU.
Figure 1 shows the results obtained in terms of execution times for the Xeon E5-2603 and Xeon Silver 4314 servers as a function of the number of channels of input audio mixtures and the number of simultaneous operations performed.
We began by analyzing the performance of the Xeon E5-2603 server. Regarding the CPU approaches, the best results for both small and large matrices are obtained when multiple sequential matrix products are run in parallel using OpenMP directives. This strategy is referred to as OMP throughout the rest of the paper. In the particular case of extremely large matrices, the results obtained by the oneMKL approach are close to those obtained by OMP. These results indicate that addressing the problem with oneMKL instead of OMP is only suitable when the matrices are of significant size. Regarding the GPU schemes, we observed a similar trend for all the described approaches. In general, the stream scheme outperformed all the other approaches in terms of execution time. Notably, for matrices with 2048 and 4096 channels, the times obtained by the event-based approach were similar to those of the stream-based approach. For small sizes, the OMP version outperforms the GPU versions.
The Xeon Silver 4314 results are shown in Fig. 1b. In general, the observed behavior is similar to that of the Xeon E5-2603 server. However, in this case, the OMP approach outperforms the other schemes, both CPU and GPU. This is because this server combines a more powerful CPU (approximately 2.5 TFLOPS) with a GPU of comparatively lower performance (approximately 1.2 TFLOPS). For large sizes, the speedup of the GPU with respect to the CPU was close to one, as expected. In addition, we can observe that the oneMKL approach outperforms the OMP approach for very large matrices (i.e., 4096 channels). Finally, among all the GPU approaches, the stream-based scheme obtained the best results.
Figure 2a shows the results obtained by the Xeon E5-2603 server when RAM consumption is limited to 750 GB. This implies that approximately seven times as many simultaneous operations of the target form can be run. In this case, the measured times increase by approximately one order of magnitude. In addition, all approaches scale as expected, and the stream scheme again obtains the best results. Figure 2b depicts the speedup of the stream version with respect to the OMP version for the Xeon E5-2603 server. These two approaches were selected because they provided the best results on GPU and CPU, respectively. We can observe that for audio with few channels, the speedup provided by the GPU scheme over the CPU scheme is not particularly high. However, the speedup increases with the problem size, indicating that GPU approaches are more suitable in these cases. This is the expected behavior in view of the computed theoretical speedup.
Fig. 1 Experimental results as a function of the number of channels of the input audio mixtures and the number of simultaneous operations performed in the testbeds

Finally, to assess the resource utilization of our method, we measured its performance relative to the theoretical peak performance of both servers. Figure 3 shows the results as a function of the number of channels in the input audio mixtures. For the CPU results, we chose the OMP strategy because it provided the best results, and we varied the number of cores used. Additionally, we included the results for the GPU using the stream, event, and classical strategies. Figure 3a presents the results for the Xeon E5-2603 server. We can observe that the obtained performance increases as the number of channels increases, because the problem dimensions also increase. For large sizes, the CPU performance reached 90% of the theoretical peak. The GPU performance is worse, with 40% achieved only for large sizes. This is because the problem size may not be large enough to fully utilize the capabilities of the GPU, leading to lower performance compared to the CPU. By contrast, Fig. 3b shows the results for the Xeon Silver 4314 server. The behavior is similar to that of the other server. However, in this case, the performance for 32 cores decreases, possibly owing to resource saturation, energy constraints, or thermal issues [32]. In terms of the GPU, the results obtained for large sizes were approximately 70% of the theoretical peak.

Conclusion
In this study, we proposed a numerical approach to address the audio source separation task based on MNMF for recordings with a large number of channels, such as HOA encodings. In particular, we proposed a parallel multiarchitecture driver to compute the multiplicative update rules in MNMF approaches and optimize the computational requirements of the full-rank model. The proposed driver was designed to operate on both sequential and multicore computers, as well as on GPUs. The software was developed in the C language using, as appropriate, OpenMP or the CUDA suite. The driver can also be used from numerical computing environments, such as MATLAB or GNU Octave, through MEX (MATLAB Executable) interfaces. The proposed solution reduces the computational cost of the multiplicative update rules by using the Cholesky decomposition and solving several triangular systems of equations.
The proposal was evaluated in different scenarios, with promising results in terms of execution times on both CPU and GPU. To the best of our knowledge, this proposal is the first to address the problem of reducing the computational cost of full-rank MNMF-based systems using parallel and high-performance techniques.
In summary, the proposed solution shows promising results in terms of computational cost reduction for MNMF-based systems. In future work, we plan to integrate this driver into a sound source separation model based on MNMF and to investigate its use in the ambisonics and/or SH domains. This will enable a more comprehensive evaluation of the proposed method and its potential applications in audio processing. In addition, we plan to investigate the integration of the proposed driver into other audio processing pipelines, such as audio object-based format conversion and sound field reconstruction. Overall, this study contributes to the advancement of processing techniques, particularly in the field of multichannel audio source separation.

¹ http://ambisonics.ch/

Fig. 2 Results as a function of the number of channels of the input audio mixtures on the Xeon E5-2603 server. These results were obtained by limiting RAM usage to 750 GB