Wavelength-Proportional Interpolation and Extrapolation of Virtual Microphone for Underdetermined Speech Enhancement

We previously proposed the virtual microphone technique to improve speech enhancement performance in underdetermined situations, in which the number of channels is virtually increased by estimating extra microphone signals at arbitrary positions along the straight line formed by real microphones. The effectiveness of the interpolation of virtual microphone signals for speech enhancement was experimentally confirmed. In this work, we apply the extrapolation of a virtual microphone as preprocessing of the maximum signal-to-noise ratio (SNR) beamformer and compare its speech enhancement performance with that of using the interpolation of a virtual microphone. Furthermore, we aim to improve speech enhancement performance by solving a trade-off relationship between performance at low and high frequencies, which can be controlled by adjusting the microphone interval. We propose a new arrangement where a virtual microphone is placed at a distance from the reference real microphone proportional to the wavelength at each frequency. From the results of our experiment in an underdetermined situation, we confirmed that speech enhancement performance using the extrapolation of a virtual microphone is higher than that using the interpolation of a virtual microphone. Moreover, the proposed wavelength-proportional interpolation and extrapolation method improves speech enhancement performance compared with the conventional interpolation and extrapolation. Furthermore, we present the directivity patterns of the spatial filter and confirm the behavior that improves speech enhancement performance.

A simple solution to the above problem is to increase the number of microphones. However, this requires costly special equipment, such as a synchronized A/D converter, and a large amount of wiring. For this reason, we previously proposed the virtual microphone technique, in which the number of microphones is not actually but virtually increased [40][41][42][43]. In this technique, additional observed signals, namely, virtual microphone signals, are estimated at arbitrary positions along the straight line formed by real microphones. Signal processing using a virtually extended microphone array is possible by using virtual microphone signals in addition to real microphone signals.
The virtual microphone technique involves the interpolation and extrapolation of a virtual microphone depending on its position. In our previous studies, the interpolation was mainly used for speech enhancement [40,41] and the extrapolation was mainly used for sound image localization [42,43]. In this study, we therefore apply the extrapolation of a virtual microphone to speech enhancement and compare its performance with that obtained using the interpolation of a virtual microphone.
An actual microphone array has an optimal microphone placement that depends on frequency. In general, at high frequencies, i.e., for a signal with a short wavelength, a shorter microphone interval is advantageous for preventing spatial aliasing and thus avoiding the degradation of speech enhancement performance. Conversely, at low frequencies, i.e., for a signal with a long wavelength, a longer interval is advantageous for obtaining a sufficient time difference, or equivalently, a sufficient phase difference to construct a spatial filter, which improves speech enhancement performance. This means that there is a trade-off relationship between performance at low and high frequencies, and it can be controlled by adjusting the microphone interval. To maximize observed phase differences while avoiding spatial aliasing, a microphone should be placed so that the microphone interval becomes half the wavelength at each frequency. With actual microphones, a nonuniform-spacing microphone array has been used to deal with the trade-off relationship [44][45][46]. In this technique, a number of microphones are placed at nonuniform intervals, and signal processing is performed using a microphone pair with an appropriate microphone interval for each frequency band. This allows the microphone interval to be optimized for each frequency band, but inevitably requires many microphones, increasing the cost. Therefore, it is difficult to implement this technique on widely used small devices such as smartphones and voice recorders. On the other hand, in the virtual microphone technique, the virtual microphone can be placed at any position on the same straight line as the real microphones by interpolation and extrapolation. In addition, since the virtual microphone signal is estimated independently at each frequency bin, it is possible to change the position of the virtual microphone at each frequency.
On this basis, we propose a new virtual microphone technique that solves the trade-off relationship between performance at low and high frequencies, namely, the wavelength-proportional virtual microphone (WPVM) technique [47]. Here, the virtual microphone is placed at a distance from the reference real microphone proportional to the wavelength at each frequency for speech enhancement, and both interpolation and extrapolation are used.
In this study, we evaluate speech enhancement performance using the maximum signal-to-noise ratio (SNR) beamformer [17,48] combined with the virtual microphone technique, where we use the extrapolation of a virtual microphone and the WPVM technique. First, to examine the effectiveness and robustness against the directions of target and interferer sound sources, we perform speech enhancement with the extrapolated virtual microphones and WPVM in various acoustic environments. Next, we present the directivity patterns of spatial filters to illustrate the effect of the WPVM technique on filter design.
The structure of this paper is as follows. In section 2, we explain the virtual microphone technique. In section 3, we propose the WPVM technique. In section 4, we explain the maximum SNR beamformer. In section 5, we experimentally evaluate the performance of the extrapolation of the virtual microphone and the WPVM technique. Additionally, we present the directivity patterns to confirm the behavior of those methods. Finally, the paper is concluded in section 6.

Preliminary
In this section, we introduce the virtual microphone technique involving interpolation based on β-divergence [40,41] and extrapolation of a virtual microphone [43]. In this paper, the interpolation and extrapolation of a virtual microphone, which are conventional methods using the same position of the virtual microphone at all frequencies, are collectively referred to as the fixed virtual microphone technique. In this technique, all microphone signals are processed in the time-frequency domain. A virtual microphone signal v(ω, t) is generated from the observed signals of two real microphones x_i(ω, t), where x_i(ω, t) is the ith microphone signal (i = 1, 2) at angular frequency ω in the tth time frame. The number of channels of the microphone array is virtually increased by using virtual microphone signals in addition to the real microphone signals. The arrangement of the real and virtual microphones is shown in Fig. 1, where α is a coefficient that determines the position of the virtual microphone.
In an environment where there are multiple sounds arriving from different directions, the relationship between the microphone position and the observed signal is generally complicated. In the virtual microphone technique, by assuming W-disjoint orthogonality (W-DO) [23] for mixed signals, we can simplify the model of the observed signal. W-DO indicates the strong sparsity of a signal in the time-frequency domain, i.e., the component from a sound source dominates one time-frequency slot. By assuming W-DO, even when multiple sounds arrive, we can regard them as a single sound in each time-frequency slot.
In this technique, the phase and amplitude of a virtual microphone signal are estimated individually.
This allows different models to be applied to the phase and the amplitude, which keeps the generation of the virtual microphone signals simple. Additionally, this formulation naturally makes the generation of virtual microphone signals nonlinear, which is an essential property for applying this technique as preprocessing for linear signal processing. The amplitude and phase of x_i(ω, t) are respectively defined as
\[ A_i = |x_i(\omega, t)|, \qquad \phi_i = \angle x_i(\omega, t). \]

Estimation of phase of virtual microphone signal
When a sound wave arrives from a sufficient distance relative to the microphone interval, the propagating wave can be approximated as a plane wave. In both interpolation and extrapolation, we can estimate the phase ϕ_v of the virtual microphone signal using the linear equation
\[ \phi_v = (1 - \alpha)\phi_1 + \alpha\phi_2. \]
The observed phase is ambiguous in that it can take the value ϕ_i ± 2πn, where n is an arbitrary natural number. Thus, the phase of the virtual microphone signal is estimated under the assumption of
\[ |\phi_1 - \phi_2| \le \pi. \]

Estimation of the amplitude of virtual microphone signal
In the estimation of the amplitude of the virtual microphone signal, the formulas are different for interpolation and extrapolation. The appropriate interpolation of the amplitude of the virtual microphone depends on many conditions such as the direction of sources and reverberation. Therefore, it is difficult to faithfully model the actual amplitude attenuation. Hence, instead of using a physical model, amplitude interpolation based on β-divergence, which has simple processing and parameter adjustment, was proposed [40]. Here, the amplitude of the virtual microphone is interpolated by the β-divergence-based formula (5) of that method, where 0 < α < 1.
For the extrapolation, the conceivable amplitude of the virtual microphone is more complex than that for the interpolation. When (5) is applied to extrapolation, it may output unrealistic amplitudes such as a complex amplitude, a negative amplitude, or an amplitude diverging to positive infinity except for β = 1. Therefore, in this study, as the simplest way to avoid these problems, we use the amplitude of the signal of the real microphone that is closer to the position of the virtual microphone. Thus, the amplitude of the extrapolated virtual microphone signal is [42,43]
\[ A_v = \begin{cases} A_1 & (\alpha < 0), \\ A_2 & (\alpha > 1). \end{cases} \]
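The interpolation formula referred to as (5) is not reproduced in this text. As a minimal sketch, the function below assumes the power-mean form often associated with β-divergence-based interpolation, that is, A_v = ((1 − α)A_1^(β−1) + αA_2^(β−1))^(1/(β−1)) for β ≠ 1 and the weighted geometric mean for β = 1. This form is an assumption, chosen because it is consistent with the remarks above (no failure at β = 1, possibly complex or negative values when α lies outside the interval from 0 to 1). The name `virtual_amplitude` and its arguments are hypothetical. Outside that interval, the function falls back to the amplitude of the nearer real microphone, as described above.

```python
import math


def virtual_amplitude(a1: float, a2: float, alpha: float, beta: float = -20.0) -> float:
    """Amplitude of a virtual microphone placed between or beyond two real microphones.

    a1, a2 : strictly positive amplitudes at the real microphones
             (microphone 1 is at alpha = 0, microphone 2 at alpha = 1).
    ASSUMPTION: interpolation uses a power-mean form; the exact formula (5)
    is not given in the text.
    """
    if alpha < 0.0:   # extrapolation beyond microphone 1: nearest real amplitude
        return a1
    if alpha > 1.0:   # extrapolation beyond microphone 2: nearest real amplitude
        return a2
    if beta == 1.0:   # weighted geometric mean
        return math.exp((1.0 - alpha) * math.log(a1) + alpha * math.log(a2))
    p = beta - 1.0
    return ((1.0 - alpha) * a1 ** p + alpha * a2 ** p) ** (1.0 / p)
```

Under this assumed form, a strongly negative β such as −20 yields a value close to the smaller of the two amplitudes. Very small amplitudes raised to a large negative power can overflow, so a practical implementation would likely work in the log domain.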

Estimation of virtual microphone signal
From the above, the virtual microphone signal v(ω, t, α) is represented as
\[ v(\omega, t, \alpha) = A_v \exp(j\phi_v). \]
When we need many virtual microphones, we can use an arbitrary number of α values to generate the same number of virtual microphones.
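The phase estimation and the composition of the virtual signal can be sketched for a single time-frequency slot as follows. The sketch assumes the linear phase model ϕ_v = (1 − α)ϕ_1 + αϕ_2 with the phase difference wrapped into (−π, π], which is one way to enforce the assumption that the two real microphones are not spatially aliased. The function names are hypothetical, and the amplitude `a_v` is passed in so that any amplitude model can be used.

```python
import cmath
import math


def wrap_phase(phi: float) -> float:
    """Wrap an angle in radians to the interval (-pi, pi]."""
    wrapped = (phi + math.pi) % (2.0 * math.pi) - math.pi
    return math.pi if wrapped == -math.pi else wrapped


def virtual_signal(x1: complex, x2: complex, alpha: float, a_v: float) -> complex:
    """Virtual microphone signal for one time-frequency slot.

    x1, x2 : complex STFT coefficients of the two real microphones.
    alpha  : position coefficient (0 at microphone 1, 1 at microphone 2).
    a_v    : amplitude chosen for the virtual microphone.
    """
    phi1 = cmath.phase(x1)
    dphi = wrap_phase(cmath.phase(x2) - phi1)  # phase difference with |dphi| <= pi
    phi_v = phi1 + alpha * dphi                # same as (1 - alpha)*phi1 + alpha*phi2
    return a_v * cmath.exp(1j * phi_v)


# Plane-wave check: a phase lag of 0.3 rad per real-microphone interval
# should become a lag of 0.75 rad at alpha = 2.5.
x1 = cmath.exp(1.0j)
x2 = cmath.exp(0.7j)
v = virtual_signal(x1, x2, alpha=2.5, a_v=1.0)
```

For an ideal single plane wave, the virtual signal obtained this way coincides with what a real microphone at the same position would observe, up to the amplitude model.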

Wavelength-Proportional Virtual Microphone
As mentioned in the introduction, there is a trade-off relationship between performance at low and high frequencies in array signal processing techniques. For example, at high frequencies and short wavelengths, a shorter microphone interval prevents spatial aliasing. Conversely, at low frequencies and long wavelengths, a longer microphone interval provides a sufficient phase difference as spatial information.
In this paper, we propose a new arrangement, in which the position of the virtual microphone is proportional to the wavelength at each frequency [47]. We call the proposed technique the wavelength-proportional virtual microphone (WPVM). The arrangement of real and virtual microphones is shown in Fig. 2. In this method, the coefficient of the position of the virtual microphone α is given by
\[ \alpha(\omega) = \frac{k\lambda(\omega)}{d} = \frac{2\pi c k}{\omega d}, \]
where λ is the wavelength, d is the distance between the real microphones, k is a wavelength coefficient, and c is the velocity of sound. The wavelength coefficient k is the interval between the reference microphone x_1 and the virtual microphone v relative to the wavelength λ(ω). This equation means that the virtual microphone is placed at a position k times the wavelength corresponding to the frequency to be processed; thus, the total length of the microphone array including the virtual microphone is large at low frequencies and small at high frequencies. For example, when k = 0.5 and c = 340 m/s, the virtual microphone is located 42.5 cm from the reference microphone at 400 Hz, 17 cm at 1 kHz, and 4.25 cm at 4 kHz. In this case, the maximum phase difference between x_1 and v is π, so spatial aliasing does not occur at any frequency.
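The placement rule described above, with the virtual microphone at k wavelengths from the reference microphone, can be checked numerically with a short sketch. The speed of sound of 340 m/s is an assumption inferred from the quoted positions (42.5 cm at 400 Hz for k = 0.5), the real-microphone interval of 2.83 cm is the value used in the experiment, and the function name `wpvm_alpha` is hypothetical.

```python
import math

SOUND_SPEED = 340.0  # m/s; ASSUMPTION inferred from the 42.5 cm at 400 Hz example


def wpvm_alpha(freq_hz: float, k: float, d: float, c: float = SOUND_SPEED) -> float:
    """Position coefficient alpha = k * lambda / d = 2*pi*c*k / (omega * d).

    freq_hz must be positive; d is the real-microphone interval in metres.
    """
    omega = 2.0 * math.pi * freq_hz
    return 2.0 * math.pi * c * k / (omega * d)


d = 0.0283  # real-microphone interval in metres
for f in (400.0, 1000.0, 4000.0):
    alpha = wpvm_alpha(f, k=0.5, d=d)
    kind = "interpolation" if alpha < 1.0 else "extrapolation"
    print(f"{f:6.0f} Hz: alpha = {alpha:6.2f}, position = {100 * alpha * d:5.2f} cm ({kind})")
```

Because α(ω)d = kλ(ω), the virtual microphone moves toward the reference microphone as the frequency rises. Whether a given frequency bin uses interpolation or extrapolation therefore depends on whether kλ(ω) is smaller or larger than d.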

Maximum SNR Beamformer
In this study, to evaluate the performance of the extrapolation of the virtual microphone and the WPVM technique, we carry out the extrapolation and the WPVM technique as preprocessing of the maximum SNR beamformer [17,48]. The advantage of this beamformer is that it does not explicitly require the direction of sound sources. In speech enhancement by a beamformer, the multichannel filter w(ω) is constructed for the N-channel observation signals x(ω, t).
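The sound y(ω, t) in which the target sound is enhanced can be obtained by applying the filter w(ω) to the observed signal x(ω, t) as follows:
\[ y(\omega, t) = \mathbf{w}^{\mathsf{H}}(\omega)\,\mathbf{x}(\omega, t). \tag{11} \]
The construction of the maximum SNR beamformer requires prior information on the covariance matrices of the target-only period R_T(ω) and the interference-only period R_I(ω). From this information, the maximum SNR beamformer constructs a filter so that the SNR, γ(ω), of the target to the interference signal becomes maximum as follows:
\[ \mathbf{w}(\omega) = \operatorname*{arg\,max}_{\mathbf{w}} \gamma(\omega), \qquad \gamma(\omega) = \frac{\mathbf{w}^{\mathsf{H}}(\omega)\,\mathbf{R}_T(\omega)\,\mathbf{w}(\omega)}{\mathbf{w}^{\mathsf{H}}(\omega)\,\mathbf{R}_I(\omega)\,\mathbf{w}(\omega)}. \]
Although the constructed spatial filter w(ω) has a scaling ambiguity in the maximum SNR beamformer, a compensation method was proposed in [48].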
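When the virtual microphone technique is used, the observed signals including the virtual microphone signal and the constructed filters are
\[ \mathbf{x}(\omega, t) = [x_1(\omega, t),\, x_2(\omega, t),\, v(\omega, t, \alpha)]^{\mathsf{T}}, \qquad \mathbf{w}(\omega) = [w_1(\omega),\, w_2(\omega),\, w_v(\omega)]^{\mathsf{T}}. \]
Thus, the enhanced signal can be obtained by (11). The virtual microphone technique can be similarly applied to other microphone array signal processing techniques as well as the maximum SNR beamformer.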
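A ratio of two quadratic forms such as the SNR above is maximized by the principal generalized eigenvector of the pair (R_T, R_I), provided that R_I is positive definite. The following pure-Python sketch finds that vector for one frequency bin by power iteration on R_I^{-1} R_T. All function names, the toy steering vectors, and the small diagonal loading of 0.01 are assumptions made for illustration, and the scaling compensation of [48] is not included.

```python
import math


def matvec(A, v):
    """Matrix-vector product for nested lists of complex numbers."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        rest = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - rest) / M[i][i]
    return x


def quadratic_form(w, R):
    """Real value of w^H R w for a Hermitian matrix R."""
    Rw = matvec(R, w)
    return sum(wi.conjugate() * ri for wi, ri in zip(w, Rw)).real


def snr(w, R_T, R_I):
    """Output SNR of filter w for target and interference covariances."""
    return quadratic_form(w, R_T) / quadratic_form(w, R_I)


def max_snr_filter(R_T, R_I, iterations=100):
    """Principal generalized eigenvector of (R_T, R_I) by power iteration."""
    w = [1 + 0j] * len(R_T)
    for _ in range(iterations):
        w = solve(R_I, matvec(R_T, w))  # w <- R_I^{-1} R_T w
        norm = math.sqrt(sum(abs(c) ** 2 for c in w))
        w = [c / norm for c in w]
    return w


def outer(a):
    """Rank-one covariance matrix a a^H."""
    return [[ai * aj.conjugate() for aj in a] for ai in a]


# Toy two-channel example; all values are illustrative assumptions.
a_target = [1 + 0j, 1 + 0j]  # target: no inter-channel phase difference
a_interf = [1 + 0j, -1j]     # interferer: -pi/2 inter-channel phase difference
R_T = outer(a_target)
R_I = [[v + (0.01 if i == j else 0.0) for j, v in enumerate(row)]
       for i, row in enumerate(outer(a_interf))]
w = max_snr_filter(R_T, R_I)
print(f"SNR of simple sum: {snr([1 + 0j, 1 + 0j], R_T, R_I):.2f}")
print(f"SNR of max SNR filter: {snr(w, R_T, R_I):.2f}")
```

In practice a library routine for the generalized Hermitian eigenproblem would usually replace the hand-written solver; the power iteration is used here only to keep the sketch self-contained.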

Experiment
In the experiment, we compared speech enhancement performance of the maximum SNR beamformer using the extrapolation of the virtual microphone with that using the interpolation. Furthermore, we also evaluated the enhancement performance with the WPVM technique.

Experimental conditions
The layout of the sound sources, which is set up to consider speech enhancement in a conversational scene, is shown in Fig. 3. One target speaker and two interferers are assumed to be in the scene. Furthermore, two real microphones, M_1 and M_2, and one virtual microphone, M_v, are assumed, as shown in the figure. Other experimental conditions are listed in Table 1. The sampling frequency is 8 kHz, so spatial aliasing would occur if the interval between the two real microphones were longer than 4.25 cm, which is half the wavelength at the highest representable frequency of 4 kHz. The interval between the real microphones is 2.83 cm, so there is no spatial aliasing between them. We use four female and four male speech signals as the target source, and one female and one male speech signal as each interference source. Half the speech signals are in Japanese and the other half are in English. In total, 32 (8×2×2) combinations of target and interference speech signals are used for the experiment. The target speaker is located in three directions, that is, at azimuths of 0° (front), −20° (left), and 20° (right), as shown in Fig. 3. The same applies to the two interferers. Therefore, a total of 27 (3×3×3) combinations of target and interferer directions are examined. The observed signals are simulated by convolving the speech signals with a set of measured impulse responses in the RWCP Sound Scene Database [49]. The impulse responses were measured in a room (reverberation time T_R = 300 ms).
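The aliasing limit quoted above follows from requiring the microphone interval to be at most half the wavelength at the highest frequency that the sampling rate can represent. A minimal sketch of the arithmetic is given below; the speed of sound of 340 m/s is an assumption that reproduces the 4.25 cm figure, and the function name is hypothetical.

```python
SOUND_SPEED = 340.0  # m/s; ASSUMPTION consistent with the 4.25 cm limit in the text


def max_alias_free_interval(sampling_rate_hz: float, c: float = SOUND_SPEED) -> float:
    """Largest microphone interval (m): half the wavelength at the Nyquist frequency."""
    nyquist_hz = sampling_rate_hz / 2.0
    return c / (2.0 * nyquist_hz)


limit = max_alias_free_interval(8000.0)
print(f"alias-free limit: {100 * limit:.2f} cm")
print(f"real interval of 2.83 cm is below the limit: {0.0283 < limit}")
```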
In the experiment to compare speech enhancement performance between interpolation and extrapolation, the coefficient of the position of the virtual microphone α is varied from 0.1 to 30 (i.e., the interval between M_1 and M_v is varied from 0.283 cm to 84.9 cm), where 0 < α < 1 indicates interpolation and α > 1 indicates extrapolation. α = 1 indicates that no virtual microphone is used (i.e., only the two real microphones are used). In the interpolation, since it has been experimentally confirmed that β = −20 provides the highest performance [40], we set β to −20.
In the evaluation of speech enhancement performance with the WPVM technique, to compare differences in performance owing to k, the wavelength coefficient k is set to 0.25, 0.5, 1, and 2. The SNR of the target signal to interference signals is set to 0 dB. To evaluate speech enhancement performance, we use the signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) as objective evaluation criteria [50]. A concise representation of the results is obtained by averaging these criteria over 864 (32×27) trials for speakers and directions. Figure 4 shows the relationship between the coefficient of the virtual microphone α and the SDR and SIR. Note that the horizontal axis has a logarithmic scale.
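The trial count over which the criteria are averaged can be reproduced by enumerating the conditions. The labels below are placeholders rather than identifiers from the actual dataset, and the three directions of each source are represented only by an index.

```python
from itertools import product

# Placeholder labels: 8 target utterances, 2 choices for each interferer.
targets = [f"target_{i}" for i in range(8)]  # 4 female + 4 male
interferer_1 = ["female", "male"]
interferer_2 = ["female", "male"]
speech_combos = list(product(targets, interferer_1, interferer_2))

# Three candidate directions for each of the three sources, by index.
direction_combos = list(product(range(3), repeat=3))

trials = list(product(speech_combos, direction_combos))
print(len(speech_combos), len(direction_combos), len(trials))
```

Each reported SDR or SIR value is then the mean of the per-trial values over this list.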

Results
The curved line indicates speech enhancement performance using the fixed virtual microphone technique, which uses the same value of α for all frequencies. According to Fig. 4, the SDR was improved by up to 1.5 dB compared with that without the virtual microphone (α = 1) by using interpolation (α < 1), whereas it was improved by up to about 2.5 dB by using extrapolation (α > 1), i.e., the SDR is 1 dB higher when using extrapolation than when using interpolation. Similarly, the SIR was improved by 2.5 and 4.5 dB by using interpolation and extrapolation, respectively. From these results, it can be seen that the extrapolation of the virtual microphone is more effective than the interpolation.
The straight lines show the enhancement performance when using the WPVM technique with each wavelength coefficient k. For the SDR, the performance is highest for k = 0.5, with improvements of up to 1.3 and 0.3 dB over interpolation and extrapolation, respectively. For the SIR, the performance is highest for k = 1, with improvements of up to 3 and 1 dB over interpolation and extrapolation, respectively. In contrast, the results for k = 2 and k = 0.25 were inferior to those of the fixed virtual microphone technique for some values of α.

Discussion
To clarify the reason underlying these results, we illustrate the directivity patterns of the spatial filter of the maximum SNR beamformer. We focused on a specific combination of directions of the target and interference sources: 0° for the target, 60° for interference 1, and −60° for interference 2. Speech enhancement performance of each method is shown in Fig. 5, and is similar to that in Fig. 4.
For the same combination of directions as above, the directivity patterns of the maximum SNR beamformer obtained by using interpolation and extrapolation and by using the WPVM technique are shown in Figs. 6 and 7, respectively, where the values of α with the best enhancement performance were selected for interpolation and extrapolation.
According to Fig. 6 (a), the spatial filter with the interpolated virtual microphone (α = 0.4 at all frequencies) has nulls in the frequency range from 1 to 4 kHz and no nulls at frequencies below 1 kHz. This means that sounds below 1 kHz cannot be sufficiently suppressed. According to Fig. 6 (b), the spatial filter with the extrapolated virtual microphone (α = 13 at all frequencies) has many sharp nulls, which implies the occurrence of spatial aliasing. As a result, sounds from various directions, such as those near the target source, are suppressed in addition to the interference sound. However, unlike in interpolation, nulls exist even at frequencies below 1 kHz, which means that sounds below 1 kHz can be appropriately suppressed. In general, human speech has more energy at low frequencies than at high frequencies. Since the beamformer with extrapolation can improve the performance at low frequencies by widening the microphone interval, we conclude that the extrapolation contributes to the improvement of speech enhancement performance.
For the beamformer with the WPVM technique (Fig. 7), four sharp nulls are found in the directivity pattern for k = 2 (Fig. 7 (a)). This indicates the occurrence of spatial aliasing at all frequencies. On the other hand, two fuzzy nulls are found in the directivity pattern for k = 0.25 (Fig. 7 (d)), which indicates that the phase difference between microphones is too small at all frequencies to construct a spatial filter with sharp nulls. In contrast, two nulls are clearly observed for k = 0.5 (Fig. 7 (c)). Moreover, two belt-shaped nulls are clearly observed for k = 1 (Fig. 7 (b)), indicating that no spatial aliasing occurs. The speech enhancement performance improves because an appropriate k maximizes the observed phase difference within the range where spatial aliasing does not occur, which allows the beamformer to generate sharp nulls.
As a feature of the directivity patterns for the WPVM technique, similar directivity is found at all frequencies, indicating that it has directivity characteristics independent of frequency.
In these results, the nulls tend to deviate slightly from the directions of the interference sound sources. We attribute this to the effect of room reverberation, which is known to introduce bias.
Summarizing this discussion, when the microphone interval is small, the phase difference between microphones is insufficient at low frequencies, resulting in nulls not being properly generated. In contrast, when the microphone interval is large, spatial aliasing occurs at high frequencies. The WPVM technique with an appropriate wavelength coefficient k can cope with these two problems; thus, this method shows the highest performance.
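The link between microphone interval and the number of nulls can be illustrated with a deliberately simplified model: a two-element array under a far-field plane-wave assumption, with a filter that steers a single null toward 60°. This is not the three-channel maximum SNR filter of Figs. 6 and 7; the function names and the choice of filter are assumptions made for illustration. The spacing is expressed in wavelengths.

```python
import cmath
import math


def null_steering_gain(theta_deg: float, null_deg: float, spacing_wl: float) -> float:
    """Gain of a two-element filter that cancels a plane wave arriving from null_deg."""
    s = math.sin(math.radians(theta_deg)) - math.sin(math.radians(null_deg))
    return abs(1.0 - cmath.exp(-2j * math.pi * spacing_wl * s))


def null_directions(null_deg: float, spacing_wl: float) -> list:
    """All directions in [-90, 90] degrees where the gain above is zero.

    Nulls occur where spacing_wl * (sin(theta) - sin(null_deg)) is an integer.
    """
    s0 = math.sin(math.radians(null_deg))
    m_max = math.ceil(2.0 * spacing_wl)
    found = []
    for m in range(-m_max, m_max + 1):
        s = s0 + m / spacing_wl
        if -1.0 <= s <= 1.0:
            found.append(math.degrees(math.asin(s)))
    return sorted(found)


for spacing in (0.25, 0.5, 2.0):
    angles = null_directions(60.0, spacing)
    print(f"spacing = {spacing} wavelengths: nulls at "
          + ", ".join(f"{a:.1f} deg" for a in angles))
```

In this toy model, a spacing of up to half a wavelength yields only the intended null, whereas a spacing of two wavelengths produces additional nulls in other directions, which is the aliasing behavior discussed above. A small spacing keeps the single null but makes the gain vary slowly with direction, so the null is broad rather than sharp.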

Conclusion
In this paper, we applied extrapolation of a virtual microphone with the maximum SNR beamformer to speech enhancement in an underdetermined situation, and confirmed that its speech enhancement performance is better than that with interpolation of a virtual microphone. In addition, we proposed a new arrangement where a virtual microphone is placed at a distance from the reference real microphone proportional to the wavelength at each frequency. The advantages of this method are that no spatial aliasing occurs and the phase difference between microphones is sufficient to construct a spatial filter at all frequencies by setting an appropriate wavelength coefficient k.
In the experiment, we evaluated speech enhancement performance on the basis of the SDR and SIR in an underdetermined situation. By comparing the proposed method with the conventional method, we found that the SDR was improved by about 1.3 dB and the SIR by about 3 dB. These results indicate that the proposed WPVM technique is effective for speech enhancement using the maximum SNR beamformer in an underdetermined situation.

Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
RJ performed the experiments and wrote the majority of the manuscript, and other authors reviewed and revised the manuscript. All authors made contributions to the conception and design of the work, analyzed the data, and interpreted the results. All authors read and approved the final manuscript.

Funding
This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI through a Grant-in-Aid for Scientific Research under Grants 16H01735, 19H04131, and 19J20420, and by the SECOM Science and Technology Foundation.