Deep Learning-based Noise-Robust Flexible Piezoelectric Acoustic Sensors for Speech Processing


 Flexible piezoelectric acoustic sensors (f-PAS) have attracted significant attention as a promising component for voice user interfaces (VUI) in the era of the artificial intelligence of things (AIoT). The signal distortion of highly sensitive biomimetic f-PAS is one of the most challenging obstacles for real-life applications because of their fundamentally different operating mechanism compared with conventional microphones. Here, a noise-robust flexible piezoelectric acoustic sensor (NPAS) is demonstrated by designing multi-resonant bands outside the noise-dominant frequency range. Broad voice coverage up to 8 kHz is achieved by adopting an advanced piezoelectric membrane with an optimized polymer ratio. Deep learning-based speech processing of the multi-channel NPAS is demonstrated, showing outstanding improvements in speaker recognition and speech enhancement compared to a commercial microphone. Finally, the NPAS independently identified multi-user voices in a crowd, demonstrating simultaneous speaker separation.


Introduction
Voice user interface (VUI), the most intuitive human-machine interaction (HMI), is a promising technology for personalized services, such as smart home appliances, intelligent virtual assistants, and biometric authentication in the Artificial Intelligence of Things (AIoT) era [1][2][3][4][5][6][7] . Commercialized microelectromechanical system (MEMS) microphones exhibit a flat response with low sensitivity in the range from 20 Hz to 8 kHz [8][9][10] . To enhance the signal-to-noise ratio (SNR) for far-distance detection, these capacitive microphones should be integrated with amplifying circuits, which results in a corresponding increase in noise and power consumption 11 . Furthermore, the single channel of a MEMS microphone provides insufficient voice information, causing low accuracy in voice recognition. In contrast, the human ear solves these issues by adopting the resonant vibration of the basilar membrane and multi-channel voice detection with 10,000 hair cells [12][13][14] .
Recently, flexible piezoelectric acoustic sensors (f-PAS), mimicking the human cochlea, have attracted significant attention for sensing the voice spectrum by controlling resonant frequency bands via an ultrathin trapezoidal membrane [15][16][17][18][19] . Biomimetic f-PAS with high sensitivity and multi-channel signals have exhibited an exceptional speaker recognition rate in miniaturized dimensions 19 .
The extremely sensitive response of lead-zirconate-titanate (PZT) based f-PAS can induce significant interference between voice signals and ambient sounds 19,20 . Precise detection of voice features (0.1 - 8 kHz), regardless of surrounding noise and environmental conditions, is crucial to identify the speech signals of multiple users [21][22][23] . The previous f-PAS demonstrated speaker recognition in anechoic conditions with a limited frequency coverage of up to 4 kHz 17,19 . For commercial applications, the f-PAS should provide a consistent and broad frequency response with wide voice coverage bands in noisy environments. Conventional MEMS microphones have overcome distorted voice signals via noise reduction circuits, statistical/adaptive filters, and machine learning (ML) based noise filtering 21,[24][25][26] . Recently, a new approach of noise-robust ML algorithms further improved the performance in speaker recognition and speech enhancement by treating voice input data like an image or calculating a weighted value for each signal 27,28 . However, these noise filtering and ML technologies were designed to process signals from capacitive-type MEMS microphones with a single-channel input [29][30][31] . Therefore, the multi-channel f-PAS, whose resonant detection mechanism differs fundamentally from capacitive sensing, should be investigated together with new algorithms.
Herein, we report a noise-robust flexible piezoelectric acoustic sensor (NPAS) for highly accurate speech processing in real-life environments. The noise-robustness was realized by locating the multi-resonant bands outside the noise-dominant frequency range. Using the CNN algorithm, the multi-channel NPAS achieved a 96% speaker recognition rate with a 62% reduction in error rate compared to the commercialized microphone. Speech enhancement of the NPAS was also demonstrated by the selective processing of multi-channel signals via a deep U-net model. Finally, the AI-based NPAS successfully separated multi-user voices from a crowd, indicating that independent speakers' speeches can be identified and digitalized simultaneously.

Biomimetic NPAS and deep learning-based speech processing
Fig. 1a schematically illustrates the overall concept of deep learning-based speech processing via the highly sensitive biomimetic NPAS. (i) The flexible piezoelectric membrane with a noise-robust resonant response was fabricated by mimicking the mechanism of the human cochlea. The trapezoidal basilar membrane detects multi-frequency components depending on its width, which allows the human audible range from 20 Hz to 20 kHz [12][13][14] . With a voice/noise frequency analysis, this biomimetic structure enables multi-tunable resonant bands for a noise-insensitive piezoelectric response by using the inversely proportional relationship between the resonance frequency and the width of the NPAS 32 :

$$f \propto \frac{t}{w^2}\sqrt{\frac{E}{\rho}} \quad (1)$$

where $f$ is the resonance frequency, and $t$, $w$, $E$, and $\rho$ indicate the thickness, width, elastic modulus, and density of the NPAS membrane, respectively. Most of the voice information for speaker/speech recognition is distributed in the frequency range of 0.1 - 8 kHz, while the noise sources of industry, office, home, and transportation environments are dominant below 0.1 kHz [33][34][35][36][37] . These distinctive frequency characteristics of voice and noise signals were utilized to achieve a less-distorted NPAS by locating the resonance frequencies outside the noise range (only in the phonetic spectrum). To cover the entire voice spectrum up to 8 kHz with high sensitivity, a doping technique was used to improve the piezoelectric coefficient of the PZT membrane based on the following equation [38][39][40] :

$$d_{33} \propto \varepsilon_0 \varepsilon_r P \quad (2)$$

where $d_{33}$ is the piezoelectric coefficient, $\varepsilon_0$ is the vacuum permittivity, $\varepsilon_r$ is the relative permittivity, and $P$ is the polarization. The crystal quality of the piezoelectric film is important to improve the sensitivity and detection distance of flexible piezoelectric acoustic sensors 20,42,49 . However, perovskite materials should be annealed at high temperature for crystallization, which is not compatible with flexible plastic substrates 41 . Fig. 2a shows the X-ray diffraction (XRD) analysis data of the PNZT membrane on PET substrates. The sensitivity was evaluated as

$$\mathrm{Sens.}\ (\mathrm{dBV}) = 20\log_{10}\!\left(\frac{V_{rms}}{V_0}\right) \quad (3)$$

where Sens. is the sensitivity, dBV is the unit of decibels with respect to 1 volt, $V_{rms}$ is the root mean square voltage, and $V_0$ is the reference of 1 volt. The PNZT membrane presented 4 dBV higher sensitivity compared to a PZT film, which proves that the Nb dopant can enhance sensor performance with superior piezoelectric properties. A broad resonant bandwidth ($\Delta f$ ~ 400 Hz) of the PNZT film was obtained with the 25 µm thick flexible substrate, indicating a low quality factor (Q factor, ~ 1.7) at the first resonance frequency of the NPAS. The Q factor is inversely proportional to the bandwidth, as in the following equation:

$$Q = \frac{f_0}{\Delta f} \quad (4)$$

where $f_0$ is the resonance frequency, and $\Delta f$ is the frequency bandwidth at 3 dB below the resonant peak value. The effect of the polymer ratio on the bandwidth and sensitivity was also theoretically calculated to verify that the 25 µm thick PET could be used for broadening the detectable frequency range of the NPAS, as depicted in the Supplementary Information. Note that the most discriminative information for speaker/speech recognition is located in the high-frequency region (3.5 - 8 kHz) of the voice spectrum 58,59 . The multi-resonant responses of the NPAS were theoretically investigated to prove that the enhanced piezoelectric properties improved the voice spectrum coverage. Fig. 3c illustrates the FEM results for the NPAS membrane to analyze the mechanical displacements under the monochromatic sound waves of the resonance modes. At the 4th mode of 1810 Hz, a maximum displacement of 180 nm was observed near the region of channel 4. As the resonance frequency was increased up to the 12th mode, the maximum displacement region shifted towards the narrow position of channel 7 (see Fig. 3c and Supplementary Fig. 17). Note that the sensor performance decreases as the IDE channel width becomes narrower because of the linear relationship between sensitivity and active piezoelectric area.
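As a numerical illustration of the resonance-frequency scaling and Q-factor relations described above, the sketch below uses hypothetical membrane parameters (not the measured NPAS values); the first resonance $f_0 \approx 680$ Hz is back-calculated from the reported Q ~ 1.7 and $\Delta f$ ~ 400 Hz:

```python
import math

def resonance_freq(t, w, E, rho, k=1.0):
    """Relative resonance frequency: f ∝ (t / w^2) * sqrt(E / rho).
    k is an unspecified geometry-dependent prefactor (hypothetical)."""
    return k * (t / w**2) * math.sqrt(E / rho)

def q_factor(f0, bandwidth):
    """Quality factor: Q = f0 / Δf."""
    return f0 / bandwidth

# Doubling the membrane width quarters the resonance frequency
# (thickness, modulus, and density held fixed; values are illustrative).
f_narrow = resonance_freq(t=1e-6, w=5e-3, E=60e9, rho=7500)
f_wide = resonance_freq(t=1e-6, w=10e-3, E=60e9, rho=7500)
assert abs(f_narrow / f_wide - 4.0) < 1e-9

# A broad bandwidth (Δf ~ 400 Hz) at a low first resonance gives a low Q,
# consistent with the reported Q ~ 1.7.
print(q_factor(680.0, 400.0))  # 1.7
```

This is why the trapezoidal (width-graded) membrane yields multiple, broadly spaced resonant bands: each channel position sees a different effective width and hence a different resonance frequency.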
However, the middle region of the highly sensitive NPAS membrane resonated intensively under the 13th harmonic mode sound, which enabled the broad resonance band in the range of 6.5 - 8 kHz, as displayed in Fig. 3b and Supplementary Fig. 18.
In flexible resonant acoustic sensors, highly efficient piezoelectric conversion is crucial to detect minute sounds without an amplification circuit, which increases power consumption and noise fluctuation, as presented in Supplementary Fig. 19 11,61 . The measured noise response of the NPAS (Fig. 3f, i) showed a less-interference property compared to the 63 dB of the commercial capacitive microphone.
As presented in Fig. 3f, ii, the noise-robust characteristics were also proved at non-resonant high frequencies due to the broad resonant bandwidth. Fig. 3g shows the sensitivity and SNR of the highly sensitive NPAS as a function of distance. The relationship between the sound pressure level and the distance is defined by the following equation:

$$\Delta SPL = 20\log_{10}\!\left(\frac{d}{d_0}\right) \quad (5)$$

where $\Delta SPL$ is the difference in the sound pressure level, $d$ is the final distance between the speaker and the NPAS, and $d_0$ is the initially measured distance from the sound source. As described in Supplementary Fig. 21a, the reference pressure of 94 dB SPL was obtained from the initial position ($d_0$), where the sensitivity and SNR of the NPAS were -26 dBV and 94 dB, respectively. The sensitivity and SNR of the NPAS were inversely proportional to the distance because the SPL decreased with the measurement distance 49 (see also Supplementary Fig. 21b). The CNN classifier for speaker recognition was trained with the cross-entropy loss:

$$L_{CE} = -\sum_{i=1}^{S} y_i \log\big(p_i(x)\big) \quad (6)$$

where $L_{CE}$ is the cross-entropy loss function, $S$ is the number of speakers, $y_i$ is the label, $p_i$ is the probability predicted by the CNN model, and $x$ is the MFCC input. As illustrated in Supplementary Fig. 24, the speaker recognition was conducted using an attention method that automatically applies weighted values to the crucial channels of the NPAS under noise levels from -10 dB to 20 dB.
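A minimal numerical check of the distance relation above ($\Delta SPL = 20\log_{10}(d/d_0)$): doubling the distance from the source attenuates the level by about 6 dB, so a 94 dB SPL reference drops to roughly 88 dB SPL at twice the initial distance (the distances here are hypothetical, for illustration only):

```python
import math

def delta_spl(d, d0):
    """Sound pressure level change (dB) between distances d and d0."""
    return 20.0 * math.log10(d / d0)

ref_spl = 94.0                      # reference level at the initial position d0
attenuation = delta_spl(2.0, 1.0)   # hypothetical distances in metres
print(round(ref_spl - attenuation, 1))  # 88.0
```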
To verify the noise-robust voice detection, the NPAS speech signals were compared with those of a MEMS microphone. As the noise level increased, the recognition rate of the MEMS microphone decreased from 91% to 68%. It is noteworthy that the difference in accuracy grew as the noise level increased. In addition, the recognition rate of the NPAS outperformed the MEMS microphone for both the clean speech signal and the single noisy signal ( Supplementary Fig. 25, 26).
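The cross-entropy training objective described above can be sketched as follows (toy probabilities over four classes, not trained model outputs; the paper's classifier distinguishes 40 speakers):

```python
import math

def cross_entropy(y, p):
    """L_CE = -sum_i y_i * log(p_i) over S speaker classes.
    y is a one-hot label vector, p the predicted probability vector."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

y = [0, 1, 0, 0]                   # ground-truth: speaker 1 (toy 4-speaker case)
p_good = [0.05, 0.85, 0.05, 0.05]  # confident, correct prediction
p_bad = [0.25, 0.25, 0.25, 0.25]   # uninformative uniform prediction

# A confident correct prediction yields a lower loss than a uniform one.
assert cross_entropy(y, p_good) < cross_entropy(y, p_bad)
print(round(cross_entropy(y, p_bad), 3))  # 1.386 (= log 4)
```

Minimizing this loss drives the predicted probability of the true speaker toward 1, which is what the attention-weighted multi-channel features help preserve under noise.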
The speech enhancement network was trained with a waveform loss combined with a multi-resolution STFT loss:

$$L_{SE} = \left\| y - \hat{y} \right\|_1 + \sum_{n=1}^{N} L_{stft}^{(n)}\big(y, \hat{y}\big) \quad (7)$$

where $L_{SE}$ is the speech enhancement loss function, $y$ is the ground-truth clean waveform, $\hat{y}$ is the denoised waveform, $\|\cdot\|_1$ is the Manhattan distance, $N$ is the number of STFT losses, and $L_{stft}^{(n)}$ is the n-th resolution of the multi-resolution STFT loss. End-to-end speech enhancement was performed with a single waveform by averaging the multi-channel signals or selecting one channel among the seven NPAS signals. Supplementary Fig. 27 presents the waveforms of the MEMS and NPAS signals, and the corresponding scores are summarized in Supplementary Fig. 28, 29. The NPAS data for single channel 3 and two-averaging showed the highest scores in CSIG and CBAK, respectively. Note that single channel 3 has notable high-frequency characteristics over the ranges of 4 ~ 5 and 7 ~ 8 kHz. Fig. 5f presents the enhancement in the composite measure for overall speech quality (COVL) of the NPAS via intentional channel selection. The COVL scores of the MEMS microphone, seven-averaging, two-averaging, and single channel 3 were rated as 2.6, 3.0, 3.5, and 3.6 at the high noise level of -5 dB SNR, respectively.
The NPAS channel 3 exhibited a 40% increase in COVL score compared to the commercialized microphone. The exceptional quality of the speech-enhanced NPAS signals was enabled by the noise-robust STFT features and multi-channel voice data. Fig. 5g schematically illustrates the speaker separation using the multi-channel NPAS. In the experiment, the speakers and the crowd were located 2 m and 3 m away from the NPAS, respectively.
The voices of different speakers can be regarded as noisy data due to the interference among speech signals. The multi-user signals were detected based on the frequency response and directional characteristics of the NPAS ( Supplementary Fig. 13, 18). The independent vector analysis (IVA) algorithm was utilized to separate the voice waveform of each speaker by real-time processing (Supplementary Movie 2). Note that the IVA algorithm requires a microphone array to separate multiple speakers with high accuracy 63 . Fig. 5h shows the comparison of the signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) for each speaker depending on the number of NPAS channels.
High SDR and SIR are important metrics of speaker separation, representing clear speech data without the voices of the other users. As the number of channels was increased from 1 to 7, the SDR and SIR of speaker A were improved up to 6 dB and 9.3 dB, respectively. The separated speech signals of speaker B exhibited more distorted but less interfered voice information compared to speaker A, as indicated by a 1.8 dB lower SDR and a 0.5 dB higher SIR, respectively. These results suggest that the multi-channel NPAS could be used as an acoustic sensor array for separating multiple speakers in a crowd. Fig. 5i displays the separation performance of the seven-channel NPAS by measuring the SDR and SIR of speaker A as a function of iterations. The SDR and SIR values of the separated speech signals were enhanced by ~ 1.6 times (3.6 dB and 4.3 dB, respectively) when the number of iterations was 10. This efficient data processing was attributed to the multi-channel speech signals of the single NPAS chip, indicating the potential of the NPAS in real-time IoT applications of multi-speaker separation and recording.
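The SDR reported above can be illustrated with a simplified definition (the full BSS-Eval metric additionally decomposes the error into interference and artifact terms; this sketch treats all residual energy as distortion, using toy signals rather than the measured NPAS data):

```python
import math

def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio (dB):
    10*log10(||s||^2 / ||s - s_hat||^2)."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_power / error_power)

# Toy separated waveforms: a clean tone plus a small residual error term.
reference = [math.sin(0.1 * n) for n in range(1000)]
estimate = [s + 0.1 * math.cos(0.3 * n) for n, s in enumerate(reference)]
better = [s + 0.05 * math.cos(0.3 * n) for n, s in enumerate(reference)]

# Halving the residual error amplitude raises the SDR by ~6 dB.
assert sdr_db(reference, better) - sdr_db(reference, estimate) > 5.9
print(round(sdr_db(reference, estimate), 1))
```

SIR is computed analogously, but with only the leaked energy of the interfering speakers in the denominator, which is why a separated signal can be distorted (low SDR) yet well isolated (high SIR), as observed for speaker B.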

Discussion
In summary, we developed a noise-robust and broad spectrum-covering NPAS for deep learning-based speech processing by mimicking the multi-resonant mechanism of the human cochlea.

Fabrication of the NPAS
A PNZT chemical solution (QUINTESS Co. Ltd., 0.4 M) was spin-coated on rigid sapphire substrates (Hi-Solar Co.), followed by a rapid thermal annealing (RTA) procedure for crystallization. The deposition process was repeated to form a 1 µm thick PNZT film. Subsequently, the surface of the PNZT membrane was treated with O2 plasma using inductively coupled plasma-reactive ion etching (ICP-RIE, SNTEK Co.). An ultraviolet (UV) sensitive polyurethane (PU, Norland Optical Adhesive) was spin-coated to attach a 25 µm thick PET film to the crystallized PNZT thin film.
The PNZT membrane was transferred onto flexible substrates by irradiating a XeCl laser (wavelength of 308 nm) through the transparent mother substrates. After the ILLO process, seven IDE channels (Cr/Au, thicknesses of 10 and 100 nm) were patterned on the surface of the detached PNZT thin film using conventional microfabrication. The multi-channel NPAS was covered with a PU passivation layer to prevent mechanical and electrical damage. Finally, a poling process was conducted to align the piezoelectric dipoles after interconnection of the NPAS and PCB.

Material Characterizations
The crystallographic properties of the PNZT thin film were characterized by a multipurpose thin-film X-ray diffractometer (D/MAX-2500, RIGAKU), a high resolution Raman/PL system (LabRAM HR Evolution Visible/NIR, HORIBA), and a field emission transmission electron microscope (Talos F200X, FEI). The compositional analyses of PNZT on both sapphire and PET plastic substrates were conducted using a multi-purpose X-ray photoelectron spectroscope (Sigma Probe, Thermo VG Scientific) and an energy dispersive X-ray spectroscope (SU5000, Hitachi). The morphological images were investigated with a focused ion beam scanning electron microscope (Helios G4, FEI), a field emission scanning electron microscope (SU5000, Hitachi), and an optical microscope (VHX-1000E, Keyence). The polarization-electric field hysteresis was analyzed using a ferroelectric measurement system (Precision Premier II, Radiant Technologies).

Mechanical and Electrical Signal Measurement
The mechanical displacements of the NPAS were characterized using a laser Doppler vibrometer (LDV).

Resonance Simulation
FEM simulation (COMSOL Multiphysics 5.2 software) of the NPAS was conducted to theoretically calculate the sensitivity, spectrum bandwidth, resonant frequencies, and vibrational displacements.
The curvilinear membrane shape was constructed as the actual NPAS structure (w1 = 5 mm, w2 = 20 mm, and l = 30 mm). The resonant frequency (Eq. (1)) was simulated to investigate the resonance distribution and oscillation displacement in the NPAS membrane. The sensitivities of the PZT and PNZT thin films for the frequency response of the NPAS were compared using Eq. (3). The resonant bandwidth was also calculated as a function of the polymer-to-piezoelectric film ratio.

Speaker Recognition
A deep learning-based network (CNN) was utilized to classify 40 speakers for data collected using both a commercial cellular phone (Samsung, Galaxy S8) and the multi-channel NPAS. The noisy voice dataset was prepared by mixing 10 ~ 40 types of noises (indoor and outdoor sound sources) with clean TIDIGITS speeches. After the waveforms were sliced into multiple frames through a pre-emphasis filter, a window function was applied to each frame. The filter banks were computed using the STFT-converted frames, which enabled extraction of the MFCC features with a DCT method. The CNN classifier was trained for 3000 epochs until convergence with MFCC features extracted from the Galaxy S8 phone and the NPAS. Error rates were compared by calculating the simple equation (100 - recognition rate (%)).
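The MFCC pipeline described above (pre-emphasis, framing, windowing, STFT power spectrum, mel filter banks, DCT) can be sketched with numpy as follows; this is a didactic implementation with typical default parameters, not the authors' exact feature extractor:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_mfcc=13, pre_emph=0.97):
    """Minimal MFCC extraction (hypothetical parameter defaults)."""
    # 1. Pre-emphasis filter boosts high frequencies.
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2. Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 3. Power spectrum of each frame (STFT magnitude squared).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular mel filter bank applied to the power spectrum.
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II decorrelates the log filter-bank energies into MFCCs.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return feats @ dct.T

# One second of a hypothetical 440 Hz tone as a stand-in for speech.
t = np.arange(16000) / 16000
features = mfcc(np.sin(2 * np.pi * 440 * t))
print(features.shape)  # (98, 13)
```

For the multi-channel NPAS, such per-channel MFCC maps can be stacked along a channel axis to form the image-like input that the CNN classifier consumes.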

Speech Enhancement
De-noised voice signals were extracted via end-to-end speech enhancement using a single noisy waveform with a sampling rate of 16 kHz. To produce an input waveform, the single voice signal was encoded through $L$ convolutional layers with $C_{in,i}$ input channels and $C_{out,i}$ output channels ($C_{in,1} = 1$, and $1 \le i \le L$). The Exponential Linear Unit (ELU) activated the outputs of the first convolutional layer, which were then passed to the second convolutional layer and a Gated Linear Unit (GLU) activation.
The outputs of each encoder layer were passed to the subsequent layer and the corresponding decoder layer via skip connection, providing the output of the encoder (denoted as X) to the attention layer.
After the output of the attention layer was passed over the two GRU layers with a hidden size of $48 \times 2^{L-1}$, the latent representation was produced as Z = X + GRU(Attention(X)). This representation was further passed to the decoder network having transposed convolutional layers constructed in the same manner as the encoder layers. In contrast with the encoder, the layers in the decoder network were numbered in reverse from L to 1. The skip connections connect the i-th decoder input with the output of the i-th encoder. The DEEP-SEA model was trained utilizing a multi-resolution STFT loss composed of a spectral convergence term and a magnitude term:

$$L_{stft} = L_{sc} + L_{mag}, \quad L_{sc}(y, \hat{y}) = \frac{\big\| |STFT(y)| - |STFT(\hat{y})| \big\|_F}{\big\| |STFT(y)| \big\|_F} \quad (8)$$

where $L_{sc}$ is the spectral convergence loss, $L_{mag}$ is the magnitude loss (the L1 distance between log-STFT magnitudes), $\|\cdot\|_F$ is the Frobenius norm, and $|STFT(\cdot)|$ is the STFT magnitude of the waveform. The ranges of STOI, PESQ, and CSIG/CBAK/COVL were 0 ~ 100, -0.5 ~ 4.5, and 1 ~ 5, respectively (a higher score indicates higher quality of the de-noised sound wave).
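A compact numpy sketch of the multi-resolution STFT loss described in this section; the FFT sizes and hop lengths below are typical choices, not necessarily the authors' settings:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram |STFT(x)| with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.abs(np.fft.rfft(x[idx] * win, axis=1))

def stft_loss(y, y_hat, n_fft, hop, eps=1e-7):
    """Single-resolution term: spectral convergence + log-magnitude L1."""
    S, S_hat = stft_mag(y, n_fft, hop), stft_mag(y_hat, n_fft, hop)
    l_sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + eps)  # Frobenius
    l_mag = np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))
    return l_sc + l_mag

def multi_res_stft_loss(y, y_hat,
                        resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of STFT losses over N resolutions (hypothetical settings)."""
    return sum(stft_loss(y, y_hat, n, h) for n, h in resolutions)

# Toy check: a denoised waveform closer to the clean target scores lower.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
half_denoised = clean + 0.1 * rng.standard_normal(16000)
assert multi_res_stft_loss(clean, half_denoised) < multi_res_stft_loss(clean, noisy)
```

Evaluating the spectral match at several window sizes penalizes artifacts at both fine temporal and fine spectral scales, which a single-resolution STFT loss would miss.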

Speaker Separation
An independent vector analysis (IVA) was used to investigate the separation of multi-user voices with the NPAS. The test was conducted by setting parameters such as the input sound sources (~ 14),