The lack of clinical treatments for speech impairments caused by aphasia or dysarthria has motivated numerous studies aimed at improving nonacoustic communication1–5. Silent speech recognition is one of the most promising approaches to this problem: facial movements are tracked either by visual monitoring6–10 or by nonvisual capture of various biosignals11–14. Visual monitoring, a well-established vision-recognition approach, is the most direct method for mapping speech-related movements and offers the highest spatial resolution15, 16. Nevertheless, it requires continuous filming of the face in a static environment to avoid accuracy drops caused by body motion or lighting artifacts, which is inconvenient for daily communication. By contrast, human–machine interfaces that exploit wearable electronics1–5, 12–14, 17–23 for biosignal recording can operate in relatively dynamic environments. Electrophysiological signals, such as electroencephalography (EEG)11, 24–26, electrocorticography (ECoG)27–29, and surface electromyography (sEMG)12, 30, 31, have been extensively studied for silent speech interfaces (SSIs). Neural signals, including EEG and ECoG, contain an enormous amount of information about brain activity in the specific local regions activated during speech. However, EEG suffers from signal attenuation by the skull and scalp32, which impedes the differentiation of large numbers of words driven by complex electrical activities33. ECoG, by contrast, exhibits a much higher signal-to-noise ratio (SNR), but its clinical use is limited because it is an invasive approach requiring craniotomy. sEMG, which measures the electrical activity of facial muscles, can be recorded noninvasively and is comparatively simple. Nonetheless, its low spatial resolution, stemming from a limited SNR34 and interelectrode correlation35, 36, hinders the classification of larger vocabularies. Furthermore, external issues, including signal degradation caused mainly by bodily secretions such as sweat and sebum, along with skin irritation, preclude long-term monitoring in everyday life37.
Facial strain mapping using epidermal sensors provides another promising platform for a high-performance SSI. Various studies have demonstrated the robustness of strain gauges in diverse facial movement detection applications13, 38, 39, such as facial expression recognition and silent speech recognition. However, capturing the large deformation of facial skin generated during expression or speech has mostly relied on stretchable organic-material-based strain sensors fabricated through bottom-up approaches13, 38, 40. These devices can make conformal contact with the skin and endure tensile stress under severe deformation but suffer from intrinsic device-to-device variation and poor long-term stability. These are critical drawbacks for deep-learning-assisted classification, because sensor repeatability directly determines system accuracy. By contrast, inorganic materials, such as metals and semiconductors, are the standard materials for fabricating strain gauges with high reliability and fast response times. The resistance of a conventional metal-based strain sensor varies with the geometrical changes under applied strain, resulting in a relatively low gauge factor (~2). For a semiconductor-based strain gauge, however, the piezoresistive effect dominates the resistance change41–43: under applied strain, the shift in the bandgap induces carrier redistribution, thereby changing the carrier mobility and effective mass of the semiconductor. Because the resistance change caused by the piezoresistive effect is orders of magnitude larger than that caused by the geometrical effect, semiconductor-based strain gauges achieve gauge factors (~100) far exceeding those of metal-based gauges.
In this study, we propose a novel SSI that combines strain sensors based on single-crystalline silicon with a 3D convolutional deep learning algorithm to overcome the shortcomings of existing SSIs. The silicon gauge factor can be calculated using the equation \(G=(\Delta R/R)/(\Delta L/L)=1+2\nu+\pi E\), where ν, π, and E are the Poisson's ratio, piezoresistive coefficient, and Young's modulus, respectively. Boron doping at a concentration of \(5\times10^{18}\ \mathrm{cm}^{-3}\) was adopted to minimize resistance changes due to external temperature44 while retaining a relatively high piezoresistive coefficient (~80% of its undoped value)45. The high Young's modulus (E) of Si contributes to fast response as well as high sensitivity, according to the relation \(\tau=\eta/E\), where τ is the relaxation time and η is the viscosity term. However, because single-crystalline silicon is inherently rigid owing to its high Young's modulus, stretchability must be achieved by patterning it into a fractal serpentine design17, 19, 46. Our epidermal strain sensor was fabricated as a self-standing, ultrathin (< 8 µm) mesh-and-serpentine structure without an additional elastomeric layer, thereby providing enhanced air and sweat permeability47 and comfort when attached. Additionally, we devised a biaxial strain sensor that measures strain direction and magnitude in two dimensions by placing two extremely small (< 0.1 mm²) strain gauges along the horizontal and vertical directions, respectively. Based on a heuristic area-feature study, four biaxial strain sensors were attached where the skin deforms most during silent speech. Because direct electrical contact with the skin is not required for strain measurements, our devices can leverage double-sided encapsulation, which minimizes signal degradation caused by external factors. Strain data for 100 words randomly selected from Lip Reading in the Wild (LRW)48, each repeated 100 times by two participants, were collected and used for deep learning model training. Our 3D-convolution model achieved 87.53% recognition accuracy, an unprecedented performance at this vocabulary size among strain-gauge-based SSIs. Analysis of data measured over multiple days from the two subjects suggested that our system captured the characteristics of each word rather than the individual user's characteristics or the precise attachment locations. We believe this result is comparable with the state-of-the-art SSI using sEMG dry electrodes, whose dimensions are approximately two orders of magnitude larger than those of our strain gauge12. We also fabricated an sEMG sensor with the same dimensions as our strain gauge, which achieved only 35.00% accuracy. This comparison confirms the scalability advantage of our system for extended-vocabulary classification.
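As a sanity check on the quoted gauge factor, the relation above can be evaluated numerically with representative literature values for p-type silicon; the numbers below are assumptions for this sketch, not parameters reported in this work.

```python
# Back-of-the-envelope gauge-factor estimate: G = 1 + 2*nu + pi*E.
# All values are assumed, representative figures for p-type Si, not from this work.
nu = 0.27        # Poisson's ratio of Si (dimensionless)
pi_l = 72e-11    # longitudinal piezoresistive coefficient of lightly doped p-Si, 1/Pa
E = 130e9        # Young's modulus of Si, Pa (matching the ~130 GPa quoted later)

G = 1 + 2 * nu + pi_l * E
print(f"G ≈ {G:.0f}")  # ≈ 95; with ~80% of pi_l retained after doping, G ≈ 76,
                       # consistent with the ~100 scale quoted above
```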
Overview of SSI with a strain sensor
Figure 1a shows the stacked structure of our stretchable sensor, which embeds two silicon-nanomembrane (SiNM)-based strain gauges (thickness ~300 nm) oriented perpendicular to each other within flexible polymer layers. The total thickness of the fabricated device was less than 8 µm, enabling conformal attachment to the skin when a water-soluble tape was used as a temporary-tattoo-style carrier. During silent speech, muscle movements around the mouth induce skin deformation, which can be precisely monitored by the perpendicularly placed strain gauges. The highly sensitive SiNM-based strain gauges and the flexible polyimide film have relatively high Young's moduli of approximately 130 and 1 GPa, respectively, making them, as flat films, poor candidates for stretchable devices. Therefore, all components of our sensor are patterned into mesh and serpentine structures to achieve the stretchability and long-term stability required for this application17, 19, 46.
Each part of the facial skin stretches to a different degree and in a different direction during silent speech, depending on the target word. Accordingly, selecting proper sensor locations contributes significantly to SSI performance. For this purpose, we conducted an auxiliary vision-recognition experiment that extracts the area features of the face (Supplementary Fig. 1a). Among 24 randomly partitioned compartments around the mouth, the sections with larger areal changes during silent speech were assumed to involve larger strain gradients. Relevance-weighted class activation map (R-CAM) analysis revealed significant changes in the areas just below the lower lip (Sections 1, 4, 5, and 9 in Supplementary Fig. 1b)49. Additionally, an ablation study revealed no significant difference in recognition accuracy between acquiring data from one side and from both sides of the face, because the facial skin moves almost symmetrically during silent speech (Supplementary Fig. 2a). Considering ease of attachment and signal diversity, the four sites were determined as S1(A), S2(B), S3(C), and S4(D) (Fig. 1b), matching Sections 15, 16, 20, and 24, respectively, in Supplementary Fig. 2b.
Figure 1c shows that the four strain sensors, each incorporating two SiNM gauges, captured the resistance change in the time domain through eight independent channels when a word such as "ABSOLUTELY" was silently pronounced. As the shape of the subject's mouth varied, each channel recorded a different resistance change; the normalized resistance changes of all channels at each time point were mapped into a 2 × 4 heatmap. By concatenating these matrices along the time axis, each target word was digitized into a three-dimensional (3D) matrix encoding both position and time information.
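A minimal sketch of this digitization is shown below; the channel-to-grid assignment (horizontal gauges in one row, vertical gauges in the other, sensors S1–S4 as columns) is our assumption for illustration.

```python
import numpy as np

# Stand-in recording: 2 s at 300 fps, 8 channels (4 sensors x 2 gauge axes).
T, n_channels = 600, 8
signal = np.random.randn(T, n_channels)  # placeholder for measured ΔR/R traces

# Normalize each channel, then arrange the 8 channels into a 2 x 4 grid per frame
# (row 0: horizontal gauges of S1-S4; row 1: vertical gauges - assumed ordering).
normalized = (signal - signal.mean(axis=0)) / (signal.std(axis=0) + 1e-8)
frames = normalized.reshape(T, 2, 4)

# Stacking the per-frame heatmaps along time yields the 3D input matrix.
print(frames.shape)  # (600, 2, 4)
```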
Figure 1d shows the overall flow of the hardware and software processes of our SSI. When an enunciator silently uttered a random word out of the 100 words, strain information from the eight channels was recorded by a data acquisition (DAQ) system (Supplementary Figs. 3 and 4). Considering the positional correlation between the biaxial gauges, the 1D signals were arranged as a sequence of 2D images for model input. We adopted a 3D convolutional neural network (CNN) to encode spatiotemporal features from the SiNM strain gauge signals. We trained our network with five-fold cross-validation and analyzed its decision-making using explainable artificial intelligence techniques.
Hardware characterization of the biaxial strain sensor
The facial skin expands and contracts in all directions around specific points when a person speaks. Therefore, both the degree and the direction of skin extension are needed for accurate tracking of facial skin movement. Here, we designed a biaxial strain sensor that independently quantifies the strain in two mutually orthogonal directions by integrating a pair of strain gauges positioned in the horizontal and vertical directions, respectively (Fig. 2a).
To characterize the electrical properties of the SiNM-based strain gauge, uniaxial tensile strain of up to 30% was applied along the x- and y-axes, considering the elastic limit of facial skin during silent speech50. Finite element analysis (FEA) of the strain distribution demonstrates that the horizontal gauge experiences much higher strain than the vertical gauge under 30% x-axis stretching, and vice versa under 30% y-axis stretching (Fig. 2b, c). This result agrees with the actual uniaxial stretching test. Supplementary Note 1 and Supplementary Table 1 detail the piezoresistive multiphysics model used in the FEA. Figure 2d, e shows the relative resistance changes of the horizontal and vertical gauges, respectively, which increase stepwise with the applied strain. Strain applied parallel to a gauge induced a dominant resistance change, whereas orthogonal strain induced a relatively small change. Along with high sensitivity, reliable DAQ is important for SSI applications. To confirm the repeatability of our sensor, a cyclic stretching test was conducted by attaching the device to an elastomer with a modulus comparable to that of human skin. Even after 50,000 repetitions of 30% stretching, our strain sensor showed negligible change in resistance, confirming its high reliability.
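The sensitivity in such a stepwise test can be extracted as the slope of ΔR/R versus applied strain; the calibration values below are invented solely for this sketch.

```python
import numpy as np

# Hypothetical stepwise-stretching calibration (values invented for illustration).
strain = np.array([0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30])   # applied strain
dR_over_R = np.array([0.0, 4.8, 9.5, 14.1, 18.9, 23.5, 28.2])   # measured ΔR/R

# The gauge factor is the slope of ΔR/R against strain.
gauge_factor, intercept = np.polyfit(strain, dR_over_R, 1)
print(f"fitted gauge factor ≈ {gauge_factor:.1f}")
```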
A metal-based strain gauge with a structure identical to that of the SiNM-based strain gauge was also fabricated to assess its feasibility for this application. Figure 2g compares the relative resistance changes of the SiNM-based and metal-based gauges while stretching up to 30%. The SiNM-based gauge was approximately 42.7, 28.9, and 20.8 times more sensitive at 10%, 20%, and 30% stretching, respectively, than the metal-based gauge. Figure 2h shows the relative resistance changes captured by the two gauges across eight channels while silently pronouncing the same word, "WITHOUT." Owing to its high gauge factor, the SiNM-based gauge exhibited a pronounced waveform, whereas the metal-based gauge showed almost indistinguishable changes. In the normalized waveform, which is the input form for feature extraction, the SiNM-based gauge exhibited a distinct resistance change between 0.5 and 1.5 s, whereas no conspicuous change was observed for the metal-based gauge because its signal was at a level similar to the noise.
Three-dimensional CNN for SiNM strain gauge signal analysis
Our goal was to classify the 100 words from the SiNM strain gauge signals, each 2 s long and measured at 300 frames per second. To utilize both spatial and temporal information, we used a 3D CNN model for the classification task. Figure 3a illustrates the detailed architecture of the model. Our model comprises seven 3D convolution layers and three fully connected (FC) layers. We used a kernel size of (3,3,3), padding of (1,1,1), and stride of (1,1,1) for all layers except Conv3, where we used a kernel size of (3,1,3), padding of (1,0,1), and stride of (2,1,2) for downsampling. Each layer is followed by instance normalization and ReLU activation. No pooling layers were used, to preserve localized spatial information. The output features of the last convolution layer (Conv7) were flattened and passed to the FC layers for classification. We used cross-entropy loss and the Adam optimizer51 to train our 3D CNN. More details are provided in Supplementary Table 2.
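A PyTorch sketch of this architecture is given below. The convolution hyperparameters follow the description above, but the channel widths, FC sizes, and learning rate are our assumptions; the exact values are in the paper's Supplementary Table 2.

```python
import torch
import torch.nn as nn

class SSINet(nn.Module):
    """Sketch of the described 3D CNN; channel widths and FC sizes are assumed."""
    def __init__(self, n_classes=100):
        super().__init__()

        def block(c_in, c_out, k=(3, 3, 3), p=(1, 1, 1), s=(1, 1, 1)):
            # Conv3d + instance normalization + ReLU, as described in the text.
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=k, padding=p, stride=s),
                nn.InstanceNorm3d(c_out),
                nn.ReLU(inplace=True),
            )

        self.features = nn.Sequential(
            block(1, 16),                                          # Conv1
            block(16, 16),                                         # Conv2
            block(16, 32, k=(3, 1, 3), p=(1, 0, 1), s=(2, 1, 2)),  # Conv3 (downsampling)
            block(32, 32),                                         # Conv4
            block(32, 64),                                         # Conv5
            block(64, 64),                                         # Conv6
            block(64, 64),                                         # Conv7
        )
        # No pooling; the Conv7 output (64, 300, 2, 2) is flattened into three FC layers.
        self.classifier = nn.Sequential(
            nn.Linear(64 * 300 * 2 * 2, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 128), nn.ReLU(inplace=True),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, 600 frames, 2, 4)
        return self.classifier(self.features(x).flatten(1))

model = SSINet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate assumed
logits = model(torch.randn(2, 1, 600, 2, 4))
print(logits.shape)  # torch.Size([2, 100])
```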
Results of silent speech recognition
We performed a word classification task with our SSI system on 100 datasets per word for 100 words recorded from two subjects (see Supplementary Table 3 for details). To gain insight into the generalization performance of our proposed system on independent data, we performed five-fold cross-validation with randomly shuffled datasets. Figure 3b shows the results. The accuracy across the five folds ranged from 80.1% to 91.55%, with an average of 87.53%. We also evaluated word classification accuracy while varying the amount of training data and compared the results with a conventional support vector machine (SVM)-based classification model (Fig. 3c)52. As expected, the accuracy on the "Fold 5" validation set improved as the training data increased, from 23.70% with 10 cases to 87.50% with 80 cases. Our model showed at least 15% higher accuracy than the SVM model whenever 20 or more training datasets were used. We also investigated how performance varies with the number of channels used. As shown in Fig. 3d, word accuracy improved from 49.87% to 87.53% as the number of channels increased from 2 to 8. These results were obtained by averaging over all feasible combinations of horizontal and vertical channels.
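A minimal sketch of the five-fold protocol, using scikit-learn's KFold (the training and evaluation steps are placeholders for the 3D CNN pipeline above):

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 words x 100 repetitions; we split trial indices rather than the raw tensors.
labels = np.repeat(np.arange(100), 100)   # word index of each of the 10,000 trials
indices = np.arange(len(labels))

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(indices), start=1):
    # Placeholder: train SSINet on train_idx trials, evaluate accuracy on test_idx.
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test trials")
```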
To put our SSI's performance in context, we compared it with two baseline classifiers: a correlation model and an SVM. Figure 3e shows the confusion matrices of the recognition results for 20 datasets (Fold 3) using these classifiers. The correlation model used one target dataset as a reference and computed the cosine similarity of each word between the reference and the other datasets; the word with the highest similarity score was taken as the prediction. We repeated this procedure while changing the reference dataset, and Figure 3e shows the results averaged over all cases. Our proposed method reached 91.55% accuracy on the Fold 3 validation set, significantly exceeding the correlation and SVM baselines (average accuracies of 10.26% and 76.30%, respectively). Supplementary Tables 4 and 5 present the per-word accuracy of our SSI. Furthermore, we evaluated how our model's performance varies on unseen data, where accuracy may drop owing to mismatched sensor locations and subject dependency. Although the unseen datasets were drawn from a domain completely different from that of the test datasets, the classification accuracy could be gradually improved by adapting the model through transfer learning, which raised the accuracy sharply, up to 88%. This demonstrates that our sensors extract meaningful signals even when the attachment points are slightly misplaced. Supplementary Table 6 details these results.
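The correlation baseline, as we read it, amounts to template matching by cosine similarity; the sketch below assumes one flattened reference trial per word.

```python
import numpy as np

def correlation_classify(references, queries):
    """references: (100, D) - one flattened reference trial per word (templates).
    queries: (N, D) - flattened trials to classify.
    Returns the index of the most similar reference word for each query."""
    ref = references / np.linalg.norm(references, axis=1, keepdims=True)
    qry = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    similarity = qry @ ref.T            # cosine similarities, shape (N, 100)
    return similarity.argmax(axis=1)    # predicted word index per query

# Example with stand-in data (D = 600 frames x 2 x 4 channels, flattened).
refs = np.random.randn(100, 4800)
preds = correlation_classify(refs, np.random.randn(5, 4800))
print(preds)
```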
Visualization
To visualize the high-dimensional features learned by the deep learning model, we utilized t-distributed stochastic neighbor embedding (t-SNE)53, which is commonly used to map high-dimensional features into two- or three-dimensional spaces. We visualized the high-dimensional feature outputs of the 3D convolutional model in two dimensions (Fig. 4a and Supplementary Fig. 5). The t-SNE results for the 100 classes of the test dataset showed that each class was well grouped. For further analysis, we selected 10 of the 100 words and visualized them separately (Fig. 4b). Notably, words with similar pronunciations were mapped close to each other ("INCREASE" vs. "DEGREES" and "FAMILY" vs. "FAMILIES"). Supplementary Tables 4 and 5 summarize the quantitative results for these easily confused words. The raw signal waveforms of such similar words resemble each other (Fig. 4c), so distinguishing similarly pronounced words is inherently difficult for a word-level classification model. Notably, our model still classified them correctly by detecting differences in the movements of the muscles around the mouth.
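A minimal t-SNE sketch using scikit-learn (the embedding dimensionality and trial counts are placeholders):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for penultimate-layer features of the 3D CNN on test trials:
# 1,000 trials x 128-dim embeddings (both numbers assumed for illustration).
features = np.random.randn(1000, 128)
labels = np.repeat(np.arange(100), 10)   # word label per trial, for coloring

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(embedding.shape)  # (1000, 2): one 2D point per trial, colored by word
```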
We analyzed the behavior of our deep-learning-based classification model through R-CAM49, a method for visualizing how much each input region contributes to the classification decision. Figure 4d illustrates the R-CAM results for the words "ABSOLUTELY" and "AFTERNOON." For both words, our model focused on the segments in which the S2 sensor signal (third- and fourth-row signals) showed dominant characteristic movements. For "ABSOLUTELY," the model focused on the downward and upward convexities of sensor S2 at around 0.6 s. For "AFTERNOON," it similarly focused on the downward convex point in both cases, at around 1 s for "AFTERNOON(i)" and around 0.7 s for "AFTERNOON(ii)." These results demonstrate that our model did not overfit to incidental features of the signal data but focused on characteristic segments where the resistance variation was large.
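R-CAM itself follows ref. 49; as a rough stand-in conveying the idea, a plain Grad-CAM relevance map over the last convolution block of the SSINet sketch above can be computed as follows (this is explicitly not the authors' exact method):

```python
import torch

def grad_cam_3d(model, x, target_class):
    """Plain Grad-CAM over Conv7 of the SSINet sketch; a stand-in for R-CAM."""
    act = model.features(x)                # Conv7 activations, (1, C, T', H', W')
    act.retain_grad()                      # keep gradients for this non-leaf tensor
    logits = model.classifier(act.flatten(1))
    model.zero_grad()
    logits[0, target_class].backward()

    weights = act.grad.mean(dim=(2, 3, 4), keepdim=True)  # channel importance
    cam = torch.relu((weights * act).sum(dim=1))          # relevance over (T', H', W')
    return (cam / (cam.max() + 1e-8)).detach()

cam = grad_cam_3d(model, torch.randn(1, 1, 600, 2, 4), target_class=0)
print(cam.shape)  # highlights when and where the input drove the prediction
```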
Comparison of word recognition performance with sEMG
As shown in Fig. 5a–c, three types of epidermal sEMG electrodes of various dimensions were also fabricated to determine the dependence of acquired signal quality on electrode size. The surface area of the small-sized electrode closely matched that of the unit cell of our strain gauge, allowing a fair comparison of the scalability of the two systems, whereas the medium- and large-sized electrodes were comparable to the conventional epidermal sEMG electrodes used in other SSIs12, 31. A pair of two-channel sEMG electrodes and one commercial EMG reference electrode were attached to the buccinator and near the posterior mastoid, respectively, and the sEMG signal was acquired at a sampling frequency of 1000 Hz while the subject clenched the jaw. The raw sEMG signal was preprocessed with a commercial EMG module comprising three filters and an amplifier before being transmitted to a DAQ module (Supplementary Fig. 6). The calculated SNR (1.517, 5.964, and 8.378 for the small-, medium-, and large-sized electrodes, respectively) increased with electrode dimension because of the lowered surface impedance (see Fig. 5a–c, bottom), revealing a fundamental limitation in improving the spatial resolution of sEMG.
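One common way to compute such an SNR is the ratio of root-mean-square amplitudes between the active (jaw-clench) and resting segments; the segmentation and signal below are assumptions for illustration, and the paper's exact SNR definition may differ.

```python
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a signal segment."""
    return np.sqrt(np.mean(np.square(x)))

# Hypothetical 1 kHz sEMG trace: 1 s resting baseline, then 1 s of jaw clenching.
fs = 1000
rest = 0.05 * np.random.randn(fs)     # baseline noise
clench = 0.40 * np.random.randn(fs)   # muscle-activity burst
trace = np.concatenate([rest, clench])

snr = rms(trace[fs:]) / rms(trace[:fs])   # signal RMS over noise RMS
print(f"SNR ≈ {snr:.3f}")
```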
To compare the word classification accuracy of an sEMG-based model with that of our strain-gauge-based system, four pairs of small-sized sEMG electrodes were attached to facial muscles generally selected for SSIs, including the buccinator, levator anguli oris, depressor anguli oris, and the anterior belly of the digastric (Supplementary Fig. 7)12. As with the DAQ using our strain gauges, 100 datasets of sEMG signals were obtained from the two subjects while silently speaking the 100 words, followed by hardware and software signal processing (Supplementary Fig. 6). The preprocessed datasets were randomly partitioned into five folds, and the features of each fold were extracted and cross-validated using the same method (see the flowchart in Fig. 3b). Figure 5d shows the confusion matrix of the classification results, with an average recognition accuracy of 35.00%. Although state-of-the-art sEMG-based performance with high accuracy (~92%) has been demonstrated, that system's electrode size was two orders of magnitude larger than ours12. Figure 5e shows the feature embeddings output by the deep learning model with the sEMG waveforms of the 100 words as inputs. In the 2D t-SNE mapping, points of the same color were scattered rather than clustered at specific locations, indicating the difficulty of learning representations from these data. Supplementary Fig. 8 shows the magnified t-SNE plot labeled with the 100 words. This result, probably caused by the diminished SNR, highlights the factor impeding sEMG-based recognition of extended vocabularies: high-spatial-resolution data are needed to achieve high classification accuracy on larger wordsets.