In recent years, rapid advances in deep learning have driven numerous innovations in computer vision and graphics research. 3D face reconstruction from 2D images has received tremendous attention in computer vision and has made major progress thanks to the highly accurate modeling capability of deep learning. 3D face reconstruction enables a wide range of applications such as speech-driven 3D facial animation, 3D avatar generation, virtual makeup, performance capture, virtual and augmented reality, and human-robot interaction [2–7].
Most existing studies use pre-computed 3D morphable models (3DMMs), which encode prior knowledge about facial geometry and appearance, to improve the accuracy and fidelity of 3D face reconstruction [8, 9]. Recent studies employ self-supervised deep learning frameworks to predict 3DMM parameters from input images. They can create plausible 3D faces without ground-truth 3D facial scan data by training the deep neural networks with various loss functions, such as the landmark reprojection loss, photometric loss, and face recognition loss [1, 10–13].
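The three losses named above are typically combined as a weighted sum. The following NumPy sketch illustrates one plausible form of such a self-supervised objective; the function name, arguments, and weights are illustrative assumptions, not the formulation of any specific paper.

```python
import numpy as np

def self_supervised_loss(pred_lmk, det_lmk, rendered, image,
                         emb_render, emb_input,
                         w_lmk=1.0, w_photo=1.0, w_id=0.2):
    """Weighted sum of the three losses commonly used for
    self-supervised 3DMM fitting (weights are illustrative)."""
    # Landmark reprojection loss: mean 2D distance between the
    # projected model landmarks and the detected image landmarks.
    l_lmk = np.linalg.norm(pred_lmk - det_lmk, axis=-1).mean()
    # Photometric loss: per-pixel error between the rendered face
    # and the input image (often masked to the face region).
    l_photo = np.abs(rendered - image).mean()
    # Face recognition (identity) loss: cosine distance between
    # embeddings of the rendered and input faces.
    cos = (emb_render @ emb_input) / (
        np.linalg.norm(emb_render) * np.linalg.norm(emb_input))
    l_id = 1.0 - cos
    return w_lmk * l_lmk + w_photo * l_photo + w_id * l_id
```

Because every term is computed against the input image itself, no ground-truth 3D scan is required, which is what makes this style of training self-supervised.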
Recently, various new loss functions and architectures have been introduced to address the limitations of existing methods in accurately reconstructing rich and detailed facial expressions [12, 13, 46, 47]. In particular, the method of capturing emotions and reconstructing them into 3D faces demonstrates notable efficacy [12]. Meanwhile, the Facial Action Coding System (FACS) describes a taxonomy of action units (AUs) for encoding facial movements and expressions, based on the observation of muscle activations [15]. It has been observed that the existing 3D face reconstruction process handles emotions proficiently, while its performance in encoding AUs is comparatively modest [48]. A number of studies have emphasized the importance of utilizing AUs in 3D face reconstruction [46, 47]. However, they do not explicitly consider the correlations between AUs occurring in the frame-based reconstruction process, and they require AU labels during training, so their performance is not guaranteed in in-the-wild scenarios. In this paper, we leverage AU features extracted from in-the-wild images in the frame-based reconstruction process. Our approach enables accurate 3D face reconstruction while accounting for AUs by utilizing a Transformer to model the correlations between AUs within frames. This correlation is an important factor to model, since human facial expressions are generally formed by multiple AUs acting together. A proper method of modeling and leveraging the correlation on top of global facial features, rather than the straightforward use of information about individual AUs, may therefore play a crucial role in reconstructing accurate facial expressions.
In this paper, we propose AUFART (AU feature-based 3D FAce Reconstruction with Transformer), which enables detailed modeling of various facial expression types based on AU information for 3D face reconstruction. Unlike existing methods that use only global facial features generated from the face in an image by an encoder network, our method enhances the performance of the 3D face reconstruction model by providing a richer representation of subtle details in facial expressions. A Transformer-based 3D face reconstruction model exploits the AU-specific features, as well as the relationships between these features, through the cross-attention mechanism. Several novel AU-based loss functions are also proposed. The reconstructed 3D faces generated by our method are found to be more responsive to the activated AUs in the input images.
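The cross-attention mechanism mentioned above can be sketched as follows: a set of query tokens (e.g. tokens tied to 3DMM parameter groups) attends over per-AU feature tokens, so each prediction can draw on correlated AU activations. This is a minimal single-head sketch with illustrative names; it is not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, au_features, Wq, Wk, Wv):
    """Single-head cross-attention: `queries` (n_q, d_in) attend
    over per-AU tokens `au_features` (n_au, d_in). The projection
    matrices Wq/Wk/Wv are learned in practice; here they are
    plain arrays for illustration."""
    Q = queries @ Wq          # (n_q, d)
    K = au_features @ Wk      # (n_au, d)
    V = au_features @ Wv      # (n_au, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn = softmax(scores, axis=-1)  # each query's weighting over AUs
    return attn @ V, attn
```

Each row of `attn` is a distribution over AUs, which is how the model can express that a given expression parameter depends on several co-occurring AUs rather than a single one.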
In summary, our proposed framework comprises three key contributions: (i) We propose a Transformer-based 3D face reconstruction framework that leverages the features of AUs in the frame-based 3D face reconstruction process, explicitly considering their correlations; (ii) We integrate a state-of-the-art AU feature extraction module for effective AU feature extraction from in-the-wild images, along with a Transformer model for reconstructing 3D faces from these features. This integration enables high-accuracy facial reconstruction even in diverse environmental conditions and allows modeling of challenging correlations among AUs that are less easily captured; (iii) To ensure that AU information is precisely reconstructed in 3D, we design AU-based loss functions for training our proposed 3D face reconstruction framework.
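To make contribution (iii) concrete, one plausible form of an AU-based loss penalizes disagreement between AU activations detected on the input image and on the rendered reconstruction. This is only a hedged sketch; the paper's actual loss functions may differ in form and weighting.

```python
import numpy as np

def au_consistency_loss(au_input, au_rendered, weights=None):
    """Sketch of an AU-based loss (illustrative, not the paper's
    exact formulation): squared error between AU activations
    detected on the input image and on the rendered face,
    optionally weighted per AU (e.g. by detection confidence)."""
    au_input = np.asarray(au_input, dtype=float)
    au_rendered = np.asarray(au_rendered, dtype=float)
    err = (au_input - au_rendered) ** 2
    if weights is not None:
        err = err * np.asarray(weights, dtype=float)
    return err.mean()
```

A loss of this shape directly rewards reconstructions whose rendered expressions trigger the same AUs as the input, which is the behavior the contribution targets.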