In recent years, rapid advances in deep learning have driven numerous innovations in computer vision and graphics research. 3D face reconstruction from 2D images has received tremendous attention in computer vision and has made major progress thanks to the highly accurate modeling capability of deep learning. 3D face reconstruction enables a wide range of applications such as speech-driven 3D facial animation, 3D avatar generation, virtual makeup, performance capture, virtual and augmented reality, and human-robot interaction [2–7].
Most existing studies use pre-computed 3D morphable models (3DMMs), which encode prior knowledge about facial geometry and appearance, to improve the accuracy and fidelity of 3D face reconstruction [8, 9]. Recent studies employ self-supervised deep learning frameworks to predict 3DMM parameters from input images. They can create plausible 3D faces without ground-truth 3D facial scan data by training the deep neural networks with various loss functions, such as the landmark reprojection loss, photometric loss, and face recognition loss [1, 10–13].
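The three losses named above are typically combined as a weighted sum. The following NumPy sketch illustrates one plausible form of such a self-supervised objective; the function name, arguments, and weights are illustrative assumptions, not the formulation of any specific paper.

```python
import numpy as np

def self_supervised_loss(pred_lmk, det_lmk, rendered, image,
                         emb_render, emb_input,
                         w_lmk=1.0, w_photo=1.0, w_id=0.2):
    """Weighted sum of the three losses commonly used for
    self-supervised 3DMM fitting (weights are illustrative)."""
    # Landmark reprojection loss: mean 2D distance between the
    # projected model landmarks and the detected image landmarks.
    l_lmk = np.linalg.norm(pred_lmk - det_lmk, axis=-1).mean()
    # Photometric loss: per-pixel error between the rendered face
    # and the input image (often masked to the face region).
    l_photo = np.abs(rendered - image).mean()
    # Face recognition (identity) loss: cosine distance between
    # embeddings of the rendered and input faces.
    cos = (emb_render @ emb_input) / (
        np.linalg.norm(emb_render) * np.linalg.norm(emb_input))
    l_id = 1.0 - cos
    return w_lmk * l_lmk + w_photo * l_photo + w_id * l_id
```

Because every term is computed against the input image itself, no ground-truth 3D scan is required, which is what makes this style of training self-supervised.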
Recently, various new loss functions and architectures have been introduced to address the limitations of existing methods in accurately reconstructing rich and detailed facial expressions [12, 13, 46, 47]. In particular, the method of capturing emotions and reconstructing them into 3D faces demonstrates notable efficacy [12]. Meanwhile, the Facial Action Coding System (FACS) describes a taxonomy of action units (AUs) for encoding facial movements and expressions, based on the observation of muscle activations [15]. It has been observed that the existing 3D face reconstruction process handles emotions proficiently, while its performance in encoding AUs is comparatively modest [48]. A number of studies have emphasized the importance of utilizing AUs in 3D face reconstruction [46, 47]. However, they do not explicitly consider the correlations between AUs occurring in the frame-based reconstruction process, and they require AU labels during training, so their performance is not guaranteed in in-the-wild scenarios. In this paper, we leverage AU features extracted from in-the-wild images in the frame-based reconstruction process. Our approach enables accurate 3D face reconstruction while accounting for AUs by utilizing a Transformer to model the correlations between AUs within frames. This correlation is an important factor to model, since human facial expressions are generally formed by multiple AUs acting together. A proper method of modeling and leveraging the correlation on top of global facial features, rather than the straightforward use of information about individual AUs, may therefore play a crucial role in reconstructing accurate facial expressions.
In this paper, we propose AUFART (AU feature-based 3D FAce Reconstruction with Transformer), which enables detailed modeling of various facial expression types based on AU information for 3D face reconstruction. Unlike existing methods that use only global facial features generated from the face in an image by an encoder network, our method enhances the performance of the 3D face reconstruction model by providing a richer representation of subtle details in facial expressions. A Transformer-based 3D face reconstruction model exploits the AU-specific features, as well as the relationships between these features, through the cross-attention mechanism. Several novel AU-based loss functions are also proposed. The reconstructed 3D faces generated by our method are found to be more responsive to the activated AUs in the input images.
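The cross-attention mechanism mentioned above can be sketched as follows: a set of query tokens (e.g. tokens tied to 3DMM parameter groups) attends over per-AU feature tokens, so each prediction can draw on correlated AU activations. This is a minimal single-head sketch with illustrative names; it is not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, au_features, Wq, Wk, Wv):
    """Single-head cross-attention: `queries` (n_q, d_in) attend
    over per-AU tokens `au_features` (n_au, d_in). The projection
    matrices Wq/Wk/Wv are learned in practice; here they are
    plain arrays for illustration."""
    Q = queries @ Wq          # (n_q, d)
    K = au_features @ Wk      # (n_au, d)
    V = au_features @ Wv      # (n_au, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn = softmax(scores, axis=-1)  # each query's weighting over AUs
    return attn @ V, attn
```

Each row of `attn` is a distribution over AUs, which is how the model can express that a given expression parameter depends on several co-occurring AUs rather than a single one.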
In summary, our proposed framework comprises three key contributions: (i) We propose a Transformer-based 3D face reconstruction framework that leverages the features of AUs in the frame-based 3D face reconstruction process, explicitly considering their correlations; (ii) We integrate a state-of-the-art AU feature extraction module for effective AU feature extraction from in-the-wild images, along with a Transformer model for reconstructing 3D faces from these features. This integration enables high-accuracy facial reconstruction even in diverse environmental conditions and allows modeling of challenging correlations among AUs that are less easily captured; (iii) To ensure that AU information is precisely reconstructed in 3D, we design AU-based loss functions for training our proposed 3D face reconstruction framework.
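To make contribution (iii) concrete, one plausible form of an AU-based loss penalizes disagreement between AU activations detected on the input image and on the rendered reconstruction. This is only a hedged sketch; the paper's actual loss functions may differ in form and weighting.

```python
import numpy as np

def au_consistency_loss(au_input, au_rendered, weights=None):
    """Sketch of an AU-based loss (illustrative, not the paper's
    exact formulation): squared error between AU activations
    detected on the input image and on the rendered face,
    optionally weighted per AU (e.g. by detection confidence)."""
    au_input = np.asarray(au_input, dtype=float)
    au_rendered = np.asarray(au_rendered, dtype=float)
    err = (au_input - au_rendered) ** 2
    if weights is not None:
        err = err * np.asarray(weights, dtype=float)
    return err.mean()
```

A loss of this shape directly rewards reconstructions whose rendered expressions trigger the same AUs as the input, which is the behavior the contribution targets.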