Speech-driven 3D facial animation has made remarkable progress. However, synthesizing 3D talking heads with head motion remains an open problem, because head motion is largely independent of the speech content and is therefore difficult to model with a purely speech-driven approach. To address this, we propose 3DHead-Talk, which generates 3D facial animation combined with extreme head motion. The key challenge in this work is to generate natural head movements that match the speech rhythm. We first build an end-to-end autoregressive model that combines a dual-tower encoder with a single-tower Transformer decoder: a speech encoder captures long-range audio context, a facial mesh encoder captures subtle per-vertex changes of the 3D face mesh, and the single-tower decoder autoregressively predicts the sequence of 3D facial animation meshes. The predicted sequence is then edited by a head-motion field generator to produce an output sequence containing extreme head motion. Finally, the resulting 3D facial animation under extreme head motion is rendered in synchrony with the input audio. Quantitative and qualitative results show that our method outperforms current state-of-the-art methods and stabilizes non-face regions while preserving the appearance of extreme head motion.
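
To make the described pipeline concrete, the following is a minimal sketch of a dual-tower encoder feeding a single-tower autoregressive decoder, followed by a rigid head-motion edit of the predicted meshes. All module names, dimensions, the vertex count, and the rigid-transform formulation of the head-motion edit are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the dual-tower encoder + single-tower autoregressive
# decoder described in the abstract. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class DualTowerTalker(nn.Module):
    def __init__(self, audio_dim=128, vertex_count=5023, d_model=256):
        super().__init__()
        # Tower 1: speech encoder over long-range audio features.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.speech_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4,
        )
        # Tower 2: mesh encoder over per-frame vertex displacements.
        self.mesh_proj = nn.Linear(vertex_count * 3, d_model)
        self.mesh_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Single-tower decoder: autoregressively predicts the next mesh frame.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.to_vertices = nn.Linear(d_model, vertex_count * 3)

    def forward(self, audio_feats, past_meshes):
        """audio_feats: (B, Ta, audio_dim); past_meshes: (B, Tm, V*3)."""
        memory = self.speech_enc(self.audio_proj(audio_feats))
        tgt = self.mesh_enc(self.mesh_proj(past_meshes))
        causal = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)
        ).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        # Predict the next-frame vertex offsets for each time step.
        return self.to_vertices(hidden)


def apply_head_motion(meshes, rotations, translations):
    """Illustrative head-motion edit: rigidly transform each predicted frame.
    meshes: (B, T, V, 3); rotations: (B, T, 3, 3); translations: (B, T, 3)."""
    return torch.einsum("btij,btvj->btvi", rotations, meshes) + translations.unsqueeze(2)
```

In this sketch the head-motion field is reduced to a per-frame rigid transform for brevity; the paper's motion field generator is described only at a high level in the abstract, so the edit step here should be read as a placeholder for that component.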