Current deep learning methods for heterogeneous face recognition (HFR) rely on paired multimodal image data for training, but such data are difficult to collect. In this paper, we propose an unsupervised deep learning method that requires only unpaired multimodal image data. The method employs a variational autoencoder (VAE) together with the discriminator of a generative adversarial network (GAN) to disentangle the given heterogeneous image data into domain-independent semantic features and domain-dependent style features. Specifically, the VAE uses its latent space to disentangle the features and to explicitly encode the domain-independent semantic features, which are used to match face images across modalities. The discriminator distinguishes the domains of images generated by the VAE, thereby improving the VAE's ability to separate domain information. Moreover, multi-scale feature aggregation is incorporated into the encoder of the VAE so that the domain-independent semantic features capture structural information at multiple scales. Experimental results on three widely used face datasets demonstrate the effectiveness of the proposed method. Our code will be made available on GitHub.
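The core idea above, splitting a latent code into a domain-independent semantic part and a domain-dependent style part, and matching faces across modalities using only the semantic part, can be illustrated with a minimal sketch. All dimensions, the linear "encoder", and the split point are hypothetical stand-ins for the paper's VAE, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
IMG_DIM, SEM_DIM, STY_DIM = 64, 8, 4

# A toy linear "encoder" whose latent space is split into a
# domain-independent semantic part and a domain-dependent style part.
W = rng.standard_normal((IMG_DIM, SEM_DIM + STY_DIM)) / np.sqrt(IMG_DIM)

def encode(x):
    """Project an image vector and split the latent code in two."""
    z = x @ W
    return z[:SEM_DIM], z[SEM_DIM:]  # (semantic, style)

def cosine(a, b):
    """Cosine similarity used to match faces across modalities."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two modalities of the same identity: shared face content plus
# small modality-specific perturbations (e.g. visible vs. near-infrared).
content = rng.standard_normal(IMG_DIM)
vis = content + 0.1 * rng.standard_normal(IMG_DIM)
nir = content + 0.1 * rng.standard_normal(IMG_DIM)

sem_vis, _ = encode(vis)
sem_nir, _ = encode(nir)

# Matching on semantic features only yields a high cross-modal similarity.
print(cosine(sem_vis, sem_nir))
```

In the actual method, the encoder is a deep VAE trained with a GAN discriminator so that style information is pushed out of the semantic part; this sketch only shows why a shared semantic subspace makes cross-modal matching possible.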