Species distribution models (SDMs) have become a fundamental tool in ecology, biodiversity conservation, biogeography and natural resource management (Franklin, 2013; Guillera-Arroita et al., 2015; Guisan et al., 2013; Guisan & Thuiller, 2005; Newbold, 2010). Traditional SDMs typically correlate the presence (or presence/absence) of species at multiple sites with relevant environmental covariates (temperature, precipitation, altitude, land cover, soil type, etc.) to estimate habitat preferences or predict distributions; these outputs are commonly used to inform ecological and biogeographical theory as well as conservation decisions (Bekessy et al., 2009; Ferrier et al., 2002; Keith et al., 2014; Pearce & Lindenmayer, 1998; Reşit Akçakaya et al., 1995). Researchers in ecology and geology also use remote sensing images for automatic classification, which provides a convenient means of mapping forests and land-use types. As remote sensing technology has developed and image accuracy has improved, many researchers have tried to incorporate remote sensing data into species distribution models (Brown et al., 2014; Cerrejón et al., 2020; K. S. He et al., 2015; Sumsion et al., 2019; B. Zhang et al., 2020). However, these studies only extracted the pixel values corresponding to species occurrence points and did not use the spatial information in the images, so they are essentially still modeling with environmental covariates. For example, Cerrejón et al. (2020) combined remote sensing data with species distribution models to predict and map bryophyte communities and diversity patterns across a 28,436 km² northern forest in Quebec, Canada (Cerrejón et al., 2020).
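As a deliberately minimal illustration of the correlative approach described above, the numpy-only sketch below fits a logistic regression relating presence/absence records to standardized environmental covariates and predicts the occurrence probability at a new site. All data here are synthetic and the covariate interpretation (e.g. temperature, precipitation, altitude) is a hypothetical labeling, not any study's actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic survey: 200 sites, 3 standardized environmental covariates
# (hypothetically: temperature, precipitation, altitude).
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -1.0, 0.5])                  # assumed habitat response
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=200) < p_true).astype(float)   # presence (1) / absence (0)

# Fit logistic regression by gradient descent: P(presence | covariates).
w, b = np.zeros(3), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y) / len(y))
    b -= 0.1 * np.mean(p - y)

# In-sample fit and a prediction for one new site.
p_fit = 1.0 / (1.0 + np.exp(-(X @ w + b)))
train_acc = float(np.mean((p_fit > 0.5) == (y == 1)))
x_new = np.array([0.5, -0.2, 0.1])
p_new = float(1.0 / (1.0 + np.exp(-(x_new @ w + b))))
```

The fitted coefficients play the role of estimated habitat preferences: their signs and magnitudes describe how occurrence probability responds to each covariate, which is exactly the point-based information the remote-sensing SDM studies above extract at occurrence locations.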
Bhattarai et al. (2020) combined Sentinel-1 synthetic aperture radar (SAR) and Sentinel-2 multispectral images with several site variables, for a total of 191 covariates, and used a random forest to map the distribution and abundance of spruce budworm (SBW) host species in northern New Brunswick, Canada (Bhattarai et al., 2020).
Just as the ImageNet dataset (Deng et al., 2009) greatly advanced the development of deep learning techniques, many researchers working on remote sensing image-based SDMs have started to produce and open-source their own datasets for comparing the performance and accuracy of different models, advancing the technology in the field. Marconi et al. (2019) held a competition for tree species classification based on remote sensing images, which offered three tracks on canopy segmentation, tree alignment and species classification, contributing to the development of remote sensing methods for ecology and biology (Marconi et al., 2019). Deneu et al. (2020) organized a competition called GeoLifeCLEF to study the relationship between the environment and the possible occurrence of species; its dataset collects 1.9 million observations paired with corresponding remote sensing data, the largest open-source remote sensing dataset for studying species distributions to date (Deneu et al., 2020).
As deep learning algorithms achieve breakthroughs in various unimodal tasks, researchers have begun to focus on multimodal tasks that are closer to the real world. Multimodal models are now widely used in areas such as image captioning, text-to-image generation, visual question answering, and visual reasoning (C. Zhang et al., 2019). Yu et al. (2020) proposed a multimodal deep attention network model based on a deep self-attention network for image captioning, which takes an image as input and outputs a corresponding text description; it won first place in the MSCOCO Image Captioning Challenge (Yu et al., 2020). Xu et al. (2018) proposed a deep attentional multimodal similarity model to train a text-to-image generator, which can fill in missing details in images based on the input text descriptions; it improved the best score by 14.14% on the CUB dataset and by 170.25% on the COCO dataset (Xu et al., 2018). Ben-Younes et al. (2017) proposed a multimodal model based on Tucker decomposition for visual question answering (Ben-Younes et al., 2017). Such models typically take a textual question and a corresponding image as input and output the answer to that question. Nam et al. (2017) proposed Dual Attention Networks (DAN), which use visual and textual attention mechanisms to capture the interactions between vision and language and can be used for multimodal reasoning and matching (Nam et al., 2017).
Before the Transformer model (Vaswani et al., 2017) emerged, the backbone of multimodal models was primarily the convolutional neural network. The Transformer was first proposed for natural language processing (NLP), where it achieved strong results; it and its variants were subsequently applied to the vision field, also with good results, and some researchers have now begun to apply Transformer models to multimodal domains. Lu et al. (2019) proposed the ViLBERT model for visual question answering tasks, which uses BERT as its main architecture and achieved the best results on all four vision-and-language tasks tested (Lu et al., 2019). Chen et al. (2019) proposed UNITER, a joint image-text representation network that uses a multilayer Transformer mechanism and achieved the best results on six vision-and-language (V+L) datasets (Chen et al., 2019). Li et al. (2021) proposed SemVLP, a new pre-training method built on a Transformer network, which can effectively align cross-modal representations at different semantic granularities (Li et al., 2021). We note that most current Transformer-based multimodal models target combinations of vision and language, while no such model has yet been proposed for species distribution prediction.
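The vision-and-language Transformers above share one core mechanism: tokens from one modality attend to tokens from another via scaled dot-product cross-attention. The numpy sketch below isolates that mechanism; the token counts, dimensions, and random projection weights are illustrative assumptions, not any published model's configuration.

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: tokens from one modality
    (queries) attend to tokens from another modality (keys/values)."""
    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Numerically stable softmax over the key/value tokens.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
text_tokens = rng.normal(size=(4, d))   # e.g. 4 language tokens
image_tokens = rng.normal(size=(6, d))  # e.g. 6 visual-region tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Each text token becomes a mixture of image-token values,
# weighted by learned (here: random) query-key similarity.
out, attn = cross_attention(text_tokens, image_tokens, Wq, Wk, Wv)
```

Models such as ViLBERT stack layers of this kind (with learned weights, multiple heads, and residual connections) so that the two token streams repeatedly condition on each other.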
Although multiple modalities, including environmental variables and remote sensing images, can be obtained for a single sample point in species distribution research, we have so far not found studies that use multimodal models to predict species distributions. We therefore propose a Transformer-based multimodal model for species distribution prediction and explore two questions: Is a model that takes both remote sensing images and environmental variables as inputs more accurate than a model that uses only one of the two? And how much do different structural fusion methods in the Transformer-based backbone affect the model results?
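To make the second question concrete, the numpy sketch below contrasts two fusion structures a Transformer-based SDM backbone could use: early fusion, where image-patch tokens and embedded environmental variables are concatenated into one sequence before a joint attention layer, and late fusion, where each modality is encoded separately and the pooled vectors are combined afterwards. All shapes and weights here are hypothetical; this is a schematic of the design space, not our model's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical per-sample token embeddings from two modality encoders:
img_tokens = rng.normal(size=(9, d))  # e.g. 3x3 remote-sensing patch tokens
env_tokens = rng.normal(size=(5, d))  # e.g. 5 embedded environmental variables

def self_attention(tokens, W):
    """One (single-head, random-weight) self-attention layer."""
    Q, K, V = tokens @ W[0], tokens @ W[1], tokens @ W[2]
    s = Q @ K.T / np.sqrt(d)
    a = np.exp(s - s.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ V

W = [rng.normal(size=(d, d)) for _ in range(3)]

# Early fusion: one joint sequence, so attention can mix modalities.
early = self_attention(np.vstack([img_tokens, env_tokens]), W).mean(axis=0)

# Late fusion: encode each modality separately, combine pooled vectors.
late = np.concatenate([self_attention(img_tokens, W).mean(axis=0),
                       self_attention(env_tokens, W).mean(axis=0)])
```

The resulting pooled vector would feed a classification head predicting species occurrence; the point of the comparison is that early fusion lets cross-modal interactions form inside attention, while late fusion defers all interaction to the final combination step.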