## Experimental Data

The experimental data were collected from ZhongDa Hospital of Southeast University in Nanjing, Jiangsu Province. The dataset contains 1000 cases of pregnant women from 2020 to 2021, of which 221 are cases of gestational diabetes mellitus. To support the predictive performance of the proposed method, the raw data undergo several preprocessing steps, including desensitization, normalization, and alignment. After preprocessing, each individual record includes 27 items covering medical history, personal history, genetic history, marital history, physical examination, obstetric status, and clinical laboratory data (Fig. 1).

The dark blue panel on the left of Fig. 1 shows the distribution of the individual data, where each row represents an individual and each column represents one feature dimension. As Fig. 1 indicates, many columns contain a large number of missing values, which are marked with white boxes. The curve on the right is the histogram of missing data per individual, showing the number of feature dimensions available for each individual. Here, 27 is the total number of variables in the data, and 16 means that only 16 variables of the corresponding sample are complete while the remaining 11 are missing (Fig. 1).

## Completion of Missing Values

To make full use of all collected data, we first complete the missing data. A matrix factorization method is adopted to complete the missing values in the original data [18]. The method is fast and generalizes well because it avoids a costly singular value decomposition of the matrix. Suppose the data matrix satisfies \(\text{X}\approx \widehat{\text{X}}=\text{U}{\text{V}}^{\text{T}}\). Our goal is to optimize an approximation \(\widehat{\text{X}}\) of the matrix X, whose entries can then fill in all missing values in X. The squared error \({\left\|{\text{R}}_{\Omega}(\text{X}-\widehat{\text{X}})\right\|}_{\text{F}}^{2}\) is commonly used to measure how closely \(\widehat{\text{X}}\) approximates the original matrix X, where \(\Omega\) is the index set of observed entries. The operator \({\text{R}}_{\Omega}\) is defined in Eq. (1):

$${\left[{R}_{\Omega}(X-\widehat{X})\right]}_{ij}=\begin{cases}{x}_{ij}-{\widehat{x}}_{ij}, & \text{if } (i,j)\in \Omega \\ 0, & \text{otherwise}\end{cases} \tag{1}$$

Thus, the loss function of our data completion method is defined in Eq. (2):

$$\begin{aligned}J&={\left\|{R}_{\Omega}(X-\widehat{X})\right\|}_{F}^{2}={\left\|{R}_{\Omega}(X-U{V}^{T})\right\|}_{F}^{2}\\&=\sum _{i,j:\,{x}_{ij}\ne \text{NaN}}{\left({x}_{ij}-\sum _{l=1}^{k}{u}_{il}{v}_{jl}\right)}^{2}\end{aligned} \tag{2}$$

Here, i and j index the rows and columns of matrix X, respectively, the sum runs only over entries with \({\text{x}}_{\text{i}\text{j}}\) ≠ NaN, and k is the number of columns of U and V. The loss can be minimized by gradient descent [19]. Specifically, the matrices U and V are randomly initialized, and the error of each observed entry is computed as in Eq. (3). The gradient is calculated from this error, and the entries of U and V are updated with learning rate α using the formulas in Eq. (4):

$${e}_{ij}={x}_{ij}-\sum _{l=1}^{k}{u}_{il}{v}_{jl} \tag{3}$$

$${u}_{il}={u}_{il}-\alpha \frac{\partial J}{\partial {u}_{il}}={u}_{il}+2\alpha {e}_{ij}{v}_{jl};\quad {v}_{jl}={v}_{jl}-\alpha \frac{\partial J}{\partial {v}_{jl}}={v}_{jl}+2\alpha {e}_{ij}{u}_{il} \tag{4}$$

Iterating these updates until the loss converges yields the factors U and V, and the entries of \(\text{U}{\text{V}}^{\text{T}}\) are used to complete all missing values.
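The update rule can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this paper: the function name `complete_matrix`, the rank `k`, the learning rate, the number of epochs, and the toy matrix are all assumptions chosen for illustration. Each observed entry contributes the error of Eq. (3), and both factors are updated from their own previous values as in Eq. (4).

```python
import math
import random

def complete_matrix(X, k=1, alpha=0.01, epochs=2000, seed=0):
    """Fill NaN entries of X (a list of rows) using a rank-k factorization U V^T."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random initialization of U (n x k) and V (d x k).
    U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(d)]
    # Only observed (non-NaN) entries contribute to the loss J.
    observed = [(i, j) for i in range(n) for j in range(d)
                if not math.isnan(X[i][j])]
    for _ in range(epochs):
        for i, j in observed:
            e = X[i][j] - sum(U[i][l] * V[j][l] for l in range(k))  # Eq. (3)
            for l in range(k):
                u_il, v_jl = U[i][l], V[j][l]
                U[i][l] = u_il + 2 * alpha * e * v_jl  # Eq. (4), update of u
                V[j][l] = v_jl + 2 * alpha * e * u_il  # Eq. (4), update of v
    # Keep observed values; replace missing ones with the reconstruction.
    return [[X[i][j] if not math.isnan(X[i][j])
             else sum(U[i][l] * V[j][l] for l in range(k))
             for j in range(d)] for i in range(n)]

# Toy rank-1 matrix (hypothetical data) with one missing entry.
nan = float("nan")
X = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 6.0, nan]]
filled = complete_matrix(X)
print(round(filled[2][2], 2))
```

In this toy example the observed entries are consistent with a rank-1 matrix, so the completed value should land close to 9. On real clinical data the rank and learning rate would need to be tuned.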

## Evaluation of Feature Importance

To analyze the importance of each feature dimension, a random forest is adopted. Random forests can help discover important clinical markers for the diagnosis of gestational diabetes mellitus. The RandomForestClassifier class in the sklearn library is used to build the random forest model, and its parameters are tuned by a standard grid search. The feature_importances_ attribute of the fitted model is then used to evaluate the importance of each feature dimension. The results of the importance analysis are shown in Fig. 7. Such an analysis can assist doctors in diagnosis by indicating which feature items matter most.
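A minimal sketch of this workflow with scikit-learn follows. The synthetic data from `make_classification`, the parameter grid, the number of folds, and the scoring metric are assumptions for illustration only; the paper does not report the grid that was actually searched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 27-feature clinical table (not the hospital data).
X, y = make_classification(n_samples=300, n_features=27,
                           n_informative=5, random_state=0)

# Hypothetical search grid; the grid used in the paper is not specified.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)

# Impurity-based importances of the best model; they sum to 1.
importances = search.best_estimator_.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

The importances returned by `feature_importances_` are impurity-based, so they are best read as a relative ranking of features rather than as effect sizes.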

## TF-GDM Prediction Model

The Transformer is a deep neural network based on a self-attention mechanism. Inspired by its powerful representation capability, researchers have extended the Transformer to classification tasks. In this paper, we construct a classification model, called TF-GDM, based on a novel Transformer for disease prediction.

## Overview

The TF-GDM model proposed in this paper is designed specifically for the prediction and diagnosis of gestational diabetes mellitus. It takes 27-dimensional data as input, covering medical history, personal history, genetic history, marital history, physical examination, obstetric conditions, and clinical laboratory data. First, features are extracted by Transformer encoders with a multi-head attention mechanism. The self-attention mechanism is mainly used to learn the importance of features and the relationships between different features. Next, the features pass through the fully connected layers of the prediction module, which reduce their dimension. Finally, the probability of disease is predicted by a SoftMax layer. If the predicted probability of disease is greater than 0.5, the individual is predicted to have the disease; otherwise, the individual is predicted to be free of disease. The framework of the proposed TF-GDM model is shown in Fig. 2.

As shown in Fig. 2, TF-GDM mainly consists of two parts: an encoder module and a prediction module. The encoder module encodes the input data into features, and the prediction module predicts the disease from the encoded features. Because the value ranges of different features vary greatly, all dimensions of the input data must be normalized before being fed into the encoder to ensure the training performance of the network. The Transformer-based model is then used to predict gestational diabetes. Unlike the original Transformer, TF-GDM contains only the encoder module: a classification model needs only feature extraction, so the decoder is unnecessary. The proposed TF-GDM contains only two encoders, each consisting of a multi-head attention module and a feedforward module. The encoder module produces the encoded features, from which the prediction results are obtained through the output module. The output module contains several fully connected linear layers and a SoftMax layer. Finally, the predicted results are marked 1 for disease and 0 for non-disease. Note that TF-GDM is not sensitive to positional information in the input data, so no position encoding layer is introduced into the Transformer-based encoder module, and the dimensions of the input data can be arranged in any order.

## Encoder Module

Data is the basic unit of model processing, and each piece of data is represented by a vector. Therefore, after preprocessing, vectorization, and normalization, each piece of data is represented as \({\text{x}}_{\text{i}}\), and all data can be represented as a matrix \(\text{X}\), as shown in Eq. (5).

$$\text{X}={[{\text{x}}_{1},{\text{x}}_{2},{\text{x}}_{3},\ldots ,{\text{x}}_{\text{n}}]}^{\text{T}}\in {\mathbb{R}}^{\text{n}\times \text{d}} \tag{5}$$

Here, n is the number of input samples and d is the dimension of each sample. In this paper, d = 27 and n = 300. The collected medical data carry no positional information; that is, the model is not sensitive to position information. Therefore, the proposed model omits the position encoding layer and feeds the normalized data directly into the encoder module.

The network structure of the encoder module of TF-GDM is shown in Fig. 2. The entire encoder module contains two encoders. Each encoder consists of a multi-head attention layer and a feedforward neural network layer. The feedforward network has two fully connected linear layers and a ReLU activation function. In addition, a residual connection and layer normalization follow both the multi-head attention layer and the feedforward network layer. The detailed structure of the multi-head attention layer is shown in Fig. 3.

The self-attention mechanism is used to learn the relationships between different dimensions within the data. Since the various body indicators collected in the input data are correlated to some degree, we adopt the self-attention mechanism to learn these correlations and improve the prediction performance of the model.

In the self-attention layer, there are three vectors: the query vector Q, the key vector K, and the value vector V. They are obtained by multiplying the matrix X by three different weight matrices \({\text{W}}^{\text{Q}}\), \({\text{W}}^{\text{K}}\), and \({\text{W}}^{\text{V}}\), respectively:

$$\text{Q}=\text{X}\cdot {\text{W}}^{\text{Q}} \tag{6}$$

$$\text{K}=\text{X}\cdot {\text{W}}^{\text{K}} \tag{7}$$

$$\text{V}=\text{X}\cdot {\text{W}}^{\text{V}} \tag{8}$$

When obtaining self-attention information, the Q vector is first used to query all candidates, and each candidate has a pair of K and V vectors. The query process is the dot product between the Q vector and all candidates of K vectors to calculate the similarity. Then the dot product result is weighted to the corresponding V vector after scaling and SoftMax to obtain the final self-attention result. The process of the self-attention layer can be summarized as:

1) Calculate the similarity f between Q and each key \({\text{K}}_{\text{i}}\):

$$f(\text{Q},{\text{K}}_{\text{i}})=\text{Q}\cdot {\text{K}}_{\text{i}} \tag{9}$$

2) Apply the SoftMax operation to the similarities to obtain the weight coefficients \({\alpha }_{\text{i}}\):

$${\alpha }_{i}=\frac{{e}^{f(Q,{K}_{i})}}{{\sum }_{j=1}^{n}{e}^{f(Q,{K}_{j})}} \tag{10}$$

3) Weight all V vectors by the coefficients \({\alpha }_{\text{i}}\) and sum them to obtain the attention vector:

$$\text{Attention}(\text{Q},\text{K},\text{V})=\sum _{\text{i}=1}^{\text{n}}{\alpha }_{\text{i}}{\text{V}}_{\text{i}} \tag{11}$$

4) Scale the result of step 3) to improve the speed of convergence:

$$\text{Attention}(\text{Q},\text{K},\text{V})=\frac{\text{Attention}(\text{Q},\text{K},\text{V})}{\sqrt{\text{d}}} \tag{12}$$

The schematic process of the self-attention layer is shown in Fig. 3 (a).
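The four steps can be sketched for a single query vector in plain Python. The function name `attention` and the toy vectors are illustrative assumptions, and d is taken here to be the length of the query vector. The sketch follows the order of the steps as described in this section, with the scaling applied after the weighted sum; the original Transformer formulation instead scales the dot products before the SoftMax.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(q, keys, values):
    """Attention output for one query q over lists of key and value vectors."""
    d = len(q)
    # Step 1: similarity between the query and every key.
    scores = [dot(q, k) for k in keys]
    # Step 2: SoftMax over the similarities (shifted by the max for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum of the value vectors.
    summed = [sum(w * v[c] for w, v in zip(weights, values))
              for c in range(len(values[0]))]
    # Step 4: scale the result by sqrt(d).
    return [s / math.sqrt(d) for s in summed], weights

# Toy example with two candidates (hypothetical numbers).
out, weights = attention([1.0, 0.0],
                         keys=[[1.0, 0.0], [0.0, 1.0]],
                         values=[[1.0, 0.0], [0.0, 1.0]])
print(weights, out)
```

The first key matches the query, so it receives the larger weight and dominates the output vector.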

To further improve the performance of self-attention, a multi-head attention mechanism is used. The schematic diagram of multi-head attention is shown in Fig. 3(b). In this paper, the number of attention heads h is set to 3. The input data X is fed into three self-attention layers separately, yielding three weighted matrices \({\text{Z}}_{\text{i}}\ (\text{i}=1,2,3)\). The complete matrix \(\text{Z}\) is then obtained by concatenating the \({\text{Z}}_{\text{i}}\) along the column direction: \(\text{Z}=({\text{Z}}_{1},{\text{Z}}_{2},{\text{Z}}_{3})\). A fully connected linear layer is applied to \(\text{Z}\) to produce the output of multi-head attention. Finally, a residual connection and layer normalization are applied to produce the output of the multi-head self-attention layer.
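The concatenation of heads can be sketched as follows in plain Python. This is an illustration under stated assumptions, not the trained model: the weights are random, the per-head width `d_head = 9` (27 split over 3 heads) is an assumed choice, biases are omitted, and the residual connection and layer normalization are left out to keep the example short.

```python
import math
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """One attention head: project X, weight V by SoftMax(Q K^T), then scale."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax(r) for r in scores]
    Z = matmul(weights, V)
    scale = math.sqrt(len(Q[0]))  # assumed: d is the per-head width
    return [[z / scale for z in row] for row in Z]

def multi_head(X, heads, Wo):
    """Run every head, concatenate the Z_i column-wise, apply the output layer."""
    Zs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    Z = [[z for Zi in Zs for z in Zi[r]] for r in range(len(X))]
    return matmul(Z, Wo)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, h = 4, 27, 3          # 4 toy samples, 27 features, 3 heads
d_head = d // h             # assumed per-head width
X = random_matrix(n, d, rng)
heads = [tuple(random_matrix(d, d_head, rng) for _ in range(3))
         for _ in range(h)]
Wo = random_matrix(h * d_head, d, rng)
out = multi_head(X, heads, Wo)
print(len(out), len(out[0]))
```

Because the output layer maps the concatenated heads back to d columns, the result has the same shape as X, which is what allows the residual connection to be added afterwards.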

The multi-head self-attention layer is the core of the whole model and can significantly improve the performance of self-attention. It also provides the attention layer with multiple "representation subspaces".

The main features of the TF-GDM are output by the feedforward network. In the attention module, the data of each dimension can be transmitted separately in the network through the attention values. However, in the feedforward network, the data of all dimensions are collected and transformed into features in parallel as the output of the encoder. The structure of the feedforward neural network is FC-ReLU-Dropout-FC-Dropout. Similarly, residual connection and Layer Normalization are performed at the end of the feedforward network.
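A minimal sketch of this block for a single feature vector is shown below, written in plain Python at inference time, where dropout acts as the identity. The weight shapes and values are hypothetical, and the learnable gain and bias of layer normalization are omitted for brevity.

```python
import math

def linear(x, W, b):
    """Fully connected layer: W is stored as rows of shape (in, out)."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

def relu(x):
    return [max(0.0, v) for v in x]

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward_block(x, W1, b1, W2, b2):
    """FC-ReLU-Dropout-FC-Dropout (dropout is the identity at inference),
    followed by the residual connection and layer normalization."""
    hidden = relu(linear(x, W1, b1))
    out = linear(hidden, W2, b2)
    return layer_norm([xi + oi for xi, oi in zip(x, out)])

# Toy example: 3 input features, hidden width 2 (hypothetical weights).
x = [1.0, 2.0, 3.0]
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b1 = [0.0, 0.1]
W2 = [[0.2, 0.0, -0.1], [0.3, 0.1, 0.0]]
b2 = [0.0, 0.0, 0.0]
y = feed_forward_block(x, W1, b1, W2, b2)
print(y)
```

The second linear layer maps back to the input width, so the residual sum is well defined and the block can be stacked.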

The output of the feedforward network will be used as the input of the multi-head attention of the next Encoder. The structure of the next Encoder is the same as the first one.

## Prediction Module

The prediction module converts the high-dimensional features into a two-dimensional vector that represents the prediction as probability values. It takes the output of the encoder module as input and consists of three fully connected linear layers and a SoftMax layer. Through these fully connected layers, the dimension of the features is reduced gradually, which avoids the loss of prediction performance that a sudden drop in dimension would cause. The detailed structure is shown in Fig. 4. The first fully connected layer is followed by a dropout layer, a ReLU activation layer, and another dropout layer. The features are then projected into a two-dimensional prediction vector O by two fully connected linear layers. The dropout layers are adopted to prevent the model from overfitting during training, and the dropout rate is set to 0.5.

Finally, a SoftMax layer is applied to turn the prediction vector into normalized probability values:

$${p}_{i}=\frac{{e}^{{O}_{i}}}{{\sum }_{j=1}^{2}{e}^{{O}_{j}}} \tag{13}$$

The class with the higher probability is taken as the final prediction result.
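The final step can be sketched in a few lines of Python. The function names and the convention that index 1 of the prediction vector corresponds to the disease class are assumptions for illustration; the section only states that disease is labeled 1 and non-disease 0.

```python
import math

def softmax(o):
    """Normalized probabilities for a prediction vector o."""
    m = max(o)  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def predict(o):
    """Return (label, probabilities); index 1 is assumed to be the disease class."""
    p = softmax(o)
    label = 1 if p[1] > 0.5 else 0
    return label, p

# Toy prediction vector (hypothetical values).
label, p = predict([0.0, math.log(3.0)])
print(label, p)
```

With two classes, choosing the class with the higher probability is equivalent to thresholding the disease probability at 0.5.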

## Loss Function

Due to the unbalanced distribution of the data, a binary balanced cross-entropy loss [20] is used as the loss function of our TF-GDM model. Compared with binary cross-entropy, the binary balanced cross-entropy introduces a balance parameter to balance the positive and negative samples. The loss is defined as:

$$\text{L}=\frac{1}{\text{N}}\left({\sum }_{{\text{y}}_{\text{i}}=1}^{\text{m}}-\alpha \log \left({\widehat{\text{p}}}_{\text{i}}\right)+{\sum }_{{\text{y}}_{\text{i}}=0}^{\text{n}}-(1-\alpha )\log \left(1-{\widehat{\text{p}}}_{\text{i}}\right)\right) \tag{14}$$

Here, α is the weight parameter, and m and n are the numbers of positive and negative samples, respectively. The weight can be calculated from \(\frac{{\alpha }}{1-{\alpha }}=\frac{\text{n}}{\text{m}}\). In this paper, \({\alpha }\) is set to 0.1.

Through the iterative training of the model, the training loss is continuously reduced. Finally, when the loss no longer decreases and reaches a balance, the model converges and the training is stopped.
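The loss of Eq. (14) can be sketched in plain Python as follows. The function name `balanced_bce` is illustrative, N is assumed to be the total number of samples, and the probabilities are clipped by a small `eps` (an added safeguard, not part of Eq. (14)) to avoid taking the logarithm of zero.

```python
import math

def balanced_bce(y_true, p_hat, alpha, eps=1e-12):
    """Binary balanced cross-entropy averaged over all samples.

    y_true: labels in {0, 1}; p_hat: predicted disease probabilities.
    """
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)  # clip for numerical safety
        if y == 1:
            total += -alpha * math.log(p)                 # positive term
        else:
            total += -(1.0 - alpha) * math.log(1.0 - p)   # negative term
    return total / len(y_true)  # assumed: N = number of samples

# Toy labels and probabilities (hypothetical values), with alpha = 0.1.
loss = balanced_bce([1, 0, 0, 1], [0.8, 0.3, 0.1, 0.6], alpha=0.1)
print(round(loss, 4))
```

Setting `alpha` to 0.5 weights both classes equally, which reduces the expression to half of the ordinary binary cross-entropy.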

## Evaluation of Model

In this paper, Accuracy, Precision, Recall, F1 Score, and ROC curve are selected for model evaluation. Accuracy is the ratio of correct predictions out of all predictions made by an algorithm.

$$\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \tag{15}$$

Precision is the ratio of true positives over the sum of predicted positives.

$$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}} \tag{16}$$

Recall is the ratio of true positives to all actual positives.

$$\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}} \tag{17}$$

Here, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The F1 score combines Precision and Recall into a single metric that ranges from 0 to 1.

$$\text{F}1=\frac{2\left(\text{Precision}\times \text{Recall}\right)}{\text{Precision}+\text{Recall}} \tag{18}$$
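As a worked illustration, the four metrics can be computed from predicted and true labels as in the sketch below. The function name and the toy labels are assumptions; returning 0 when a denominator is zero is a convention chosen here, not something specified in this section.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, Precision, Recall and F1 for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Convention (assumed): a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (hypothetical): TP = 2, FN = 1, FP = 1, TN = 4.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case Accuracy is 6/8 = 0.75, while Precision, Recall, and F1 are all 2/3, which shows how accuracy can look better than the positive-class metrics when negatives dominate.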

The ROC (Receiver Operating Characteristic) curve is a robust model evaluation criterion that measures the performance of a classification model by plotting the true positive rate against the false positive rate. The AUC (Area Under the ROC Curve) is the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one. The closer the AUC is to 1, or the closer the ROC curve is to the upper left corner, the better the model performs [21].
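The probabilistic reading of the AUC can be turned directly into a small computation, sketched here in plain Python. The function name `roc_auc` and the toy scores are illustrative, and counting tied scores as one half is the usual convention, assumed here. The pairwise loop is quadratic in the number of samples, so it is meant as an illustration rather than an efficient implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5 (assumed convention).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores (hypothetical): 3 of the 4 positive-negative pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)
```

Here the positive with score 0.4 is ranked below the negative with score 0.5, so one of the four pairs is misordered and the AUC is 0.75.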