Towards Accurate State of Charge Estimation for Lithium-ion Batteries using Self-supervised Transformer Model: A Deep Learning Approach

Abstract
Accurate state of charge (SOC) estimation of lithium-ion (Li-ion) batteries is crucial in prolonging cell lifespan and ensuring safe operation in electric vehicle applications. In this article, we propose a deep learning-based transformer model trained with self-supervised learning (SSL) for end-to-end SOC estimation without the requirement of feature engineering or adaptive filtering. We demonstrate that, with the SSL framework, the proposed deep learning-enabled transformer model achieves the lowest root-mean-square error (RMSE) of 1.2% and a mean absolute error (MAE) of 0.7% on the test dataset at various ambient temperatures. With SSL, the proposed model can be trained with as few as 5 epochs using only 20% of the total training data and still achieve less than 1.9% RMSE on the test data. Finally, we also demonstrate that the weights learned during SSL training can be transferred to a new Li-ion cell with a different chemistry and still achieve performance on par with models trained from scratch on the new cell.


Introduction
The transportation and electricity production sectors account for more than 50% of total greenhouse gas emissions 1 as both rely on fossil fuels as the energy source. Promising solutions are the electrification of the transportation industry and the decarbonization of electrical grids 2,3 . However, the mass adoption of electric vehicles and renewable energy remains low due to the high adoption cost, which can be attributed largely to the Li-ion batteries 4 . A major challenge in Li-ion battery research is the state of charge (SOC) estimation, which signifies the amount of charge left in a Li-ion battery cell 5 . Accurate SOC estimation allows Li-ion battery cells to be used to their maximum potential before disposal, resulting in tremendous savings in manufacturing and adoption costs 6 . Nevertheless, SOC is notoriously hard to quantify, as it cannot be practically measured outside laboratory environments with existing sensor technologies 7 .
The two most common approaches used in SOC estimation are the model-based and data-driven approaches 8 . The model-based approach leverages an in-depth understanding of domain knowledge, such as the internal chemical reactions in the cell and the electrical properties of the components used to model them, together with complex mathematical equations to model the SOC 9 . Prominent model-based techniques include the Sliding Mode Observer 10 , the Luenberger Observer 11 and Kalman filters 12 . While the model-based approach can result in reliable and accurate models, it requires extensive domain knowledge, rigorous feature engineering, and a relatively long development time 13 . Apart from that, the model-based approach does not scale well across battery cells of different chemistries; as a result, alterations in cell chemistry require a re-development of the model 14 . Additionally, the model-based approach does not account for anomalies in cells such as manufacturing inconsistencies, unpredictable operating conditions, cell degradation, and so forth 15 . Due to these shortcomings, more researchers are shifting their attention to the data-driven approach for SOC estimation. In this approach, the SOC is directly modeled from observable signals such as the voltage, current and temperature of the Li-ion battery cell, sampled over diverse operating conditions across different cell chemistries and manufacturers 16 . One method that has been gaining traction lately is the data-driven technique known as deep learning (DL) 17 . DL has great potential for SOC estimation due to its powerful capability to learn any function given the right data, according to the universal approximation theorem 18 . In essence, DL can be used to directly approximate the relationship between the measurable cell signals (voltage, current, temperature) and the SOC with no additional processing such as adaptive filters 19 . This eliminates the need for manual feature engineering, which can take a considerable amount of time and expert domain knowledge, while still producing accurate SOC estimation results. Pioneering works by the authors in 20 and Chemali et al. 21 introduce the long short-term memory (LSTM) network and the deep neural network (DNN) to directly estimate SOC from cell voltage, current and temperature with no additional filters. The proposed model achieved the lowest MAE of 2.17% over a varying ambient temperature dataset. The authors in 22,23 proposed the gated recurrent unit (GRU) model to directly estimate SOC over a wide ambient temperature range, with an RMSE of 3.5% under untrained temperatures. Other authors proposed deep convolutional models, such as in [24][25][26], with approximately less than 3% MAE on untrained data. There are also works hybridizing convolutional and recurrent models, such as 27,28 , with under 2% RMSE at varying ambient temperature.
However, several research gaps exist in the current DL methods for SOC estimation. Firstly, all the cited works use the supervised learning (SL) scheme to train the models, which is known to require massive amounts of data 29 . Even in scenarios where adequate data is available, the training of DL models typically takes many hours or days to complete 20 . Secondly, models that are trained on one cell chemistry do not transfer to other cell chemistries. Even though preliminary work indicates that transfer learning is possible 24 , further tests are still required to verify whether it applies to more cells with differing chemistries. In most cases, a model that is trained on one Li-ion battery cell's data does not generalize well to another cell and may require re-training of the model from scratch. Thirdly, most DL models use recurrent architectures, which work well with sequence data such as the SOC but are hard and slow to train 30 . Recurrent models also do not leverage the parallel GPU computation that could significantly improve training time 31 . Lastly, even though recurrent architectures such as the LSTM or GRU can handle long sequences fairly well, they are still susceptible to vanishing gradients for longer sequences 32 . Due to these limitations, recurrent models have generally been superseded by another architecture, known as the Transformer, in many domains such as computer vision 33,34 and natural language processing 35,36 .
In this article, we introduce a new DL architecture for SOC estimation known as the Transformer. Apart from that, we propose a training framework that leverages self-supervised learning (SSL) 29,37 and makes it possible to train the Transformer on scarce amounts of Li-ion data in a short time while achieving higher estimation accuracy compared to models trained with the conventional fully-supervised method. With the proposed framework, we demonstrate that the learned parameters from one cell can be transferred to another by fine-tuning the model on little data with a very short training time (approximately 30 minutes on a GPU). The proposed framework also incorporates various recent DL techniques, such as the Ranger optimizer with a learning rate finder, time series data augmentation, and the Log-Cosh loss function, to boost the accuracy of the Transformer. Finally, we conclude the study by comparing and validating the performance of our proposed model against other recent DL architectures for SOC estimation. The key contributions of this study are as follows:
• We introduce the Transformer DL architecture for end-to-end SOC estimation with no feature engineering or adaptive filtering.
• We propose the SSL training framework to train the proposed architecture in a very short amount of time and achieve improved estimation accuracy compared to the conventional training framework.
• The proposed model's parameters are transferable to a different cell type and require only five training epochs to achieve RMSE ≤ 2.5%.
• The SSL training framework enables the proposed model to be re-trained with as few as 20% of the total training data and still achieve lower error compared to models that do not use the SSL framework.
• We evaluate and validate the performance of the proposed model against other state-of-the-art DL architectures for SOC estimation.

Results
In the first subsection, we highlight the estimation accuracy of the proposed model trained at room temperature and at varying ambient temperatures and compare its estimation robustness against other DL models. In the subsequent subsection, we study the influence of pre-training on the model and show that the pre-trained model outperforms the non-pre-trained models in estimation accuracy and convergence time. Additionally, we highlight that the learned weights of the pre-trained model can be transferred to perform estimation on a cell with a different chemistry after a short re-training or fine-tuning.
Estimation accuracy under constant ambient temperature. In this section, we demonstrate the estimation accuracy and efficacy of the proposed model based on data sampled at room temperature. The error metrics of all models are tabulated in Table 1, sorted in ascending order of the test error. The proposed model achieves RMSEs of 1.1087%, 0.8661% and 0.9056% and MAEs of 0.3289%, 0.4059% and 0.4459% on the training, validation and test sets, respectively. It is observed that the proposed model outperforms all other models in both RMSE and MAE on the test dataset, as tabulated in Table 1.
For comparison, a baseline Transformer model trained using the conventional training framework is included. Results indicate that the baseline model scores poorly compared to all other models except the DNN, suggesting that the training framework plays a pivotal role in the robust performance of the proposed model. We observe that the recurrent models (GRU and LSTM) outperformed their convolutional and hybrid counterparts, which is not surprising as recurrent models are specifically designed to handle sequence data. Both the GRU and LSTM were configured with a single hidden layer of 100 neurons. Despite the compromise in estimation accuracy, convolutional models may be advantageous in training complexity compared to the recurrent models, as they are relatively easier to train and can better utilize GPU parallelization. The Resnet model is based on the Residual Network architecture 38 adapted to work with sequential data 39 . InceptionTime is based on the implementation in 40 , which has shown superior benchmark performance on sequential data. ResCNN consists of a convolutional neural network with residual blocks. FCN consists of only convolution operations with no pooling and has shown promising performance in 41 . GRU-FCN and LSTM-FCN are hybrid models combining recurrent and convolutional models to obtain the advantages of both. The estimation plot on the test dataset, consisting of the US06, LA92 and UDDS drive cycles, is shown in Fig. 1.
Estimation accuracy under varying ambient temperatures. In this subsection, we compare the estimation accuracy of the proposed model to other widely used DL models of various architectures. Among all DL architectures compared in the study, the proposed transformer model achieved the lowest RMSE of 1.1075%, 1.3139% and 1.1914% and MAE of 0.4441%, 0.5680% and 0.6502% on the test drive cycles, outperforming even the recurrent models which have been widely used for SOC estimation, as shown in Table 2. We also note that the convolutional models such as the Resnet

Influence of Pre-training
Size of Training Data. In this section, we investigate the influence of unsupervised pre-training on the amount of data required to train the proposed model to a low error rate. We divided our experiments into three scenarios.
In the first scenario, we pre-trained and re-trained/fine-tuned the model with all (100%) of the available training data. The second and third scenarios used 50% and 20% of the training data, respectively. In all scenarios, we recorded the training time and error metrics. All training in this section was performed for only 5 epochs to illustrate the short amount of training time required to achieve a low error rate. There are four modes of training used in this setup, namely pre-training (PT), re-training (RT), fine-tuning (FT), and full training (T). In PT, the model was trained on an unlabeled dataset with unsupervised learning. In RT, the model was trained on a labeled dataset with supervised learning. In FT, the model was frozen except for the last layer and trained with supervised learning. In T, the model was trained from scratch with supervised learning. Table 3 shows the results obtained. We observe that when we pre-trained and re-trained the model on all available data (row 1), the error metric is lower compared to the model that was not pre-trained (row 3) at approximately the same duration of training time. We observe that when we pre-trained and fine-tuned the model (row 2), even though the model only updates the weights of the final layer, it still scores a respectable 2.01% RMSE with a reduction of about 10 minutes in training time compared to the previous two modes. The effect of pre-training is even more pronounced in the third scenario, where we re-train and fine-tune the models on only 20% of the available data. At approximately the same amount of training time, the pre-trained model (row 4) scores lower than the non-pre-trained model (row 6). In this section, we showed that pre-training helps in reducing the test error at approximately the same amount of training time, especially when data is scarce.
Transfer learning across cell types. We divided our test into three testing scenarios. In the first scenario, we pre-trained the model on the LG cell and tested the model on the Panasonic cell with 100% of the training data, as shown in the first three rows. In the second scenario, we performed the same routine as the first but with a reduced amount of training data; only 20% of the total data was used in this test group. In the third scenario, we pre-trained and re-trained the model on the same cell type with all available data. Unsurprisingly, the best performing mode is when the model was pre-trained and re-trained on the Panasonic cell, and the worst performing mode is when the model was pre-trained and re-trained on the LG cell. However, when the model was pre-trained on the LG cell and re-trained on the Panasonic cell, the test error rate is almost on par with the best performing mode. This suggests that pre-training helps in downstream re-training despite the difference in cell type. In the scenario where the model was trained on less data (20% of the training data), we observe that without pre-training (row 6) the model yields high test errors. In rows 4 and 5, we observe that the error rate is reduced by pre-training the model, even on a different cell. This is further evidence that pre-training contributes to minimizing the test set error regardless of the cell type. Supplementary Fig. 2 illustrates the estimation of the worst performing mode. Despite being trained on a different cell type, the model can still capture the trend of the ground truth SOC value.
With pre-training on the LG cell and re-training on the Panasonic cell, the model can estimate the SOC more accurately, as shown in Fig. 4.
In this section, we showed that the weights of the model learned during the unsupervised pre-training phase can be re-used in re-training or fine-tuning across different cell types. This opens up possibilities for transfer learning, which is extremely helpful especially when data and computational resources are scarce. On a side note, all re-training and fine-tuning in this and the previous section was performed for only 5 epochs to showcase the learning capability of the model despite the small number of training epochs. Re-training and fine-tuning the model for more epochs will likely yield better performance.

Discussion
In this work, a data-efficient Transformer-based SOC estimation model, in combination with the SSL framework, is developed to address the challenges of Li-ion cell data availability, transfer learning, training speed and model accuracy.
We showed that the proposed model achieves the lowest RMSE and MAE on the test set at various ambient temperature settings, and that the proposed technique enables the model to be trained in a very short amount of time. The first contribution of this work is the introduction of a novel Transformer DL architecture capable of accurately estimating the SOC of a Li-ion cell under constant and varying ambient temperatures. Based on the provided dataset, the model can accurately estimate the SOC with RMSE ≤ 1.19% and MAE ≤ 0.65% (at varying ambient temperatures) and RMSE ≤ 0.9% and MAE ≤ 0.44% (at constant ambient temperature), with no feature engineering or filtering of any kind. This also shows that the transformer can self-learn the parameters needed to map the voltage, current and temperature data directly to SOC. The second contribution is the self-supervised learning (SSL) training scheme used to effectively train the proposed model. Even though the conventional supervised learning (SL) scheme can train the proposed model to a reasonably low error (RMSE ≤ 1.63% at varying ambient temperatures), this work highlights that the SSL training framework further reduces the error rate (RMSE ≤ 1.42% at varying ambient temperatures) at approximately the same amount of training time. This suggests that the proposed SSL framework contributes to lowering the RMSE.
The third contribution of this work is to demonstrate that the weights of the encoder layers of the transformer, learned during the unsupervised pre-training phase, can be readily transferred to another cell type of a different chemistry. Additionally, with only five epochs of re-training, the model can achieve RMSE ≤ 2.5% on the test set. Extending the training time likely leads to a further reduction in RMSE. However, this work shows that even with lightweight re-training, the weights transferred from pre-training enable a short training time with significantly less data. This opens the possibility of adapting the Transformer model to other types of Li-ion cells with only a fraction of the training data.
The fourth contribution highlights that the SSL training framework enables the proposed model to be re-trained with as few as 20% of the total training data and still achieve lower error compared to models that do not use the SSL framework. Despite the reduced amount of training data, the model still generalizes well across different cell chemistries. This further accentuates the important role of unsupervised pre-training in making the model more data efficient.
The fifth contribution compares and validates the performance of the model against recent state-of-the-art DL models for SOC estimation. It is shown that the model clearly outperforms all other models in the RMSE and MAE metrics on the test dataset. Given the efficacy of the model in SOC estimation accuracy, transfer learning capability, data efficiency and training speed, the proposed Transformer model and framework is evidently advantageous over other DL models.

Methods
Dataset. In this study, we utilized raw data sampled from a brand-new cylindrical 18650 LiNiMnCoO2 cell by LG, made available by McMaster University in Hamilton, Ontario, Canada 45 . The specification of the cell is given in Supplementary Table 1. The data was collected by subjecting the battery cell to various EV standard drive cycles, such as UDDS, HWFET, LA92 and US06, at ambient temperatures ranging from -20 °C to 40 °C. In addition, to simulate the dynamics of driving conditions, the cell was also subjected to a random mix of the standard drive cycles. The division of the train, validation and test datasets used in this study is specified in Supplementary Table 3. Fig. 5(a) illustrates a sample plot of the UDDS drive cycle from the test dataset at -20 °C ambient temperature. For DL models to work well, careful consideration is put into preprocessing the raw data samples. Firstly, the raw data is normalized into a range of 0 to 1 using the min-max normalization in Equation 1:

x_norm = (x − x_min) / (x_max − x_min)    (1)

where x_min and x_max are the minimum and maximum values of the respective signal.
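As a minimal illustration of Eq. 1 (a sketch, not the authors' code; the array layout with one column per signal is an assumption):

```python
import numpy as np

def min_max_normalize(data: np.ndarray) -> np.ndarray:
    """Scale each column (voltage, current, temperature) into [0, 1] (Eq. 1)."""
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    return (data - col_min) / (col_max - col_min)
```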
Next, the raw data is divided into three separate sets, namely the train, validation, and test sets. The train set is used to train the model, the validation set to check the generalization of the model during training, and the test set is used only to evaluate the model at the end of training. The division of the train, validation and test sets is tabulated in Supplementary Table 3.
Supervised learning requires the dataset to be formatted into input-label pairs. In this study, the input is the normalized values of voltage, current, and temperature, while the label corresponds to the SOC value. The input-label pairs were constructed by running a sliding window across the time axis over the train, validation, and test sets, as illustrated in Fig. 5. Voltage, current and temperature values that reside in the green window correspond to the input, and the SOC value that resides in the purple window corresponds to the label. Note that the width of the window, k, is set to 400 timesteps.
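The following sketch shows one way to build such input-label pairs with a sliding window (a NumPy-based illustration, not the authors' code; taking the label as the SOC at the final timestep of each window is an assumption):

```python
import numpy as np

def make_windows(signals: np.ndarray, soc: np.ndarray, k: int = 400):
    """Slide a window of width k over the normalized [V, I, T] signals.

    signals: (n_timesteps, 3) normalized voltage/current/temperature.
    soc:     (n_timesteps,) ground-truth SOC labels.
    Returns inputs of shape (n_windows, k, 3) and labels of shape (n_windows,).
    """
    inputs, labels = [], []
    for t in range(k, len(signals) + 1):
        inputs.append(signals[t - k:t])   # green window: model input
        labels.append(soc[t - 1])         # purple window: SOC label
    return np.stack(inputs), np.array(labels)
```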
Having massive amounts of data is a crucial component in training the proposed model, without which the model will fail to generalize well 46 . Although the raw data already consists of hundreds of thousands of timesteps, it is still insufficient for the proposed model to work well. Hence, the original dataset was augmented in various ways by injecting random noise (µ = 0, σ = 0.33) into the training and validation datasets. The types of noise injected include additive and multiplicative Gaussian noise on the magnitude and random frequency noise generated with the wavelet decomposition method. Fig. 5(b) shows a sample of the original and the augmented version of the plot.
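A minimal sketch of the additive and multiplicative Gaussian noise augmentations (σ = 0.33 is from the text; the wavelet-based frequency noise is omitted here, and applying each noise independently per sample is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, mirroring the preset-seed setup

def add_gaussian_noise(x: np.ndarray, sigma: float = 0.33) -> np.ndarray:
    """Additive Gaussian noise (mu = 0) on the signal magnitude."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def mul_gaussian_noise(x: np.ndarray, sigma: float = 0.33) -> np.ndarray:
    """Multiplicative Gaussian noise: scales each point by 1 + N(0, sigma)."""
    return x * (1.0 + rng.normal(0.0, sigma, size=x.shape))
```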
Transformer Model Architecture. The original Transformer proposed in 47 uses an encoder-decoder arrangement. The model proposed in this work adopts only the encoder portion and not the decoder, as detailed in 48 , to work better with multivariate time-series data. Fig. 6 illustrates the block diagrams of the proposed model depending on the training stage, which will be detailed in the next subsection. Observe that Stage 1 and Stage 2 share several common blocks, namely the input, positional encoding, encoder stack, and linear layer. The input data to the model consists of the input vector X = [V_k, I_k, T_k] representing the cell voltage, current and temperature at timestep k. To give the input data contextual information, the input vector is added to the positional encoding. There are various choices of positional encodings according to 49 . In this work, we use the sinusoidal positional encoding given by the sine and cosine functions in Eq. 2 and Eq. 3:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (2)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (3)

where pos and i correspond to the position and dimension, respectively. The reasoning behind using both functions has been previously detailed in 47 . Once tagged with the positional encodings, the input vector is passed through a series of encoder blocks. Core to the Transformer architecture is the multi-headed self-attention (MHSA) module inside the encoder block. As shown in the figure, the inputs to the MHSA module are the key, K, value, V, and query, Q. The MHSA block maps a query and a set of key-value pairs to an output, producing the attention matrix. The operation consists of a dot product of the query Q with all keys, a division by √d_k, and a softmax over the result, as given in Eq. 4, where d_k is the dimension of the keys:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (4)

Training Framework. Instantiating a DL model involves various stochastic processes. To ensure the reproducibility and consistency of the results obtained, all experiments were conducted using a preset seed value. Referring to Fig. 6, model training was divided into two distinct phases, namely the unsupervised pre-training and the downstream fine-tuning. In the unsupervised pre-training stage 50 , unlabeled vectors of the input sequence X were used to train the model. Part of each input sequence was randomly set to 0 by performing element-wise multiplication with a binary mask M, producing the corrupted input X̃ = M ⊙ X. The model was then required to reconstruct the masked input with a modified MSE loss function, as given in Equation 6:

L_MSE = (1/|M|) Σ_{t ∈ M} (x̂_t − x_t)²    (6)

where x̂_t is the predicted input vector value and x_t is the un-corrupted input vector value. Note that the loss does not require the model to reconstruct the entire input sequence, but only the elements in the mask M. Upon completion of the unsupervised pre-training phase, the saved weights of the model were transferred to the downstream fine-tuning phase. In this phase, the model was re-trained on a labeled dataset with supervised learning. The loss function used in this phase is the hyperbolic cosine (Log-cosh) loss function, as given in Equation 7:

L_Log-cosh = (1/N) Σ_{i=1}^{N} log(cosh(ŷ_i − y_i))    (7)

where y is the ground truth and ŷ is the value predicted by the model.
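A compact PyTorch sketch of the two loss functions (the masked-MSE form follows Eq. 6 as reconstructed above; treating mask values of 0 as the corrupted positions is an assumption):

```python
import torch

def masked_mse_loss(x_hat: torch.Tensor, x: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    """Modified MSE (Eq. 6): penalize reconstruction only where mask == 0."""
    masked = mask == 0                     # positions that were zeroed out
    diff = (x_hat - x)[masked]
    return (diff ** 2).mean()

def log_cosh_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Log-cosh loss (Eq. 7), used in the downstream fine-tuning phase."""
    return torch.log(torch.cosh(y_hat - y)).mean()
```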
One of the most important hyperparameters used in training DL models is the learning rate (LR), α 51 . To search for the optimal LR range, we employ the LR finder introduced in 52 . The optimal LR found with the LR finder is α = 1×10⁻³, as depicted in Supplementary Fig. 1. The LR value was used in conjunction with the Ranger optimizer, which is a synergistic combination of Rectified Adam (RAdam) 53 and the Lookahead optimizer 54 . RAdam has been shown to stabilize the training at the start, and Lookahead stabilizes the convergence in the remaining steps 55 . The Ranger optimizer is configured with momentum = 0.95, weight decay = 0.01 and ε = 1×10⁻⁶. This combination has been shown to achieve state-of-the-art results on many datasets [55][56][57] . As the training approaches the end, the LR is decayed to a lower value to further facilitate convergence to a global minimum 58 . The LR is decayed for each batch with cosine annealing and warm restarts:

η_t = η_min + (1/2)(η_max − η_min)(1 + cos(π T_current / T_i))

where η_max and η_min are the maximum and minimum LR values, T_current is the number of epochs since the last restart, and T_i is the number of epochs between restarts. The LR values throughout the training are shown in the corresponding figure. The training hyperparameter values of the proposed Transformer model are concisely summarized in Supplementary Table 5.
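A sketch of this optimizer and schedule setup (since Ranger is not in core PyTorch, this uses torch.optim.RAdam — available in recent PyTorch versions — as a stand-in, with the Lookahead wrapper omitted; mapping momentum = 0.95 to the first beta, and T_0 = 5, are assumptions):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(3, 1)  # placeholder for the Transformer model

# RAdam core of Ranger; the Lookahead wrapper is omitted in this sketch.
optimizer = torch.optim.RAdam(
    model.parameters(),
    lr=1e-3,                 # alpha from the LR finder (pre-training)
    betas=(0.95, 0.999),     # momentum = 0.95 mapped to beta1 (assumption)
    eps=1e-6,
    weight_decay=0.01,
)

# Cosine annealing with warm restarts, stepped per batch.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=5, eta_min=1e-5)

batches_per_epoch = 100                    # stand-in for the real data loader
for epoch in range(5):
    for batch in range(batches_per_epoch):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 3)).pow(2).mean()  # dummy loss
        loss.backward()
        optimizer.step()
        scheduler.step(epoch + batch / batches_per_epoch)  # per-batch decay
```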
Implementation. All models studied were trained on an Ubuntu 20.04.2 LTS Linux operating system with an Intel Core i7-4790K CPU at 4.00 GHz, 32 GB of RAM and an Nvidia GeForce RTX 3090 graphics processing unit. All DL models were built using the open source PyTorch 1.7.1 59 framework in tandem with the TSAI library 60 . The implementation of the proposed Transformer model and SSL training framework was divided into several steps. In Step 1, two distinct datasets from the LG LiNiMnCoO2 cell (Supplementary Table 1) and the Panasonic LiNiCoAlO2 cell (Supplementary Table 2) were downloaded. Both datasets consist of data sampled from the respective cells over a diverse range of temperatures and drive cycles to simulate dynamic operating conditions, as elaborated in Section 3.1. The datasets were divided into train, validation and test sets as shown in Supplementary Table 3. Next, the data was normalized into the appropriate range (0 to 1) and pre-processed with a sliding window of lag k = 400 timesteps (Fig. 5(b)). The sliding window lag value k was selected due to the limits of our computing resources. Given more computational resources, k can be made larger to allow the model to consider more contextual information from the past. At this point the dataset was also augmented by injecting additive and multiplicative Gaussian noise. Finally, the dataset is transformed into the positional encoding form shown in Fig. 5(c). This is the format of the data that is expected by the transformer model.

In Step 2, the Transformer was configured using the appropriate model hyperparameters, as detailed in Supplementary Table 4. Careful attention was placed on the dropout hyperparameter value as it largely influences the degree of overfitting on the dataset. We found that setting the dropout in the feedforward layer to p = 0.2 and the dropout in the residual layer to p = 0.1 works well in our experiments.
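For illustration, a minimal encoder-only model of this kind can be assembled from standard PyTorch modules as below (a sketch under stated assumptions, not the authors' TSAI-based implementation; all layer sizes besides the two dropout values are placeholders, and mapping the two dropout values onto nn.TransformerEncoderLayer is an approximation):

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sine/cosine positional encoding (Eqs. 2-3)."""
    def __init__(self, d_model: int, max_len: int = 400):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class SOCTransformer(nn.Module):
    """Encoder-only Transformer mapping [V, I, T] windows to SOC."""
    def __init__(self, d_model=64, n_heads=4, n_layers=3, d_ff=128):
        super().__init__()
        self.embed = nn.Linear(3, d_model)     # project V, I, T to d_model
        self.pos = SinusoidalPositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff,
            dropout=0.1,                       # residual-path dropout p = 0.1
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ff_dropout = nn.Dropout(0.2)      # feedforward dropout p = 0.2
        self.head = nn.Linear(d_model, 1)      # linear layer -> SOC estimate

    def forward(self, x):                      # x: (batch, k=400, 3)
        z = self.pos(self.embed(x))
        z = self.encoder(z)
        return self.head(self.ff_dropout(z[:, -1]))  # SOC at last timestep
```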

In Step 3, the model was ready to be trained. As described in Section 3.3, the model was trained in two distinct stages: unsupervised pre-training followed by downstream re-training. For Section 2.1, the model was trained on the LG dataset and evaluated on its estimation accuracy at fixed and variable ambient temperature settings. For Section 2.2, the model was trained on the LG dataset and tested on its estimation accuracy on the Panasonic dataset. In this step, the training hyperparameters were configured as detailed in Supplementary Table 5. Careful attention was placed on setting the LR in both training phases as it largely determines the performance and convergence to a global minimum. As detailed in Section 3.3, we relied extensively on the LR finder algorithm, which pointed us to setting the LR to α = 1×10⁻³ for pre-training and α = 2×10⁻⁴ for re-training.
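The two-stage procedure can be sketched as follows (the masking ratio, dummy data loaders and reconstruction head are illustrative assumptions; SOCTransformer, masked_mse_loss and log_cosh_loss refer to the earlier sketches):

```python
import torch

model = SOCTransformer()

# Dummy stand-ins for the real data loaders (shapes only).
unlabeled_loader = [torch.randn(8, 400, 3) for _ in range(4)]
labeled_loader = [(torch.randn(8, 400, 3), torch.rand(8)) for _ in range(4)]

# Stage 1: unsupervised pre-training on unlabeled windows.
opt = torch.optim.RAdam(model.parameters(), lr=1e-3)   # alpha for pre-training
recon_head = torch.nn.Linear(64, 3)                    # reconstruct V, I, T
for x in unlabeled_loader:
    mask = (torch.rand_like(x) > 0.15).float()         # 15% masking: assumption
    x_tilde = mask * x                                 # corrupted input (Sec. 3.3)
    z = model.encoder(model.pos(model.embed(x_tilde)))
    loss = masked_mse_loss(recon_head(z), x, mask)     # Eq. 6
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: downstream re-training on labeled data (5 epochs in the paper).
opt = torch.optim.RAdam(model.parameters(), lr=2e-4)   # alpha for re-training
for epoch in range(5):
    for x, y in labeled_loader:
        loss = log_cosh_loss(model(x).squeeze(-1), y)  # Eq. 7
        opt.zero_grad(); loss.backward(); opt.step()
```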
In Step 4, the model was evaluated on its SOC estimation accuracy (Section 2.1) and on the influence of pre-training and SSL on its performance (Section 2.2). The performance of the model was quantified with the RMSE and MAE metrics. Finally, the performance of the model was compared to various widely used DL models on the same metrics.
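The two metrics follow their standard definitions, as in this short sketch (reporting in percent assumes SOC is expressed as a fraction in [0, 1]):

```python
import torch

def rmse(y_hat: torch.Tensor, y: torch.Tensor) -> float:
    """Root-mean-square error, in percent."""
    return 100.0 * torch.sqrt(torch.mean((y_hat - y) ** 2)).item()

def mae(y_hat: torch.Tensor, y: torch.Tensor) -> float:
    """Mean absolute error, in percent."""
    return 100.0 * torch.mean(torch.abs(y_hat - y)).item()
```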
Data availability. The data that support the findings of this study are available from the corresponding author upon reasonable request.
Code availability. The software code and the examined cases that validated our method are available from the corresponding author upon reasonable request.