Fast training of a transformer for global multi-horizon time series forecasting on tensor processing units

Time Series Forecasting (TSF) is essential to key domains, and the Transformer neural network has advanced the state-of-the-art on global, multi-horizon TSF benchmarks. However, the quadratic time and memory complexity of the Vanilla Transformer (VT) hinders its application to Big Data environments; therefore, multiple efficient variants of the VT that lower complexity via sparse self-attention have been proposed. Nevertheless, less complex algorithms do not directly produce faster executions, and machine learning models for Big Data are typically trained on accelerators designed for dense-matrix computation, which perform more slowly with sparse matrices. To better compare the accuracy-speed trade-off of the VT and its variants, it is essential to test them on such accelerators. We implemented a cloud-based VT on Tensor Processing Units to address this task. Experiments on large-scale datasets show that our Transformer achieves good predictive performance when compared to state-of-the-art models, while reducing training times from hours to under 2 min.


Introduction
Time Series Forecasting (TSF) is essential to decision-making in various crucial domains such as science, engineering, business, and economics. Over the last two decades, Machine Learning (ML) algorithms have consolidated as a practical approach to TSF. Two factors contributed to this situation: (1) the increasing size, quality, and availability of historical time series datasets, and (2) the sustained development of powerful ML-oriented computing frameworks. Additionally, progressive advances in Deep Learning (DL) provided innovative neural network architectures able to produce state-of-the-art results in TSF [1], including Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), attention-based mechanisms and, among the latter, Transformers. The Transformer is a neural network architecture that leverages an effective self-attention mechanism to process sequential data. Unlike RNN-based methods, the Transformer can access any part of the time series history regardless of distance, making it potentially more suitable for grasping recurring patterns with long-term dependencies [2]. Transformers have consistently shown the ability to produce state-of-the-art results in fields like natural language processing (NLP) and computer vision (CV). Based on this success, the number of Transformer-based projects has increased exponentially since the architecture's introduction [3], leading to the publication of numerous papers and multiple specialized surveys [4][5][6]. Concerning TSF, self-attention enables Transformers to build predictions by focusing on the most significant parts of the time series history. Moreover, as self-attention allows entirely leaving out convolutional or recurrent components, TSF Transformers admit a degree of parallelization that is out of reach for other architectures.
As a result of the Big Data era, the canonical TSF problem evolved from predicting one step ahead on a single time series to making predictions for multiple time steps into the future (multi-horizon forecasting) over large sets of related time series (multi-series or global forecasting) [7]. Robust approaches that have recently been applied to global, multi-horizon TSF include Learning to Rank techniques [8][9][10] and Transformer-based neural networks. Transformer architectures have been extensively applied to the global multi-horizon TSF problem [2,11,12]. Much of this effort addresses two well-known concerns that hinder the application of the original Vanilla Transformer (VT) [3] to practical TSF environments. First, the full self-attention mechanism makes the VT's speed and memory complexities quadratic in the input sequence length $L$, i.e., $O(L^2)$. Second, concerning multi-step-ahead forecasting strategies [13], VT inference over a multi-horizon is iterative (also called recursive or multi-stage), and thus considerably slower than the MIMO inference available in other architectures. A significant proportion of recent research on TSF Transformers is devoted to the first challenge, designing variants of self-attention that apply more advanced attention strategies while lowering the quadratic computational and memory cost. However, less complexity does not directly translate into more speed, and whether low-complexity, efficient Transformers actually run faster than dense-matrix-based Transformers in practical TSF situations remains unclear. It is worth noticing that the specialized hardware accelerators that usually execute ML workloads are designed for dense-matrix computation, and sparsity, a crucial characteristic of lighter self-attention mechanisms, often makes matrix multiplication slower on these devices [14]. Interesting research lines propose to adapt dense-matrix calculation devices for efficient sparse-matrix computation [15]. Still, it is crucial to investigate the direct implementation and execution of dense-matrix TSF Transformers on well-suited hardware accelerators. The results of such research will serve as a foundation for future comparisons with low-complexity, sparse-attention Transformers when both types are trained on predominantly used ML hardware. This work addresses the task by implementing a dense, full-self-attention, multi-layer, encoder-decoder Transformer on a complete computational ecosystem for ML hardware acceleration based on Tensor Processing Units (TPU). We train a global, multi-horizon TSF model based on the VT architecture on two widely-known datasets, using a specific integration of cloud computing services to ensure TPU performance. The main contributions of this research are:
• We implement a fast Transformer for global, multi-horizon TSF on a cloud computing ecosystem in full accordance with TPU operation.
• Our Transformer generates results close to the state-of-the-art at a small fraction of the computation time reported by the reference models.
• To our knowledge, this is the first implementation of a dense, full-self-attention, encoder-decoder, multi-layer Transformer for TSF on TPUs.
The remainder of this paper is organized as follows. Section 2 discusses related research work concerning two fields: the application of Transformers to TSF, and the application of TPUs to Transformers or TSF. Section 3 describes the materials and methods employed in implementing the proposed architecture. Section 4 reports the experimental results. Finally, Sect. 5 presents conclusions and future work.

Related work
In this section, we discuss previous work related to our research in two specific fields: the application of Transformer-based deep neural networks to TSF problems, and the application of TPU-accelerated computation to forecasting or to Transformers. The intersection of these two research areas provides the closest context for our work, since we have not found projects in the literature with a more direct relation to it.

Background
As a result of the Big Data era, the canonical TSF problem evolved from predicting one step ahead on a single time series to making predictions for multiple time steps into the future (multi-horizon forecasting) over large sets of related time series (multi-series or global forecasting). Figure 1 shows the general process of global multi-horizon forecasting: for each element $i$ in a set of related time series, previous observations $z_i$, associated time-based covariates $x_i$, and static covariates $s_i$ are passed as inputs to a single forecast model. Once trained, the model can predict a given number of time steps into the future of any time series in the set. Time-dependent covariates (e.g., hour-, weekday-, or month-based encodings) generate a temporal alignment that helps the model find patterns across time series dynamics. Static covariates (e.g., geographical location, customer identity, or item category), in conjunction with proper embeddings, enable the model to group time series into subsets with similar behavior. A minimal sketch of these inputs for one series is shown below.
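As an illustration, the following minimal sketch (hypothetical code, not taken from our implementation) assembles the three kinds of inputs for a single hourly series: past observations $z_i$, time-based covariates $x_i$, and a scalar static identifier $s_i$.

```python
import numpy as np

# Illustrative only: build the inputs of Fig. 1 for one hourly series i.
T = 24 * 14                      # two weeks of hourly observations
z_i = np.random.rand(T)          # past observations z_i (stand-in data)

hours = np.arange(T)             # hours elapsed since the series start
x_hour = hours % 24              # hour-of-day encoding
x_weekday = (hours // 24) % 7    # day-of-week encoding
x_i = np.stack([x_hour, x_weekday], axis=-1)  # time-based covariates, d_x = 2

s_i = np.array([42])             # static covariate: a scalar series identifier
```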

Transformers for time series forecasting
The Deep Transformer [16] implements the basic architecture of the original NLP Transformer [3] adapted to the TSF domain, providing a practical context to discuss further model improvements. This basic (vanilla) Transformer is a sequence-to-sequence model with an encoder-decoder architecture, multiple layers featuring multi-head self-attention, layer normalization, residual connections, and an iterative strategy for multi-horizon prediction. The Temporal Fusion Transformer [12] implements an interpretable multi-head attention block on top of a sequence-to-sequence, LSTM-based encoder-decoder. This two-layer architecture for temporal processing learns both short- and long-term dependencies on time series data. It uses a direct strategy for multi-horizon forecasting and can learn regime-specific temporal dynamics from time series.

A vital research line includes projects that improve Transformer performance by replacing the canonical full self-attention mechanism with a sparse mechanism. The LogSparse Transformer [2] implements a decoder-only architecture that employs convolutional self-attention to provide queries and keys with enhanced locality knowledge. It also introduces LogSparse self-attention, which reduces the Transformer complexity from $O(L^2)$ to $O(L(\log L)^2)$, where $L$ is the input sequence length. The Adversarial Sparse Transformer (AST) [17] implements a sparse attention mechanism by replacing the softmax activation function in the attention heads with the entmax activation. It uses an encoder-decoder architecture for primary forecasting tasks and further utilizes it as the generator of a Generative Adversarial Network (GAN). The discriminator of the GAN is trained with predicted and ground-truth time series, which it classifies as generated or real. A composite loss function, including the generator and discriminator losses, is optimized during adversarial training of the complete network. This training mitigates error accumulation during subsequent iterative inference. The Informer [11] proposes ProbSparse self-attention, which reduces the Transformer complexity to $O(L \log L)$. It also introduces a self-attention distilling operation across the Transformer layers. This self-attention mechanism decreases memory usage, and a generative-style decoder produces all the prediction outputs in only one forward step.

A recent research line includes projects that implement series decomposition as a built-in block of the Transformer architecture and entirely replace the canonical or sparse self-attention mechanism. The Autoformer [18] introduces the decomposition of time series into seasonal and trend-cyclical sub-series as an internal operation of the TSF model. It also replaces self-attention with an Auto-Correlation mechanism, which focuses on period-based relationships at the sub-series level. The ETSformer [19] replaces self-attention with exponential smoothing attention and frequency attention, and also redesigns the Transformer architecture with modular decomposition blocks that allow the model to learn to decompose the time series into interpretable components.

Applications of tensor processing units to transformers and forecasting
The massive computing requirements placed by ML workloads led to the continuous development of specialized hardware, like high-performance CPUs, GPUs, and, more recently, TPUs. Since their public release, TPU accelerators have been applied to various research topics that demand very high processing and memory capabilities. Examples of those domains are deep neural network (DNN) training subject to large batch sizes and specialized learning rate algorithms [20,21], distributed evolution strategies for meta-learning [22], acceleration of explainable machine learning [23], and the simulation of quantum physics [24], to mention a few. Regarding the Transformer, NLP and CV projects constitute the majority of TPU-assisted research using this architecture, partially due to the high availability of pre-trained models and TPU-ready software implementations. Examples of the application of TPU-assisted Transformers can be found in machine translation [25], language modeling [26], image generation [27], and visual recognition [28]. Regarding TSF research, TPUs have been applied to weather forecasting via convolutional networks enriched with a one-layer self-attention encoder block [29], and to short-term wind speed forecasting via symbolic regression [30]. However, to the extent of our knowledge, our project is the first implementation of a full self-attention, multi-layer, encoder-decoder Transformer for TSF on TPUs.

Materials and methods
This section covers the materials and the methodology employed in our research. First, the global, multi-horizon TSF problem is defined. On this basis, the Transformer neural network is proposed to model the prediction function. Next, we present TPUs as a convenient computation resource to deal with the huge workload required by the Transformer-based TSF model. Afterward, this section explains the Transformer implementation based on the previous elements. Finally, we present our experimental study in detail.

Global multi-horizon time series forecasting
This subsection defines the global, multi-horizon TSF problem based on the background provided in [2], with minor adjustments to fit our Transformer implementation. Let $\{z_{i,1:t_0}\}_{i=1}^{N}$ be a set of $N$ related univariate time series, where $z_{i,1:t_0} = [z_{i,1}, z_{i,2}, \ldots, z_{i,t_0}]$, let $t_0$ be the conditioning range for time series $i$, and let $\tau \in \mathbb{N}$ be the forecast horizon. We will predict the next $\tau$ time steps (or prediction range) for all time series, i.e., $\{z_{i,t_0+1:t_0+\tau}\}_{i=1}^{N}$. Let $\{x_{i,1:t_0+\tau}\}_{i=1}^{N}$ be a set of associated time-based covariate vectors with dimension $d_x$ that are known over the entire time period, e.g., hour-of-the-day or day-of-the-week. Finally, let $\{s_i\}_{i=1}^{N}$ be a set of associated static covariate vectors of dimension $d_s$ that are known and constant over the entire time period, e.g., a time series identifier. We intend to model the following conditional distribution:

$$p\left(z_{i,t_0+1:t_0+\tau} \mid z_{i,1:t_0},\; x_{i,1:t_0+\tau},\; s_i;\; \Phi\right) \qquad (1)$$

In this one-step-ahead prediction model, multi-horizon forecasts over the prediction range are calculated by adding previously predicted values as new inputs. To fully use observations and covariates, they are concatenated into an augmented matrix as follows:

$$\mathbf{Y}_t = \left[\, y_1 \bullet y_2 \bullet \cdots \bullet y_t \,\right] \in \mathbb{R}^{t \times (d_x + d_s + 1)}, \qquad y_j = (z_j,\, x_j,\, s) \qquad (2)$$

where $[\,\cdot \bullet \cdot\,]$ represents concatenation along the $t$ axis.
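The following sketch illustrates Eq. (2) under the assumptions above (shapes and variable names are ours, for illustration only): each row of the augmented matrix concatenates the observation, the time-based covariates, and the repeated static covariate for one time step.

```python
import numpy as np

# Illustrative construction of Y_t with shape (t, d_x + d_s + 1).
t, d_x, d_s = 336, 2, 1
z = np.random.rand(t, 1)                  # observations z_1:t
x = np.random.rand(t, d_x)                # time-based covariates x_1:t
s = np.tile([[7.0]], (t, 1))              # static covariate s, repeated over t

Y_t = np.concatenate([z, x, s], axis=-1)  # one row y_j = (z_j, x_j, s) per step
assert Y_t.shape == (t, d_x + d_s + 1)
```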

Transformer architecture
We use a Transformer-based model $f$ to predict $z_t$ given $\mathbf{Y}_t$ as $z_t \sim f(\mathbf{Y}_t)$ (see Fig. 2). First, an internal dimension $d_{model}$ is defined for the Transformer, and the augmented matrix $\mathbf{Y}_t$ is linearly projected from $\mathbb{R}^{t \times (d_x + d_s + 1)}$ to $\mathbb{R}^{t \times d_{model}}$. Notice in Fig. 2 (left block) that the conditioning range passed as input to this model is split into an encoder input with length $e$ and a decoder input with length $d$, where the value of $t$ from (2) is given as $t = e + d + 1$ (the last value of the encoder input overlaps the first value of the decoder input, the decoder output is the decoder input shifted one time step to the right, and the only predicted value is the last time step in the decoder output).
The Scaled Dot-Product Attention (Fig. 2, right block) computes a sequence of vector outputs:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

The output of the Multi-Head Attention (MHA) layer is calculated as follows:

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}, \qquad \mathrm{head}_j = \mathrm{Attention}\left(Q W_j^{Q},\, K W_j^{K},\, V W_j^{V}\right)$$

Finally, the MHA output is passed to a Position-wise Feed-Forward Network (FFN) with two dense (fully connected) layers. The first dense layer projects $d_{model}$ to a given $d_{ff}$ using a ReLU activation, while the second dense layer projects back to $d_{model}$. Additional Add and Layer-Normalization layers provide residual connections that allow the gradients to skip the MHA or FFN layers if they do not improve the model.
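For reference, a minimal NumPy sketch of the two blocks just described (single head, no masking or batching; all names and shapes are illustrative):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # stabilize the exponentials
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def position_wise_ffn(X, W1, b1, W2, b2):
    # Two dense layers: d_model -> d_ff (ReLU) -> d_model
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```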

Tensor processing units
A TPU is an application-specific integrated circuit (ASIC) developed by Google to accelerate large-scale ML. Google Cloud (GC) first made TPUs publicly available in 2018, and at the time of this writing, the highest publicly available version is TPU v3. Each of the eight cores in a TPU v3 board provides a Matrix Multiplier Unit (MXU) with 65,536 8-bit multiply-and-add units, a Unified Buffer (UB) with 24 MB of SRAM, and an Activation Unit (AU) with hardwired activation functions. These computational resources are designed to process basic neural-network operations at a very low level: the MXU performs matrix operations, the UB holds intermediate results, and the AU performs the nonlinear operations over the MXU output [31]. A Complex Instruction Set Computer (CISC) controls the TPU resources with an instruction set also designed for neural-network-specific operations. The core of the TPU logic is a systolic array [32]. This chip architecture accelerates matrix multiplications on the MXU by making input values flow through fixed patterns of multiply-and-add units, substantially reducing read-write operations to the UB (see Fig. 3). TPU ML accelerators are only available via Cloud TPU, a high-performance computing service of GC.
Cloud TPU provides a TPU board connected via PCI to a host virtual machine (VM). Inside this combination, the TPU runs the compute-intensive model training and evaluation stages in parallel. TPU computing performance is extremely high (up to 420 TFLOPS), so data infeed usually becomes the operational bottleneck. For this reason, the host VM executes code that feeds data to the TPU as fast as possible. TPU hardware is language-agnostic, so an Accelerated Linear Algebra (XLA) compiler maps the ML-code representations from a specific programming language to TPU machine code. The XLA compiler is designed for linear algebra and vector computations; any programming construct that relies on different operations has to be directed to the host VM's CPU. The MXUs of a TPU can perform at mixed precision, combining different numerical formats in the same computational workload: TPU cores can use the bfloat16 format for multiplications and the float32 format for result accumulation. As most deep learning applications do not need high precision to reach the target accuracy [20], floating-point operations, and consequently the training process, are accelerated if weights, gradients, data, and activations are represented in bfloat16 format. Figure 4 compares the float32, bfloat16, and float16 numerical formats.

Figure 5 shows the cloud-based TPU computing pattern implemented for training the Transformer. We programmed our Transformer with TensorFlow [33] as the front end, based on the Multi-head Attention layer from TensorFlow Addons. This layer is a subclass of the Keras Layer class that expresses the query, key, and value tensors as linear transformations based on the tf.einsum tensor contraction (illustrated below). The encoder and decoder layers that integrate the Transformer are also defined as subclasses of the Keras Layer. To improve TPU performance and to ensure the model's scalability, the Transformer was coded using TPUEstimator, a high-level Application Programming Interface (API) that simplifies TPU usage by handling numerous low-level, hardware-specific details.
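The einsum-based projection mentioned above can be sketched as follows (an illustrative contraction of our own, not the TensorFlow Addons source): a batch of sequences of shape (batch, time, d_model) is projected into per-head queries with a single tensor contraction.

```python
import tensorflow as tf

batch, time, d_model, heads, d_head = 8, 336, 256, 8, 32

x = tf.random.uniform([batch, time, d_model])      # projected input sequences
W_q = tf.random.uniform([d_model, heads, d_head])  # query projection weights

# 'btm,mhd->bthd': contract the model dimension, one query vector per head.
queries = tf.einsum('btm,mhd->bthd', x, W_q)
print(queries.shape)  # (8, 336, 8, 32)
```

Analogous contractions produce the keys and values.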

Implementation
TPUEstimator is built on the basis of Estimator, a high-level API for TensorFlow. TPUEstimator presents an interface similar to Scikit-learn's, with some adaptations that simplify its deployment to production stages [34]. The Estimator class is the base for executing the different stages of the ML model (training, evaluation, and prediction) from the same base code, expressed in a model_fn function. A user-defined function called input_fn produces the input for the Estimator, decoupled from the model_fn. Based on this decoupling, the TPU runs the model_fn, which encloses anything involving trainable variables or gradients, while the host VM's CPU runs the input_fn, which focuses on data preparation and infeed (a minimal sketch is given below). The TPUEstimator API implements all-reduce synchronous data parallelism, which means it evenly splits the training batch among the workers (cores) of the TPU board. Conversely, the ML model is not distributed: every worker keeps a replica of the TensorFlow computation graph. Workers operate synchronously, with each worker performing the same step simultaneously, and a reduction operation is performed across all the workers once the training step is completed. For this cloud-based implementation of the Transformer, we built an advanced architecture by leveraging several GC services. A development VM retrieves TensorFlow scripts and parameters from cloud storage, spins up a Cloud TPU as a training coprocessor, sends the TensorFlow computation graph to the TPU's host VM, and, during training, writes the resulting serialized model back to cloud storage. The TPU's host VM pulls data directly from cloud storage and distributes it to the TPU cores.
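A heavily abridged sketch of this decoupling follows the TF 1.x TPUEstimator pattern described above; the storage path, feature schema, and model builder are placeholders, not our actual code.

```python
import tensorflow as tf  # TensorFlow 1.x API, as used in this work

def parse_example(serialized):
    # Placeholder parser; the feature schema here is hypothetical.
    feats = tf.io.parse_single_example(
        serialized,
        {"inputs": tf.io.FixedLenFeature([336], tf.float32),
         "target": tf.io.FixedLenFeature([1], tf.float32)})
    return feats["inputs"], feats["target"]

def input_fn(params):
    # Runs on the host VM's CPU: data preparation and infeed.
    batch_size = params["batch_size"]  # injected by TPUEstimator
    ds = tf.data.TFRecordDataset("gs://some-bucket/train.tfrecord")  # hypothetical path
    ds = ds.map(parse_example).batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.experimental.AUTOTUNE)

def model_fn(features, labels, mode, params):
    # Runs on the TPU: everything involving trainable variables or gradients.
    predictions = tf.layers.dense(features, 1)  # stand-in for the Transformer graph
    loss = tf.losses.mean_squared_error(labels, predictions)
    optimizer = tf.tpu.CrossShardOptimizer(      # all-reduce across the TPU cores
        tf.train.AdamOptimizer(learning_rate=params["lr"]))
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)
```

A TPUEstimator built from these two functions (with a RunConfig pointing at the Cloud TPU) then drives training from the development VM.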

Experimental study
This subsection covers the datasets, data preprocessing, model parametrization, and testing metrics used in our experimental study.

Datasets and data preprocessing
We trained our Transformer on two widely-known datasets: electricity, which contains hourly time series of the electricity consumption of 370 customers, and traffic, which is composed of 963 time series with hourly readings of car-lane occupancy on San Francisco Bay Area freeways. Figure 6 illustrates the data setup. The setup is completed with two additional components (not shown in the figure): a time-dependent covariate that counts the hours elapsed from the time series starting point to the observation timestamp (age covariate), and a scalar time series identifier that is passed as a static covariate to the model and then mapped into a learnable embedding, which provides a finite-dimensional space where time series with similar behavior tend to group. Datasets were split for the training and test stages using date ranges, as in [35]. Min-max normalization to [0, 1] was applied to both datasets in order to compensate for the large differences in time series values in electricity and to speed up model convergence in traffic. Normalization scalers were fitted only on training data to avoid leaking information from the future into the model. Once preprocessed and separated, data was persisted to cloud storage as sequences of binary strings (TFRecords), in accordance with best-TPU-performance guidelines. A data ingestion pipeline for training the model was defined on the basis of the tf.data.Dataset TensorFlow API. Prefetching was also used for parallel preprocessing, that is, preparing the next batch while the current batch is still being used for training. To ensure good throughput during ingestion from cloud storage, individual TFRecord files were prepared for each time series identifier, resulting in 370 50-MB files for electricity and 963 33-MB files for traffic.
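A sketch of this ingestion pipeline (with hypothetical file patterns; parse_example is the placeholder parser from the previous sketch) could look as follows:

```python
import tensorflow as tf

# One TFRecord file per time series identifier, read in parallel from
# cloud storage and prefetched so the next batch is prepared during training.
files = tf.data.Dataset.list_files("gs://some-bucket/electricity/series-*.tfrecord")

dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,  # read several per-series files concurrently
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

dataset = (dataset
           .map(parse_example,  # placeholder parser, as in the previous sketch
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(10000)
           .batch(256, drop_remainder=True)  # TPUs require static batch shapes
           .prefetch(tf.data.experimental.AUTOTUNE))
```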

Model parametrization
The Transformer was trained using the Adam optimizer [36] with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-8}$, and a custom learning rate schedule

$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right)$$

which increases the learning rate linearly for the first $warmup\_steps$ training steps and then decreases it proportionally to the inverse square root of the step number [3]. The Mean Squared Error (MSE) was used as the training loss function, defined as

$$\mathrm{MSE} = \frac{1}{|\Omega_{train}|} \sum_{(i,t) \in \Omega_{train}} \left(z_{i,t} - \hat{z}_{i,t}\right)^2$$

where $\Omega_{train}$ is the collection of readings for all timesteps $t$ in all time series $i$ in the training dataset, and $|\Omega_{train}|$ is the number of readings in that collection. The dynamics of this learning rate schedule are shown in Fig. 7.
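The schedule can be transcribed directly from the formula above; the printed checks illustrate the warm-up and decay phases (the warmup value shown is illustrative, not our tuned setting).

```python
def lrate(step_num, d_model=256, warmup_steps=4000):
    # d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), as in [3]
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

print(lrate(2000) < lrate(4000))   # True: linear warm-up below warmup_steps
print(lrate(16000) < lrate(4000))  # True: 1/sqrt(step) decay past the peak
```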

Inference
After training the Transformer, we tested it using 24 h (one day) and 168 h (one week) as forecasting horizons. To exactly match the specific one-week inference interval reported by the reference models, seven rolling one-day-ahead predictions were produced on each dataset. Following the convention in the literature, we labeled these inference results as electricity_1d and traffic_1d. To study the predictive performance of our model on a longer horizon, we also produced one-time predictions seven days ahead for each dataset and labeled them electricity_7d and traffic_7d. Normalized Deviation (ND) and Normalized Root Mean Squared Error (NRMSE) were used as metrics for testing the Transformer model, defined as follows:

$$\mathrm{ND} = \frac{\sum_{(i,t) \in \Omega_{test}} \left| z_{i,t} - \hat{z}_{i,t} \right|}{\sum_{(i,t) \in \Omega_{test}} \left| z_{i,t} \right|}$$

$$\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{|\Omega_{test}|} \sum_{(i,t) \in \Omega_{test}} \left( z_{i,t} - \hat{z}_{i,t} \right)^2}}{\frac{1}{|\Omega_{test}|} \sum_{(i,t) \in \Omega_{test}} \left| z_{i,t} \right|}$$

where $\Omega_{test}$ is the collection of readings for all timesteps $t$ in all time series $i$ in the test dataset, and $|\Omega_{test}|$ is the number of readings in that collection.
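Both metrics are direct transcriptions of the formulas above; a minimal NumPy version (with stand-in data) is:

```python
import numpy as np

def nd(z_true, z_pred):
    # Normalized Deviation over all (i, t) readings in the test collection
    return np.abs(z_true - z_pred).sum() / np.abs(z_true).sum()

def nrmse(z_true, z_pred):
    # RMSE normalized by the mean absolute value of the true readings
    rmse = np.sqrt(np.mean((z_true - z_pred) ** 2))
    return rmse / np.mean(np.abs(z_true))

z_true = np.array([10.0, 12.0, 9.0, 11.0])  # stand-in test readings
z_pred = np.array([9.5, 12.5, 8.0, 11.5])
print(nd(z_true, z_pred), nrmse(z_true, z_pred))
```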

Computational cost
To assess the computational cost of our Transformer, we calculated the number of trainable parameters for the architecture layers in the selected configuration (see Table 1). It can be observed that most of the computation is spent on matrix multiplication and accumulation operations in the MHA and FFN layers; in contrast, the embeddings and input projections account for only a small fraction. We kept the architecture parameters fixed to comply with the number of trainable parameters mentioned above, that is: encoder length = 168, decoder length = 168, embedding dimension = 24, number of layers = 2, model dimension = 256, and FFN dimension = 512. Then, we used a grid search to tune two training hyperparameters: the learning rate and the training batch size. As previously stated, the learning rate is given by a schedule that is a function of the number of warmup steps, the model dimension, and the current training step; therefore, the number of warmup steps was the variable that defined the grid search. The number of epochs was fixed to 1 by adjusting the batch size in accordance with the total number of training steps, approximately 1.5× the warmup steps (see the sketch below). Table 2 offers a global summary of the experiment setup, including the dataset statistics and the architecture, training, and testing parameters used in the Transformer.
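The batch-size rule just described reduces to a small calculation; the values below are illustrative, not our grid-search results.

```python
def batch_size_for_one_epoch(num_samples, warmup_steps):
    total_steps = int(1.5 * warmup_steps)  # total training steps ~ 1.5x warmup
    return num_samples // total_steps      # samples per step = batch size

# e.g., ~1.8M electricity training samples with 4,000 warmup steps:
print(batch_size_for_one_epoch(1_800_000, 4_000))  # 300
```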


Results and discussion
This section presents the results of our experiments with the Transformer. First, we present eleven well-renowned TSF models that provide a benchmark for evaluation. Next, the forecasting accuracy and computation times obtained are compared against this benchmark.

Benchmark
We collected from the literature the experimental results obtained by several well-renowned TSF models on the datasets, inference setups, and error metrics that we used. Following is the description of the benchmark models (Transformer-based models have already been described in Sect. 2).

(7) DeepState [40] is an approach to probabilistic TSF that integrates a linear state space model with a jointly-learned RNN, (8) Informer [11], (9) TFT [12], (10) Seq2Seq is a simple sequence-to-sequence [41] TSF model with global context, and (11) MQRNN [42] is a framework for probabilistic TSF that combines sequence-to-sequence neural networks, quantile regression, and multi-horizon direct forecasting.

Forecasting accuracy

Table 3 shows the forecasting accuracy of our Transformer compared to the state-of-the-art models (results from [17], [35], [2], [11], and [12]; the last row of the table shows how our model ranks among the models that provide a metric value for a given combination of dataset and inference type, and the best metric value for each combination is presented in bold). Most of these models report ND metrics for electricity_1d, electricity_7d, traffic_1d, and traffic_7d. The NRMSE metric is only reported by DeepAR and TRMF for electricity_1d and traffic_1d. Informer is evaluated on electricity using a very particular procedure: it takes a two-year, hourly-resolution dataset and splits it into train/val/test sets of 15/3/4 months, and although the model is global, only the MT-320 power consumption time series is used for testing. This evaluation is atypical, so we did not compare our Transformer with Informer. However, the comparison that Informer established with DeepAR, LogSparse, and ARIMA (in terms of MAE and MSE) is presented as a valuable reference.

Our Transformer consistently reaches convergence at a very fast pace, after training for roughly one epoch on the complete datasets, that is, 1.8 million samples for electricity and 3.1 million samples for traffic. On electricity_1d, our model ranks fifth among eleven models for ND and first among the three available models for NRMSE. Concerning traffic_1d, our model ranks seventh among eleven models for ND and second of three for NRMSE. The fact that our model achieves good predictive performance on traffic even though the dataset was not entirely passed to the model during training supports the assumption that our Transformer can learn patterns across time series that exhibit similar behavior. Regarding the seven-day forecasting horizon, with only ND metrics available, our model ranks third of eight for electricity and sixth of eight for traffic. This predictive performance degradation on traffic_7d can be explained by the error accumulation in the iterative inference process and suggests the need to train for more epochs if longer forecasting horizons are planned for this dataset.

Figures 8 and 9 show the test metrics for different training batch sizes of our Transformer on electricity_1d and traffic_1d, respectively. On both datasets and for the two selected inference metrics, the Transformer produces the best results when trained with a batch size of 256. Figure 10 shows that the best inference metrics for a seven-day forecasting horizon are also produced by the Transformer trained with a batch size of 256.


Computation time
Concerning computation time, our Transformer outperforms the models in the benchmark, as shown in Table 4. (Notes to Table 4: results from [35] are reported as total running time, including the training and inference stages, on a single p2.xlarge AWS EC2 compute instance containing 4 CPUs and 1 GPU (not specified); results from [12] were obtained on a single NVIDIA Tesla V100 GPU; in [11], a total training time of 480 min, on an unspecified number of NVIDIA Tesla V100 SXM2 GPUs with 32 GB memory, is reported for this dataset and three similar datasets when the encoder length is set to 168, as in our main experiment.) Our model takes only 89 s of training wall time on electricity and 119 s on traffic to consistently reach the accuracy metrics reported above. This represents an outstanding improvement over the 7 h and 3 h of total running time reported by DeepAR, respectively. The extreme reduction in training computation time achieved by our model does not exclusively apply to DeepAR, a DL architecture that exhibits comparable predictive performance to ours. The Temporal Fusion Transformer (TFT) [12] reported a training computation time on GPU-accelerated hardware slightly over 360 min for the electricity dataset. In the same line, training times close to 480 min are reported for both the LogSparse Transformer and Informer in [11] when these models are trained with the encoder length set to 168 (as in our main experiment) on electricity and three similar datasets, using GPU-accelerated hardware. Inference computation times for our Transformer are highly dependent on the infrastructure where prediction is served; due to the iterative nature of our model, the inference process is not suitable for a TPU-based implementation. Instead, a parallel execution based on separate time series processing was prototyped, obtaining inference computation times comparable in magnitude to those required by training.
A major concern arising from the comparison of computation times in Table 4 is that the benchmark models and our Transformer were trained on substantially different infrastructures and implementations. While DeepAR, TFT, LogSparse, and Informer were trained using GPUs to accelerate a non-specific training workload (only minor adjustments are required to execute the same workload on CPU-based architectures), we set up and executed a highly specialized configuration that arranges hardware and software as a whole to accelerate one particular target DL algorithm. In other words, a considerable part of our contribution resides in the design and implementation of a training workload that is specific to the combination of the hardware accelerator we used (a TPU node), the DL framework it was executed on (TensorFlow 1.15), and the algorithm that supports the model we trained (a dense-matrix self-attention Transformer for TSF). By carefully integrating these three computing layers, we achieved an unprecedented speed in training models that produce close-to-state-of-the-art predictive performance in several dataset/inference-type combinations predominant in the field. As such, we do not intend to provide an even-handed comparison of our Transformer with the models in the benchmark, but to demonstrate the substantial decrease in computing times our integral hardware-software-algorithm setup can achieve. It is worth mentioning that the considerable speed improvement we present can be extended to other DL models, at the cost of implementing them according to the key elements of the proposed configuration: decoupled data-infeed and model-training functions, robust cloud-based execution and artifact management, and efficient data serialization.
When used for workloads that involve numerous matrix multiplications, TPUs certainly provide more computing performance than GPUs (the NVIDIA V100 GPU reported by two projects in the benchmark delivers 112 TFLOPS at float16 precision, while the TPU v3 delivers up to 420 TFLOPS at bfloat16). However, TPU-based computing environments demand a precise configuration to perform at full speed. The procedures needed to optimize TPU operation are not insignificant, which is partially demonstrated by the fact that most of the forecasting projects presented as related work, especially those that produced the models in the benchmark, do not leverage TPUs. An important question emerges from here: to what extent do the impressive reductions in computation times we present come from our TPU ecosystem configuration, and not only from the TPU itself? Answering this question is not trivial, as it requires developing a new implementation of the VT on TPUs, one that uses the same accelerator in combination with a different software framework, a different execution environment, and different data infeed procedures. As the effort required to implement such a benchmark is out of the time scope of our research, we designed an analogous test of TPU operation without the key elements of our configuration.
To address this task, we implemented a substitute neural network that approximates the MHA and FFN layers in our Transformer with comparable stacked dense layers. This results in a model that has the same number of trainable parameters as ours but uses a much simpler architecture. Before training this substitute network on the benchmark datasets, it was given three additional advantages with respect to our Transformer: (a) it was constructed as a Keras sequential-type network, while our Transformer is a more complex, functional-type network; (b) it was trained using TensorFlow eager execution and was therefore dispensed from writing continuous log summaries, while our graph-execution Transformer keeps an extensive training summary; and (c) it was allowed to manage the complete dataset in memory, while our Transformer is required to extract the dataset from cloud-based storage. Table 5 shows the main characteristics of the basic TPU implementation for the substitute network in contrast with our Transformer. Although training the substitute network on this basic TPU implementation is also much faster than the models in the benchmark, it takes 246% of the time required by our Transformer on electricity (219 s) and 305% of the time required on traffic (363 s). Regardless of the significant execution advantages granted to the substitute network, it is not able to reach the speed of our Transformer. This reveals that leveraging TPUs for adequate DL workloads does, in fact, accelerate training times, but a specific configuration of hardware and software as a whole is indispensable for achieving the TPU's best performance.
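For illustration, the substitute network can be sketched as a Keras Sequential stack of dense layers whose widths mimic the Transformer's $d_{model} = 256$ and $d_{ff} = 512$ dimensions (the layer count and input shape here are our assumptions, not the exact configuration of Table 5):

```python
import tensorflow as tf

substitute = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(336, 27)),
    tf.keras.layers.Dense(256, activation="relu"),  # stands in for one MHA+FFN pair
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),  # second layer pair
    tf.keras.layers.Dense(1),                       # point forecast
])
substitute.compile(optimizer="adam", loss="mse")
```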

Conclusions and future work
Real-world TSF applications demand ML-based models that are not only accurate but also fast. At the same time, TSF research can greatly benefit from frameworks that provide notable reductions in training times. Low-complexity, sparse variants of the VT have consistently advanced the state-of-the-art on TSF but do not consistently report their speed metrics. This situation hinders the analysis of the accuracy-speed trade-off of ML-based models, which is an invaluable tool for operating TSF processes in modern research and production environments. In this work, we presented in detail the accelerated training of a Transformer for global, multi-horizon TSF. We implemented our Transformer on a cloud computing ecosystem that follows the best practices for TPU operation, such as decoupled input and model functions, model training with mixed-precision arithmetic, cloud-based execution and artifact management, and efficient serialization (TFRecord file-based) for data infeed. Experiments on two widely-known, large-scale datasets show that our dense-matrix, full self-attention, multi-layer, encoder-decoder Transformer achieves good predictive performance when compared to several state-of-the-art models, while reducing training computation times from several hours to under 2 min of wall time. As the training computation times of the models used as the benchmark in this paper are comparable to those of most DL models currently proposed to solve the global, multi-horizon TSF problem, our work provides a basis for further research aimed at producing a more balanced comparison, in terms of both predictive performance and speed, of the original VT and its sparse variants.
Future work in this direction includes two research lines: adding sparsity and enhanced locality to our Transformer's self-attention mechanism via the design of convolutional filters (as in [2]) and activation functions (as in [17]), and downsampling the input sequence as it passes through the model's self-attention layers (as in [11]). These contributions have already been shown to increase Transformer accuracy on TSF problems; therefore, they constitute a good benchmark for evaluating the impact that moving the computation process outside the dense-matrix domain will have on the speed of our TPU-accelerated Transformer. On a different line of research, we will work on adding adversarial ML robustness to our Transformer. This is an important research area that has recently gained attention [43,44] and is devoted to developing defense strategies against adversarial attacks on forecasting models. These strategies will complement our fast-training implementation by improving the model's performance in real-world conditions.