A Time Series Forecasting Model Selection Framework using CNN and Data Augmentation for Small Sample Data

The key to accurate time series forecasting is finding the most appropriate forecasting method, so forecasting model selection for time series has become a new research hotspot in the data analysis field. However, most existing meta-learning approaches to forecasting model selection rely on manually selected features, which leads to low efficiency and a lack of objectivity. This paper therefore proposes an improved meta-learning framework for deep-learning-based time series forecasting model selection. Inspired by computer vision, we transform one-dimensional time series into two-dimensional images and use a convolutional neural network to train on and classify these images (model selection). Moreover, to deal with the overfitting caused by small-sample datasets, a sliding-window data augmentation method is used to improve model selection accuracy on small datasets. A large-scale empirical study on the M3 datasets shows that the framework achieves better model selection accuracy and smaller forecasting error than recurrent neural networks (RNN), back-propagation neural networks (BPNN) and traditional time series imaging algorithms. Compared with the traditional time series imaging method, RNN and BPNN, the classification rate (model selection accuracy) of the proposed algorithm is improved by 6.5%, 4.4% and 3.2%, respectively.


Introduction
Time series forecasting plays an important role in the financial industry, from forecasting economic phenomena to forecasting product sales [28]. In recent decades, a great number of time series forecasting methods have been proposed with the aim of improving forecast accuracy. However, from the "no free lunch" theorem [26] we know that no single method is best for every time series.
In the past, time series forecasting model selection chiefly relied on feature selection, and scholars have made many attempts at feature-based univariate time series forecasting model selection. For example, based on 26 time series features, Shah [33] constructed multiple individual selection rules by discriminant analysis; Wang, Smith-Miles and Hyndman [21] applied supervised and unsupervised learning methods to the meta-features of time series to generate rules for selecting the optimal forecasting model; Lemke and Gabrys [25] provided a new set of time series feature meta-learning methods for the NN3 and NN5 datasets and analyzed the results; Widodo and Budi [42] used principal component analysis to reduce the feature dimension and thereby optimize forecasting model selection; Petropoulos et al. [11] proposed "horses for courses" on the M3 dataset [27] and quantified the effects of 7 different time series features on the performance of 14 strong forecasting methods. Recently, Talagala et al. [38] proposed a novel framework using random forests as classifiers and meta-learning for forecasting model selection. Clearly, the choice of features plays an important role in time series forecasting model selection.
At present, most feature-based time series forecasting model selection algorithms rely on manual feature selection. With the advent of the era of automation, however, manual feature selection seems outdated: it is cumbersome and consumes a great deal of manpower and computing resources. Deep learning frameworks that automatically extract time series features therefore offer great inspiration.
In recent years, deep learning has been widely used in time series forecasting. Connor and Martin [8] proposed the recurrent neural network (RNN), taking the original characteristics of the time series as input to predict the subsequent trend; on this basis, Gers proposed an improved RNN called the long short-term memory network (LSTM) for prediction [12]. Naduvil-Vadukootu et al. [29] proposed a pipeline framework that combines mainstream time series forecasting methods with deep neural networks (DNN) to improve forecasting accuracy.
Deep learning needs a large amount of data for training. However, many real-world datasets (such as agricultural product prices or sales volumes) suffer from the small-sample problem. Insufficient data is thus also a problem for time series analysis, leading to overfitting and poor performance.
Data augmentation has been proved to be an effective way to reduce overfitting in neural network models [34]. In practice, the amount of data in many fields can hardly meet the requirements of deep learning training, so data augmentation can help the network overcome too-small datasets or class imbalance [15]. Although generalization and regularization methods can also reduce overfitting, data augmentation solves the problem at the data-preprocessing stage without changing the structure of the neural network.
This paper proposes an improved meta-learning framework to overcome both manual time series feature selection and overfitting on small-sample datasets. Inspired by the achievements of Hatami et al. [16] and Wang and Oates [41] in image processing, this paper combines the idea of time series imaging with a meta-learning framework using a convolutional neural network (CNN) to select the best time series forecasting model. The framework automatically extracts time series features and avoids the inconsistency caused by subjective factors in manual feature selection. At the same time, window-slicing data augmentation is used to alleviate overfitting on small datasets during training, which improves model selection accuracy. The proposed algorithm achieves better results in forecasting model selection than the traditional time series imaging algorithm, RNN and BPNN.

Meta-learning Forecasting Model Selection
John Rice first proposed meta-learning in 1976, under the name of the algorithm selection problem [31]. Rice's algorithm selection structure mainly consists of four parts. The problem space P represents the datasets involved in the experiment; the feature space F is the set of all features of problems in P; the algorithm space A is a group of strong candidate algorithms for solving problems in P; and the performance metric Y measures algorithm performance, e.g. classification accuracy or running speed. Smith-Miles [37] put forward a clear definition of algorithm selection in 2009.
Algorithm selection problem (ASP).For a given problem instance x ∈ P, with features f (x) ∈ F, find the selection mapping S( f (x)) into algorithm space A, such that the selected algorithm α ∈ A maximizes the performance mapping y(α(x)) ∈ Y .
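The ASP definition above can be sketched in a few lines; the candidate "algorithms" and the performance metric here are toy stand-ins, not the forecasting models or criteria used in this paper.

```python
# Sketch of Rice's algorithm selection setup: for an instance x, pick the
# algorithm alpha in A that maximizes the performance mapping y(alpha(x)).

def performance(algorithm, x):
    # Toy metric Y: negative squared error of a one-step-ahead forecast.
    forecast = algorithm(x)
    actual = x[-1]
    return -(forecast - actual) ** 2

def naive_last(x):
    # Forecast the held-out last point with the previous observation.
    return x[-2]

def naive_mean(x):
    # Forecast the held-out last point with the mean of the history.
    history = x[:-1]
    return sum(history) / len(history)

def select(x, algorithms):
    # The selection mapping S: argmax over the algorithm space A.
    return max(algorithms, key=lambda a: performance(a, x))

series = [1.0, 2.0, 3.0, 4.0, 5.0]
best = select(series, [naive_last, naive_mean])
# On this trending toy series the last-value forecast wins.
```

The meta-learning methods discussed below replace the brute-force argmax with a learned mapping from features f(x) to the best algorithm.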
The main bottleneck of the ASP is learning the selection mapping from feature space to algorithm space. Although Rice's framework states the concept of the ASP, it does not specify how to obtain S, which is where meta-learning comes in.
With the wide application of machine learning, the term meta-learning also appears in the time series literature. Prudêncio and Ludermir [30] were the first to apply meta-learning to time series and discussed the influence of meta-learning methods on model selection. Wang, Smith-Miles and Hyndman [37] introduced a new model selection method based on a meta-learning framework, called simple percentage better (SPB), whose model selection accuracy varies with the forecasting error of the random walk model. Later, Widodo and Budi [42] proposed a novel meta-learning framework for forecasting model selection based on the set of features proposed by Wang, Smith-Miles and Hyndman. Recently, Kück, Crone and Freitag [24] combined neural networks with meta-learning to select the forecasting model, constructed a set of new features based on forecasting error, and used the mean absolute forecasting error as the evaluation standard to determine the best forecasting model for each time series.

CNN-based Forecast-model Selection (CFMS)
The proposed CFMS framework is presented in Fig. 1. It can be divided into four parts: time series preprocessing, time series visualization, CNN classification and model selection, and evaluation criteria. First, in the preprocessing stage, the time series are divided into training, validation and test sets. Data augmentation (SWDA) is then performed on the training set, and labels are assigned to the augmented time series. The label set used in this paper consists of six forecasting models, introduced in the following section. The duty of the model selection algorithm is to assign the "best" forecasting method to a given time series. Since it is impossible to train classifiers over all possible model classes, we choose six of the most popular time series forecasting models. The selected candidate models depend on the type of time series: for example, if the time series is non-seasonal and non-chaotic, the candidate forecasting models can be limited to white noise, random walk, ARIMA and ETS processes. Even in this simple scenario, the number of matching models can be quite large.
This step is the most computationally intensive and time-consuming, because each candidate model must be fitted and compared on every time series in the reference set. The more candidate models there are, the longer the computation takes, though the return may be a significant improvement in forecasting accuracy.
The pseudo code for our proposed framework is presented in Algorithm 1 below.

Exponential Smoothing
This model has been developed over several decades and was first proposed by Brown [6].
It is the basis of many popular time series prediction algorithms. In exponential smoothing, the time series is usually modeled as a combination of components such as level, trend, seasonality and damping, combined multiplicatively or additively [43]. The "ZZZ" model used in this paper refers to the best ETS model automatically selected by the R forecast package according to the AIC criterion [3].

Theta Method
The Theta method is a univariate method for non-seasonal time series forecasting. It decomposes the original time series into "theta lines" obtained by modifying the second-order differences of the series. Each decomposed line is extrapolated by a forecasting algorithm, and the individual forecasts are recombined to obtain the forecast of the original time series [2]. The algorithm is implemented in the R package forecTheta [13].

Random Walk and Random Walk with Drift
The random walk (RW) is often used in statistical models of financial data because of its effectiveness. The model assumes that adjacent observations provide guidance for the next predicted value [39]. The mathematical expression of the RW model is y_t = y_{t−1} + ε_t, where y_{t−1} and y_t are observed values of the time series and ε_t is white noise. The white noise term follows a normal distribution with zero mean and constant variance σ².
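The point forecasts implied by the RW model, with and without drift, can be sketched as follows; the series and horizon are toy values.

```python
# Sketch of random-walk forecasts implied by y_t = y_{t-1} + eps_t:
# the point forecast repeats the last observation; with drift, the
# average historical step is added once per horizon step.

def rw_forecast(y, h):
    # Random walk: the last observed value, h steps ahead.
    return [y[-1]] * h

def rw_drift_forecast(y, h):
    # Drift = mean first difference = (y_n - y_1) / (n - 1).
    drift = (y[-1] - y[0]) / (len(y) - 1)
    return [y[-1] + drift * (k + 1) for k in range(h)]

y = [10.0, 12.0, 14.0, 16.0]
rw_forecast(y, 2)        # -> [16.0, 16.0]
rw_drift_forecast(y, 2)  # drift = 2.0 -> [18.0, 20.0]
```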

ARIMA
In the ARIMA model, the forecast of a variable is a linear function of several past observations and random errors. In terms of its mathematical expression, the generating process of the time series has the form

y_t = θ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + ... + φ_p y_{t−p} + ε_t − θ_1 ε_{t−1} − θ_2 ε_{t−2} − ... − θ_q ε_{t−q}, (2)

where y_t and ε_t are the actual value and random error of time period t respectively, and φ_i (i = 1, 2, ..., p) and θ_j (j = 0, 1, 2, ..., q) are the model parameters. The integers p and q are usually called the orders of the model. The random error ε_t has zero mean and constant variance σ². Several special cases of ARIMA are included in equation (2): if q = 0, then (2) becomes a p-order AR model; when p = 0, the model reduces to a q-order MA model. One of the core tasks of ARIMA model construction is to determine the appropriate model orders (p, q).
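As a minimal illustration of the AR side of the equation above, the sketch below fits an AR(1) coefficient by least squares and iterates one-step forecasts. This is a toy, not the automatic order selection used in the paper; the example series is an exact AR(1) process so the fit is recoverable by hand.

```python
# Toy AR(1): y_t = phi * y_{t-1} + eps_t, with phi estimated by
# least squares over the lag pairs (y_{t-1}, y_t).

def fit_ar1(y):
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

def ar1_forecast(y, phi, h):
    # Iterate the recursion h steps ahead from the last observation.
    preds, last = [], y[-1]
    for _ in range(h):
        last = phi * last
        preds.append(last)
    return preds

y = [1.0, 0.5, 0.25, 0.125]   # exact AR(1) with phi = 0.5, no noise
phi = fit_ar1(y)               # -> 0.5
ar1_forecast(y, phi, 2)        # -> [0.0625, 0.03125]
```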

Data Augmentation
In order to test whether the proposed framework can identify the best forecasting models, the M3 datasets are used in this paper. Table 1 shows the composition of the M3 datasets. However, there are only 3003 time series in the M3 competition, so a data augmentation method is needed to extend the datasets.
Data augmentation is also widely used in time series forecasting research; it can be regarded as injecting prior knowledge about the invariance of the data under certain transformations. The augmented data expand the training set, prevent overfitting and improve the robustness of deep learning models [1]. Permutation (Perm), for example, is a simple method that perturbs the temporal position of events within a random window: the time series is divided into N segments of equal length, and the N segments are randomly rearranged to generate a new sequence. Time warping (TimeW) also perturbs temporal position, by smoothly distorting the time intervals between samples; whether the original label is retained depends on the magnitude of the distortion. Scaling changes the magnitude of the data in a window by multiplying by a scaling factor, while magnitude warping changes each sample by convolving the data window with a smooth curve that varies around one. In addition, jittering augments data by adding noise. These augmentation methods can improve the robustness and generalization ability of the trained model and improve performance. Finally, cropping, similar to image cropping [18], reduces the dependence on event location; cropping at random positions in different periods yields an optimal sliding-window step. It is worth noting that crops may retain non-informative regions, resulting in label changes. As in image recognition, small changes caused by jittering, scaling, cropping, warping and rotation may not change the data label.
In this paper, the window slicing method of Le Guennec et al. [14] is applied: multiple small windows are extracted from a single series, shortening part of the data window to augment the data. One advantage of our framework is that it can expand smaller datasets to meet the experimental conditions. At present, most public or real-life datasets, such as agricultural product prices and financial series, are limited in size. To address this, this paper applies data augmentation to the original datasets to avoid overfitting and improve generalization ability. For massive datasets with rich training data, data augmentation may not be necessary. The proposed window slicing augmentation is as follows. For a time series T = t_1, t_2, ..., t_n, a window slice is a fragment of the original series, defined as S_{i:j} = t_i, t_{i+1}, ..., t_j, 1 ≤ i ≤ j ≤ n. Assuming the length of a given time series is n and the slice length is s, the slicing operation generates a set of n − s + 1 sliced time series:

Slicing(T, s) = {S_{1:s}, S_{2:s+1}, ..., S_{n−s+1:n}},

where every series in Slicing(T, s) carries the same label as its original time series T. Because the length of each time series in the M3 datasets differs, the value of s is variable; we choose s = n − 2, so the augmented set becomes three times the size of the original. The reason for choosing the augmentation multiple m = 3 is that the best window slice length in [14] is 90% of the original. Later, Mooseop Kim et al. [22] compared the effect of data augmentation with different window slicing ratios using sensor data (Fig. 2 shows the relationship between classification accuracy and scaling factor in reference [22]). Their results show that when the scaling factor is 0.1, i.e. the slice window length is 90% of the original length, the classification accuracy is highest. On the basis of [22], a mathematical expression is fitted to the known data and the orange line in Fig. 2:

y = a · exp(−(x − 0.1)² / b),

where y is the classification rate, x is the scaling factor, and a and b are constants. This local Gaussian function reaches its maximum at scaling factor x = 0.1, because in this case the sliced time series not only augments the data but also retains the periodicity, trend and other features of the original series. The length of the M3 time series selected in this paper is mostly around 20, so with augmentation factor m = 3 the sliced length is 18, which satisfies the condition in [14]. The data augmentation method used in this paper is shown in Fig. 3.
All time series in a given training dataset are sliced, and the sliced time series are treated as independent training data. The experimental results show that the label (best forecasting model) of a slice may not be consistent with that of the original series.
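The window-slicing operation described above can be sketched as follows; the example series and the "ETS" label are toy values, and m = 3 corresponds to the paper's choice s = n − 2.

```python
# Sliding-window (window-slicing) augmentation: every contiguous slice
# of length s inherits the label of the original series.

def slicing(series, s):
    n = len(series)
    return [series[i:i + s] for i in range(n - s + 1)]

def augment(series, label, m=3):
    # m = 3 follows s = n - 2, which yields n - s + 1 = 3 slices.
    s = len(series) - (m - 1)
    return [(sl, label) for sl in slicing(series, s)]

pairs = augment([1, 2, 3, 4, 5], label="ETS")
# -> 3 slices of length 3, all labeled "ETS"
```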

Gramian Angular Field
The Gramian Angular Field (GAF) can be divided into the Gramian Angular Summation Field (GASF) and the Gramian Angular Difference Field (GADF). In a GAF [41], the time series is represented in a polar coordinate system rather than the traditional Cartesian coordinate system, and each element of the Gramian matrix is the cosine or sine of a sum or difference of angles. Given a time series X = x_1, x_2, ..., x_n of length n, X is first normalized so that all values fall in [−1, 1] or [0, 1]:

x̃_i = ((x_i − max(X)) + (x_i − min(X))) / (max(X) − min(X))  or  x̃_i = (x_i − min(X)) / (max(X) − min(X)).

By encoding the normalized values as angular cosines and the time points as radii, the series can be expressed in polar coordinates:

φ_i = arccos(x̃_i), −1 ≤ x̃_i ≤ 1;  r_i = t_i / N,

where t_i is the time point and N is a constant that regularizes the span of the polar coordinate system. This is a novel time series visualization method. In the Cartesian coordinate system, if f(x(t)) takes the same values on [i, i + k] and [j, j + k], the corresponding areas satisfy S_{i,i+k} = S_{j,j+k}. In polar coordinates, however, this equality no longer holds: the area swept from time point i to time point j depends not only on the time interval |i − j| but also on the absolute values of i and j.
Data rescaled to different intervals have different angular bounds: [0, 1] corresponds to the cosine function on [0, π/2], while values in [−1, 1] fall into the angular bounds [0, π]. The GASF and GADF are defined as follows:

GASF = [cos(φ_i + φ_j)],  GADF = [sin(φ_i − φ_j)],

where, writing I for the unit row vector [1, 1, ..., 1], GASF = X̃′X̃ − (√(I − X̃²))′√(I − X̃²) and GADF = (√(I − X̃²))′X̃ − X̃′√(I − X̃²). After transforming to the polar coordinate system, we regard the time series at each time step as a 1-D metric space; by defining the inner product ⟨x, y⟩ accordingly, the two Gramian Angular Fields (GAFs) are actually quasi-Gramian matrices. The GAFs have several advantages. First, they preserve temporal dependency, since time increases as the position moves from top-left to bottom-right. They also contain temporal correlations: G_{i,j} with |i − j| = k represents the correlation for time interval k, and k = 0 is the special case of the main diagonal, which contains only the angles of the original values. One drawback is that the GAFs are large: the Gramian matrix is n × n when the length of the raw time series is n.
The transformation maintains the temporal dependence between values and provides temporal correlation through superposition with respect to the time interval, and the mapping is bijective. Therefore, the original data can be exactly reconstructed by inverting the transformation, as shown in Fig. 4.
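A minimal numpy sketch of the GASF/GADF construction described above, using the [−1, 1] rescaling; the input series is a toy example.

```python
import numpy as np

# GAF sketch: rescale to [-1, 1], map values to angles phi = arccos(x),
# then form GASF = cos(phi_i + phi_j) and GADF = sin(phi_i - phi_j).

def gaf(x):
    x = np.asarray(x, dtype=float)
    # Min-max rescale to [-1, 1].
    x = (2 * x - x.max() - x.min()) / (x.max() - x.min())
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf

gasf, gadf = gaf([0.0, 1.0, 2.0, 3.0])
# The GASF diagonal equals cos(2*phi_i) = 2*x_i**2 - 1, which is why the
# original (rescaled) series can be read back off the diagonal; the GADF
# diagonal is zero since sin(0) = 0.
```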

Markov Transition Field
Fig. 4: Illustration of the proposed encoding map of Gramian Angular Fields. Taking GADF as an example (the formation of GASF is similar): X is a rescaled time series from the M3 datasets, transformed into the polar coordinate system by Eq. (7); its GASF image is then calculated with Eq. (12).

We take inspiration from Campanharo et al. [7]. For a time series X, determine Q quantile bins and assign each x_i to its corresponding bin q_j (j ∈ [1, Q]). We then construct a Q × Q weighted adjacency matrix W by counting transitions between quantile bins along the time axis in the manner of a first-order Markov chain: w_{i,j} is the probability that a point in quantile q_j is followed by a point in quantile q_i. After normalization by Σ_j ω_{i,j} = 1, W is the Markov transition matrix. W is insensitive to the distribution of X and to the temporal dependency on the time steps t_i; however, our experiments on W demonstrate that discarding the temporal dependency loses too much information. To overcome this disadvantage, the Markov Transition Field (MTF) is defined as follows. A Q × Q Markov transition matrix is built by dividing the data into Q quantile bins, where the entry for q_i → q_j denotes that transition's probability. By taking time and location into account, the matrix W is extended to an MTF matrix containing the transition probabilities along the magnitude axis: each pixel M_{i,j} holds the probability of transitioning from the quantile bin at time step i to the quantile bin at time step j, so the essence of the MTF is the multi-span transition probability of the encoded time series. M_{i,j} with |i − j| = k represents the transition probability between two points with time interval k. Figure 5 shows the procedure for encoding a time series as an MTF.
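The MTF construction above can be sketched in numpy as follows; the quantile-edge and binning choices (np.quantile / np.digitize, Q = 4) are implementation assumptions, and the monotone input series is a toy example.

```python
import numpy as np

# MTF sketch: discretize into Q quantile bins, estimate the Q x Q
# transition matrix W along the time axis, then spread W[bin(x_i), bin(x_j)]
# onto every pixel (i, j) of the n x n field.

def mtf(x, Q=4):
    x = np.asarray(x, dtype=float)
    # Interior quantile edges give Q bins labeled 0..Q-1.
    edges = np.quantile(x, np.linspace(0, 1, Q + 1)[1:-1])
    bins = np.digitize(x, edges)
    # First-order Markov transition counts along the time axis.
    W = np.zeros((Q, Q))
    for t in range(len(x) - 1):
        W[bins[t], bins[t + 1]] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)  # row-normalize
    # M[i, j] = transition probability from bin(x_i) to bin(x_j).
    return W[bins[:, None], bins[None, :]]

field = mtf(np.arange(8.0), Q=4)   # 8 x 8 MTF of a monotone toy series
```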

Recurrence Plot
In this part, recurrence plots (RP) are applied to transform time series into images. The recurrence plot provides a method for visualizing the periodicity of trajectories through phase space [10], and it preserves most of the relevant dynamic features of the time series. The recurrence plot of a time series x can be expressed as

R_{i,j} = Θ(ε − ||x_i − x_j||), (17)

where R_{i,j} is an element of the recurrence matrix R; i indexes time on the x-axis of the recurrence plot and j indexes time on the y-axis; ε is a predefined threshold; and Θ(·) is the Heaviside function. In short, a black dot appears when the distance between x_i and x_j is smaller than ε. A modified, thresholdless RP [40] is used to balance the binary output: compared with the binary method, it generates denser points and can produce color images, as can be seen in Fig. 6.

Fig. 5: Illustration of the proposed encoding map of Markov Transition Fields. X is a time series from the M3 dataset, first discretized into Q quantile bins (here Q = 4); its Markov transition matrix W is then calculated, and the MTF is built with Eq. (15).

Fig. 6: Illustration of the proposed encoding map of recurrence plots. X is a time series from the M3 dataset; its RP is built with Eq. (17).
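Both the binary RP of Eq. (17) and the thresholdless variant can be sketched in a few lines of numpy; the input series and threshold are toy values.

```python
import numpy as np

# Recurrence plot sketch: R[i, j] = 1 when |x_i - x_j| < eps (Heaviside of
# eps minus the distance); with eps=None the raw distance matrix is kept,
# giving the thresholdless "color" variant.

def recurrence_plot(x, eps=None):
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])
    if eps is None:
        return d                      # thresholdless (modified) RP
    return (d < eps).astype(int)      # binary RP

x = [0.0, 0.1, 1.0]
recurrence_plot(x, eps=0.5)
# -> [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```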

Data Augmentation Time Series Imaging (DA-TSI)
This paper combines data augmentation with time series image transformation and proposes an integrated data augmentation time series imaging (DA-TSI) algorithm. The time series image augmentation proposed here differs from traditional image augmentation. In image recognition, data augmentation has become a convention: most state-of-the-art convolutional neural network (CNN) [44] architectures use some form of it. For example, Alexnet [23], one of the first deep CNNs to set a record benchmark on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) datasets [32], uses cropping, mirroring and color augmentation; other examples include the original VGG network [36], which uses scaling and cropping, the Resnet work [17] using scaling, cropping and color augmentation, Densenet [19] using translation and mirroring, and the Inception network using cropping and mirroring. The DA-TSI algorithm, in contrast, preprocesses the time series first (see Sect. 2.4 for details) and then uses the time series imaging algorithms to generate images. The advantage is that the features of the original time series are better protected from loss: because the image generated from a time series resembles a mosaic, augmenting the image itself would lose many time series features. Therefore, the DA-TSI algorithm has a theoretical advantage.

Convolutional Neural Networks
Convolutional neural networks (CNNs) have made remarkable achievements in image classification [39], natural language processing [9] and reinforcement learning [35]. For time series forecasting, CNNs can capture subtle differences in the underlying datasets, with customized architectures [4] and complex data representations [5] that reduce the work of manual feature engineering and model design.
In this paper, three deep learning frameworks are applied to test the generalization performance of the proposed algorithm. The three deep CNN models differ in network depth and structure, so if the algorithm performs well on all three, it can plausibly be applied to other deep learning models.

Basic Idea of Residual Learning
He et al. [17] put forward an improved CNN model for image classification called the deep residual network. The main difference between a residual network and a traditional CNN lies in the network structure and information flow, as shown in Fig. 7. In a traditional CNN, the input layer, convolution layers, pooling layers and output layer are combined in a cascade; a residual network, in contrast, has a shortcut that connects input and output directly. Mathematically, rather than directly approximating the underlying mapping H(x), residual learning fits the residual mapping F(x). The output of a residual block is F(x) + x, which equals the output H(x) of a traditional CNN block. However, as He et al. pointed out, fitting the residual mapping F(x) is more effective than fitting the original mapping H(x), especially when H(x) is an identity or near-identity mapping. This property allows the depth of the network to increase greatly without reducing classification accuracy.

Fig. 8: The architecture of VGGnet.
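The residual identity H(x) = F(x) + x can be illustrated with a toy numpy block; the two-linear-layer form of F and the omission of biases and convolutions are simplifying assumptions, not the Resnet-18 block used in the experiments.

```python
import numpy as np

# Toy residual block H(x) = F(x) + x: the block only has to learn the
# residual F, so an identity mapping is trivially representable with
# all-zero weights (F = 0), which is the case He et al. highlight.

def residual_block(x, W1, W2):
    # F(x): two linear maps with a ReLU in between (biases omitted).
    f = np.maximum(W1 @ x, 0.0)
    return W2 @ f + x             # the shortcut adds the input back

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))
residual_block(x, W_zero, W_zero)   # identity: returns x unchanged
```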

Basic Idea of Visual Geometry Group Network
The Visual Geometry Group network (VGGnet) is a multi-layer neural network. VGGnet stacks 3 × 3 convolution layers, which increases the depth of the network, and uses max-pooling layers to reduce the size of the feature maps. The two fully connected (FC) layers contain 4096 neurons each. The architecture is shown in Fig. 8.
In the training stage, convolution layers extract features, and max-pooling layers together with some convolution layers reduce the feature dimension. The first convolution layer has 64 kernels of size 3 × 3. Fully connected layers are used to construct the feature vectors. Finally, in the test phase, the Softmax activation function is used to classify the images.
VGGnet systematically studies the influence of network depth on classification performance, and constructs a deeper structure on the basis of shallow layer [20].

Brief Introduction of Densenet
Densenet [45] is a CNN architecture proposed in recent years. It has a new connection mode: dense block connectivity. Within a dense block, each layer is connected to all other layers, so every layer can access the output features of the preceding layers, which enhances feature reuse. This makes the model more compact and helps prevent overfitting. These properties make Densenet well suited to image recognition: it achieves state-of-the-art performance without pre-training or additional post-processing.
Traditional CNNs, such as FlowNets, compute the output of the l-th layer by applying a nonlinear transformation H to the previous layer's output: x_l = H(x_{l−1}). After processing by convolution and pooling layers, a traditional CNN obtains semantic features at the top, but these features are coarse, and fine image details often disappear inside the network. To improve the information flow between layers, Densenet introduces an improved connection mode: the l-th layer takes the feature maps of all preceding layers as input,

x_l = H([x_0, x_1, ..., x_{l−1}]),

where [x_0, x_1, ..., x_{l−1}] is the single tensor formed by concatenating the output feature maps of the preceding layers. In this way, even the last layer can share features with the first, and the loss function directly supervises each layer through the shortcut connections, as can be seen in Fig. 9.

Fig. 9: The architecture of Densenet.
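The dense connectivity pattern above can be sketched with toy numpy layers; the random linear-plus-ReLU transformation stands in for the real H_l (BN, ReLU, convolution), and the growth rate k = 2 and channel widths are illustrative values.

```python
import numpy as np

# Dense connectivity sketch: layer l receives the concatenation
# [x_0, x_1, ..., x_{l-1}] of all earlier feature maps, so the input
# width grows by a fixed "growth rate" k at every layer.

def dense_layer(features, k, rng):
    inp = np.concatenate(features)           # [x_0, ..., x_{l-1}]
    W = rng.standard_normal((k, inp.size))   # toy stand-in for H_l
    return np.maximum(W @ inp, 0.0)          # ReLU(W [x_0 .. x_{l-1}])

rng = np.random.default_rng(0)
features = [np.ones(4)]                      # x_0 with 4 channels
for _ in range(3):                           # three dense layers, k = 2
    features.append(dense_layer(features, 2, rng))
widths = [f.size for f in features]          # -> [4, 2, 2, 2]
# The final concatenated tensor has 4 + 3*2 = 10 channels.
```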

Experiment and Result Analysis
In this section, the data-augmented time series imaging (DA-TSI) algorithm is applied to small datasets, and theoretical analysis is carried out in combination with the experimental results. Finally, the DA-TSI algorithm is compared with the time series imaging (TSI) baseline algorithm, RNN and BPNN.

Baseline Algorithm
In this paper, the TSI-CNN algorithm without data augmentation, together with the RNN and BPNN algorithms before and after data augmentation, are used as benchmark models. Variables are controlled to obtain scientific experimental results. Three different network structures, Resnet-18, VGG-11 and Densenet-121, are used as the general training models of the deep learning method and applied to the baseline deep-CNN time series imaging method.

Datasets
The time series of the M3 datasets are used in this paper. The M3 datasets include relatively complex micro-economic and industrial data, which helps verify the generalization ability of the proposed method. The specific composition is shown in Table 1.

Model Evaluation
The evaluation criteria used to verify model selection are the classification accuracy, obtained by comparing the labels of the test set with the optimal labels, and the Mean Absolute Percentage Error (MAPE), computed from the model selection results. This paper therefore uses two standards: one is classification accuracy, the other is forecasting error.
The classification accuracy can be expressed as

accuracy = (TP + TN) / All, (21)

where true positives (TP) is the number of correctly classified positive examples and true negatives (TN) is the number of correctly classified negative examples.
The forecasting error used in this paper is the Mean Absolute Percentage Error (MAPE). The benchmark models in this paper are four different single-image generation methods and six econometric models.

MAPE = (100% / n) Σ_{t=1}^{n} |Y_t − Ŷ_t| / Y_t,

where Y_t is the real value of the time series at point t, Ŷ_t is the forecast and n is the forecasting horizon.
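The two evaluation criteria can be sketched as follows; the label lists and forecast values are toy examples.

```python
# Sketch of the two evaluation criteria: classification accuracy over
# the selected-model labels, and MAPE of the resulting forecasts.

def accuracy(true_labels, pred_labels):
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)

def mape(actual, forecast):
    # 100/n * sum |Y_t - Yhat_t| / Y_t over the forecast horizon.
    n = len(actual)
    return 100.0 / n * sum(abs(y - f) / abs(y) for y, f in zip(actual, forecast))

accuracy(["ETS", "ARIMA", "RW"], ["ETS", "RW", "RW"])   # -> 2/3
mape([100.0, 200.0], [90.0, 220.0])                     # -> 10.0
```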

Parameter Setting
In our experiment, we used Python 3.7 and R. The size of each of the four single images is 359 × 359, and after resizing, the combined image (GADF-GASF-MTF-RP) is also 359 × 359. The parameters of the pre-trained CNN models are set as follows: the output dimension of the pre-trained VGG-11bn is 1000; of the pre-trained Resnet-18, 512; and of the pre-trained Densenet-121, 1000. The learning rate of the CNNs is 0.001, and the batch size is 16.

Result Analysis
Three deep learning models with different network structures, ResNet-18, DenseNet-121 and VGG-19, are applied to the M3 datasets. The M3 datasets are divided into training, validation and test sets at a ratio of 8:1:1. The four kinds of TSIs generated from the M3 datasets and the four DA-TSIs are then fed into the three convolutional neural networks to compare their classification accuracy (model selection accuracy). To save space, the complete results on single-channel images are not shown. Analysis of the experimental results leads to the following conclusions:

1. As Tables 2, 3 and 4 show, the average classification rates of GADF and GASF under the three deep CNNs are 6.6% and 5.5% higher than that of MTF, respectively. After data augmentation, the average classification rates of DA-GADF and DA-GASF under the three deep CNNs are 2.8% and 3.6% higher than that of DA-MTF, respectively. Beyond the potential risk of overfitting, we find that after training with the three different CNNs, the classification rate of MTF on the test set is generally slightly lower than that of the GAFs under the same algorithm. This may be due to the uncertainty of the inverse mapping of MTF relative to GAF: although both the GAF and MTF image maps of a standardized time series are surjective, on a time series standardized to [0, 1] the mapping function of GAF is bijective while that of MTF is not. The original time series can be reconstructed from the diagonal of the GAFs, but it is very difficult to even roughly recover the signal from the MTF (Table 5).

2. As Tables 2, 3 and 4 show, when the forecasting steps are 1, 3 and 6, the classification rates of the data-augmented time series images under the CNN classifiers improve by 2.0%, 5.7% and 11.0%, respectively, over the traditional time series images. On the whole, the classification accuracy of the proposed DA-TSI algorithm improves as the forecasting step increases. The reason may be that a larger forecasting step increases the fluctuation of the time series, so the time series images become more distinguishable, which helps the CNNs reach a higher classification rate. This indicates that the algorithm is better suited to medium- and long-term forecasting on small datasets.

3. From Tables 2 and 6, the classification rate of the DA-TSI-MTF algorithm improves over the original MTF algorithm when input into the DenseNet and VGG networks, but the forecasting error MAPE increases by 0.14 and 0.06. The reason is that some wrongly selected forecasting models may produce very large forecasting errors, which inflate the overall average error, so the classification rate and the error rise together. Conversely, the classification rate of the DA-TSI-GASF algorithm input into the VGG network is slightly lower than that of the TSI-GASF algorithm, but its forecasting error MAPE is reduced by 0.27, which also shows that the merits of an algorithm should be judged by multiple criteria (Table 7).

4. From Tables 4 and 8, when the forecasting step is 6, the various CNNs achieve their best classification results for the DA-TSI image algorithm compared with the other steps, but the error is also the largest. Even though the classification rate, i.e., the model selection accuracy, increases with the step size, a step size of 6 belongs to medium- and long-term forecasting, so even the forecasts of the optimal forecasting model have a large standard deviation.

5. In terms of classification rate, the highest classification accuracy (model selection accuracy) of the proposed DA-TSI algorithm increases significantly compared with the traditional TSI algorithm. In terms of forecasting error MAPE, to show the advantage of the algorithm more intuitively, we simply average the errors of all DA-TSI and TSI algorithms combined with the different CNNs. At every step size, the error of the proposed algorithm is the lowest, which fully demonstrates its superiority.

6. To show the advantage of the algorithm more intuitively, we combine Tables 2, 3 and 4. When the step size is 1, the advantage of the proposed data augmentation imaging (DA-TSI) algorithm over the traditional time series imaging (TSI) algorithm is not obvious, and the maximum improvement of 3.6% is achieved on the DenseNet network. On the one hand, the DenseNet network is more … (Table 9).

7. RNN and BPNN are added to the comparative experiments. From Tables 10 and 11, RNN has a deeper and more complex network structure than BPNN and can better extract time series features for classification, so RNN achieves the best classification rate and forecasting error both before and after data augmentation. Comparison with TSI-CNN shows that although the classification rate of RNN is slightly higher than that of MTF-DenseNet, the main reason may be overfitting caused by the insufficient data volume. After data augmentation, DA-TSI-CNN outperforms DA-RNN in both classification rate and forecasting error MAPE, which more comprehensively proves the superiority of the proposed algorithm.
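The bijectivity argument in conclusion 1 can be illustrated concretely. Below is a minimal NumPy sketch (function names are ours, not from the paper) of the two Gramian angular fields: the series is standardized to [0, 1], mapped to polar angles, and the GASF/GADF are formed from angle sums and differences. Because the diagonal of the GASF equals cos(2φ_i) = 2x̃_i² − 1 and x̃_i ≥ 0 after [0, 1] standardization, the standardized series is exactly recoverable from the diagonal, which is what makes the GAF mapping bijective where the MTF is not.

```python
import numpy as np

def gasf(x):
    """Gramian Angular Summation Field of a series standardized to [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_tilde = (x - x.min()) / (x.max() - x.min())   # standardize to [0, 1]
    phi = np.arccos(x_tilde)                        # polar-coordinate angles
    return np.cos(phi[:, None] + phi[None, :])      # G[i, j] = cos(phi_i + phi_j)

def gadf(x):
    """Gramian Angular Difference Field of the same standardized series."""
    x = np.asarray(x, dtype=float)
    x_tilde = (x - x.min()) / (x.max() - x.min())
    phi = np.arccos(x_tilde)
    return np.sin(phi[:, None] - phi[None, :])      # G[i, j] = sin(phi_i - phi_j)

def recover_from_gasf(g):
    """Invert the GASF diagonal: cos(2*phi_i) = 2*x_tilde_i**2 - 1."""
    return np.sqrt((np.diag(g) + 1) / 2)
```

Running `recover_from_gasf(gasf(x))` returns the [0, 1]-standardized series exactly, whereas no comparable closed-form inverse exists for the MTF image.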

Conclusion and Future Work
This work investigated a meta-learning framework that uses CNN-based time series images for forecasting model selection, with the aim of linking problem-specific knowledge to well-performing forecasting methods and applying them in similar situations. In the improved meta-learning framework proposed in this paper, computer vision algorithms replace feature engineering: a convolutional neural network automatically extracts features from time series images, which reduces the manual workload. In addition, to deal with the overfitting problem of small datasets in deep convolutional networks, we propose the data augmentation imaging (DA-TSI) algorithm, which can effectively alleviate the overfitting caused by insufficient data in real applications. Applying the algorithm to the M3 datasets, the experimental results show that it can automatically extract time series features and has stronger advantages than the original time series image algorithms and machine learning algorithms.
In future work, we will continue to explore the functional relationship (e.g., via a Gaussian function) between the data augmentation sliding-window cut length (the multiple by which the data are increased) and time series classification, and will try other classification methods to enrich the meta-learning framework proposed in this paper.

Fig. 3 Data augmentation diagram when the slice window is 3
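The sliding-window augmentation illustrated in Fig. 3 can be sketched as follows. This is a minimal NumPy implementation under our own naming (the paper does not give code): one series is cut into overlapping sub-series of equal length, and each sub-series is imaged and used as an extra training sample, multiplying the effective size of a small dataset.

```python
import numpy as np

def sliding_window_augment(series, window, stride=1):
    """Cut one series into overlapping sub-series of length `window`.

    Each row of the result is a sub-series that can be imaged (GASF/GADF/MTF)
    and fed to the CNN as an additional training sample.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    if window > n:
        raise ValueError("window must not exceed the series length")
    starts = range(0, n - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])
```

With a series of length n, window w and stride 1 this yields n − w + 1 sub-series, so a slice window of 3 over a short series already multiplies the training images several times over.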

Fig. 7
Fig. 7 Basic building blocks in different CNN models.Left: a basic building block in a typical CNN model.Right: a basic building block in a residual network

Fig. 10 The average classification rate of different methods at different step sizes

Table 1 The Markov transition field (MTF) matrix

$$M = \begin{bmatrix}
\omega_{ij\,|\,x_1\in q_i,\,x_1\in q_j} & \omega_{ij\,|\,x_1\in q_i,\,x_2\in q_j} & \cdots & \omega_{ij\,|\,x_1\in q_i,\,x_n\in q_j}\\
\omega_{ij\,|\,x_2\in q_i,\,x_1\in q_j} & \omega_{ij\,|\,x_2\in q_i,\,x_2\in q_j} & \cdots & \omega_{ij\,|\,x_2\in q_i,\,x_n\in q_j}\\
\vdots & \vdots & \ddots & \vdots\\
\omega_{ij\,|\,x_n\in q_i,\,x_1\in q_j} & \omega_{ij\,|\,x_n\in q_i,\,x_2\in q_j} & \cdots & \omega_{ij\,|\,x_n\in q_i,\,x_n\in q_j}
\end{bmatrix}$$
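The MTF construction in Table 1 can be sketched in a few lines of NumPy (a simplified sketch under our own naming; bin count and quantile strategy are illustrative assumptions). Each point is assigned to a quantile bin, a first-order Markov transition matrix W between bins is estimated, and the field entry M[i, j] is the transition probability from the bin of x_i to the bin of x_j.

```python
import numpy as np

def mtf(x, n_bins=4):
    """Markov Transition Field of a 1-D series using quantile bins."""
    x = np.asarray(x, dtype=float)
    # Assign each point to one of n_bins quantile bins q_1..q_Q.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)                  # bin index of each x_t
    # Estimate the first-order Markov transition matrix W between bins.
    W = np.zeros((n_bins, n_bins))
    for t in range(len(x) - 1):
        W[bins[t], bins[t + 1]] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)  # row-normalize
    # Field entry M[i, j] = w_{q(x_i) -> q(x_j)}, as laid out in Table 1.
    return W[np.ix_(bins, bins)]
```

Because many distinct series can share the same bin sequence and transition matrix, this mapping loses the point values themselves, which is the non-bijectivity discussed in the result analysis.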

Table 2
Comparison of classification rates between DA-TSI algorithm and traditional image algorithms by three different CNNs when the step size h=1

Table 4
Comparison of classification rates between DA-TSI algorithm and traditional image algorithms by three different CNNs when the step size h=6

Table 5
The average value of the forecasting error MAPE of the six forecasting models on the test set under different step sizes

Experimental results show that the DA-TSI algorithm proposed in this paper is basically suitable for all deep learning models, different time series visualization methods and different step sizes, and can improve the classification rate of the original TSI algorithm.
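For reference, the MAPE reported in Tables 5–8 is the mean absolute percentage error. A minimal sketch of the standard definition (the paper does not print the formula in this section, so this is the conventional form):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)
```

Because MAPE averages percentage errors, a few badly forecast series from a wrongly selected model can dominate the average, which is exactly the effect discussed in conclusion 3 of the result analysis.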

Table 7
Comparison of test set forecasting error MAPE between the DA-TSI algorithm and traditional image algorithms by three different CNNs when the step size h=3. Bold values indicate the best experimental result in each experiment

Table 8
Comparison of test set forecasting error MAPE between DA-TSI algorithm and traditional image algorithms by three different CNNs when the step size h=6