A conformal predictive system for distribution regression with random features

Distribution regression is the regression case where the input objects are distributions. Many machine learning problems can be analyzed in this framework, such as multi-instance learning and learning from noisy data. This paper builds a conformal predictive system (CPS) for distribution regression, in which the prediction for a test input is a cumulative distribution function (CDF) of the corresponding test label. The CDF output by a CPS provides useful information about the test label, as it can estimate the probability of any event related to the label and can be transformed into prediction intervals and point predictions with the help of the corresponding quantiles. Furthermore, a CPS has the property of validity, as the prediction CDFs and the prediction intervals are statistically compatible with the realizations. This property is desired in many risk-sensitive applications, such as weather forecasting. To the best of our knowledge, this is the first work to extend the learning framework of CPS to distribution regression problems. We first embed the input distributions into a reproducing kernel Hilbert space using kernel mean embedding approximated by random Fourier features, and then build a fast CPS on top of the embeddings. While inheriting the property of validity from the learning framework of CPS, our algorithm is simple, easy to implement and fast. The proposed approach is tested on synthetic data sets and applied to the problem of statistical postprocessing of ensemble forecasts, which demonstrates its effectiveness for distribution regression problems.


Introduction
In distribution regression, the input objects are represented by probability distributions or empirical distributions instead of feature vectors. One example is multi-instance learning, whose input object is a bag of instances. The goal of distribution regression is to build a model from the input distributions to the related labels. Some distribution regression problems are risk-sensitive.
Two representative examples are medical diagnosis, where the patient's repeated measurements of medical conditions are considered as the input object, and statistical postprocessing of ensemble forecasts, whose input object is the set of forecasts of the same meteorological element from multiple members of different numerical weather prediction models. For these risk-sensitive applications, predicting the label of a test input is not enough, since a bare point prediction carries no useful information about the uncertainty and confidence of the prediction, and the probability distribution of the label is preferred. Also, it is desired that the prediction distributions of the regression model have the property of validity (Gneiting & Katzfuss 2014; Vovk et al. 2019; Vovk 2019), which means that the predictions are statistically compatible with the realizations (Vovk et al. 2018a).
Many algorithms developed in statistics and machine learning can output cumulative distribution functions (CDFs) for test labels. However, representative algorithms such as those built on Gaussian process regression or Bayesian regression are sensitive to their prior distribution assumptions about the applications. If the prior assumptions are wrong, the CDFs they output can be far from valid and cannot be trusted or used with confidence (Melluish et al. 2001; Balasubramanian et al. 2014). This issue can be tackled by the recently proposed pioneering works on conformal predictive systems (CPSs) (Vovk et al. 2019; Vovk 2019; Vovk et al. 2018a, b), which are proved to have the property of validity even in the small-sample case, under no assumption beyond that the data are independent and identically distributed (i.i.d.). With so little restriction on the data, CPSs are more useful and practical than Bayesian methods.
CPSs were first proposed by Vovk et al. (2019) to design systems that output valid CDFs for test labels. They are based on conformal prediction (Vovk et al. 2005; Balasubramanian et al. 2014), which can output valid prediction sets for labels and has been successfully applied to many applications where the prediction errors need to be controlled (Bosc et al. 2019; Cortés-Ciriano & Bender 2019; Nouretdinov et al. 2011; Papadopoulos et al. 2009; Laxhammar & Falkman 2011). The p values calculated using conformity scores are the essential elements for constructing the valid prediction sets of conformal prediction. Under the i.i.d. assumption about the data, the p values are uniformly distributed on [0,1]. This property is very useful, since it transforms the complex uncertainty in the data to a very familiar distribution, which can be used to do many interesting things such as constructing valid prediction intervals. CPSs make full use of the p values of conformal prediction and relate them to the CDFs of the test labels, which is why CPSs are valid even in the small-sample case. As CPSs are based on conformal prediction, the original approaches proposed in Vovk et al. (2019), Vovk (2019) and Vovk et al. (2018a) have a computational issue inherited from conformal prediction, which severely limits the applicability of CPSs to real-time applications. This issue can be tackled in two ways. The first is to design more computationally efficient variants, which led to the works in Vovk et al. (2018b, 2020a), where split conformal predictive systems (SCPSs) and cross-conformal predictive systems (CCPSs) were proposed. The second is to use a fast and well-performing underlying regression algorithm, which motivated our recently proposed CPS combining a Leave-One-Out CCPS with the regularized extreme learning machine (RELM) (Huang et al. 2011), named LOO-CCPS-RELM (Wang et al. 2020).
The property of validity of LOO-CCPS-RELM was proved theoretically in the asymptotic setting and verified empirically by experiments. A comparison study with other representative CPSs was also conducted, which confirmed the effectiveness of LOO-CCPS-RELM. As such, we employ LOO-CCPS-RELM to build our CPS for distribution regression in this paper.
To handle input distributions, we follow the works of Szabó et al. (2015) and Szabó et al. (2016) by embedding the input distributions into a reproducing kernel Hilbert space (RKHS) with kernel mean embedding. Different from the kernel methods for pattern analysis that map feature vectors to points in an RKHS (Shawe-Taylor & Cristianini 2004), kernel mean embedding maps input distributions to points in an RKHS, each of which is a new representation of the related distribution. Using these new representations as inputs and the corresponding labels as outputs, a regression algorithm can be trained to establish a regression model. Thus, the CPS we develop for distribution regression in this paper comprises the following two steps. First, the input distributions are embedded into an RKHS by kernel mean embedding. Second, LOO-CCPS-RELM is trained from the embeddings to the labels. Moreover, to make the learning process of our CPS fast, we use random Fourier features (Rahimi & Recht 2007, 2008a, 2008b) to approximate the kernel of the kernel mean embedding, which is inspired by the works in Jitkrittum et al. (2015) and Lopez-Paz et al. (2015).
The main contributions of this paper are twofold. First, to our knowledge, this is the first CPS built to tackle distribution regression problems. Our approach is simple, easy to implement and applicable to real-time applications, with the property of validity inherited from the learning framework of CPS, which is welcome in high-risk and real-time application areas. Second, the CPS is applied to statistical postprocessing of ensemble forecasts for temperature and precipitation, which is the first attempt at using a CPS to tackle statistical postprocessing problems. Besides, the experimental results in this paper confirm the promising performance of our approach for distribution regression, both for the prediction CDFs and the prediction intervals, compared with other widely-used Bayesian methods.
The rest of this paper is organized as follows. Section 2 reviews kernel mean embedding of empirical distribution, regularized extreme learning machine and computationally efficient conformal predictive systems. Section 3 introduces our proposed algorithm. In Sect. 4, experiments are conducted, which includes the empirical study on synthetic data sets and applying our algorithm to statistical postprocessing of ensemble forecasts. The conclusions in this paper are drawn in Sect. 5.

Literature review
We focus on the setting where the input distributions are empirical distributions, as it is more common in real applications to observe samples than the probability measures themselves (Póczos et al. 2013; Szabó et al. 2015). Thus, this section first reviews kernel mean embedding of empirical distributions and its approximation with random Fourier features, which was proposed in Jitkrittum et al. (2015) and Lopez-Paz et al. (2015). Then, we review SCPS and CCPS, which are computationally efficient versions of CPSs.
Some notation is needed throughout this paper. Let the training set be denoted by $z_l = \{(x_i, y_i),\, i = 1, \ldots, l\}$, where $x_i = \{x_{i,1}, \ldots, x_{i,N_i}\}$ is the empirical distribution of the $i$th input object, i.e., the $x_{i,j}$s are the samples of $x_i$, and $y_i$ is the corresponding label. All samples are assumed to be taken from the sample space $X \subseteq \mathbb{R}^n$ and the labels are real numbers. Here is an example of a medical application in this setting: $x_i$ can be the repeated blood tests of the $i$th patient, the sample $x_{i,j}$ represents the $n$-dimensional feature vector from the $j$th blood test, and $y_i$ is a health indicator of the $i$th patient related to $x_i$. Another example is related to statistical postprocessing of ensemble forecasts for precipitation. In this case, $x_i$ is the set of precipitation forecasts from the $N_i$ members of ensemble numerical models, with $y_i$ being the corresponding observed precipitation.
For a test input object $x_0$, the goal of a CPS is to construct a CDF on $y \in \mathbb{R}$ that describes the uncertainty of the corresponding label $y_0$.

Kernel mean embedding of empirical distribution and its approximation with random Fourier features
Suppose that $k : X \times X \to \mathbb{R}$ is a positive definite kernel and the RKHS associated with $k$ is denoted by $H_k$. Referring to Muandet et al. (2017), the kernel mean embedding $\mu_i$ of the empirical distribution $x_i$ into the space $H_k$ can be formularized as

$$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} k(x_{i,j}, \cdot). \tag{1}$$

If $k$ is a characteristic kernel such as the Gaussian kernel (Muandet et al. 2017), the embedding of formula (1) captures all of the information of the distribution of $x_i$, which is why kernel mean embedding is a popular way to deal with distributional data. In general, distribution regression based on formula (1) needs to resort to the dual optimization of learning with kernels. This is time consuming, since the complexity of constructing the kernel matrices is at least $O(l^2)$ (Lopez-Paz et al. 2015). To address this issue, an alternative is to approximate $\mu_i$ with a finite representation using random Fourier features (Rahimi & Recht 2007; Jitkrittum et al. 2015; Lopez-Paz et al. 2015). The approximation explicitly maps each sample to a $D$-dimensional Euclidean space by a randomized feature map $\phi : X \to \mathbb{R}^D$, which satisfies

$$k(x, x') \approx \phi(x)^T \phi(x'). \tag{2}$$

Take the Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ as an example, where $\gamma$ is the kernel parameter. With the Gaussian kernel, it follows that the map can be taken as

$$\phi(x) = \sqrt{2/D}\, \big[\cos(w_1^T x + b_1), \ldots, \cos(w_D^T x + b_D)\big]^T,$$

where each component of $w_i$ in the above formula is randomly drawn from the univariate Gaussian distribution with mean $0$ and variance $2\gamma$, and $b_i$ is drawn uniformly from the interval $[-\pi, \pi]$. Therefore, the mean embedding $\mu_i$ in formula (1) can be approximated by $\hat\mu_i$ with the help of the randomized map $\phi$, where

$$\hat\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \phi(x_{i,j}). \tag{3}$$

For shift-invariant kernels, the work in Lopez-Paz et al. (2015) analyzed the approximation error of $\hat\mu_i$. Throughout this paper, we use this approximation to design CPSs for distribution regression. Since the original distribution $x_i$ is transformed to a $D$-dimensional feature vector by the approximation (3), it is easy to apply any regression algorithm, including a CPS, to the data set $\hat z_l = \{(\hat\mu_i, y_i),\, i = 1, \ldots, l\}$ for distribution regression problems. With this idea in mind, the CPSs for distribution regression in this paper are built on $\hat z_l$. Thus, for the remaining parts of Sect. 2, we use the $\hat\mu_i$s as the input objects of CPSs.
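As a concrete illustration, the approximate embedding of formula (3) can be sketched in a few lines of NumPy. This is a minimal sketch under the Gaussian-kernel setting described above; the function name `rff_mean_embedding` and its argument names are ours, not from the original implementation.

```python
import numpy as np

def rff_mean_embedding(bags, D=1000, gamma=1.0, seed=0):
    """Approximate kernel mean embedding (formula (3)) of each bag of samples
    with D random Fourier features of the Gaussian kernel
    k(x, x') = exp(-gamma * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    n = bags[0].shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(n, D))  # w ~ N(0, 2*gamma)
    b = rng.uniform(-np.pi, np.pi, size=D)                  # b ~ Uniform[-pi, pi]
    embeddings = []
    for X in bags:                      # X: (N_i, n) samples of one distribution
        Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)  # randomized feature map phi
        embeddings.append(Phi.mean(axis=0))         # average phi over the bag
    return np.vstack(embeddings)        # shape (l, D): one mu_hat_i per bag
```

Each row of the returned matrix is one $\hat\mu_i$, so any vector-input regression algorithm can be applied directly to the embedded data set.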

Regularized extreme learning machine
RELM is a single-hidden-layer feedforward neural network with randomly chosen parameters of the hidden nodes (Huang et al. 2011), which can be written as

$$f(\hat\mu) = \sum_{i=1}^{L} \beta_i\, h(\hat\mu; \theta_i),$$

where $L$ denotes the number of hidden nodes, $\hat\mu \in \mathbb{R}^D$ is the input feature vector and $\beta \in \mathbb{R}^L$ is the vector of output weights learned from the training data. Here $h(\hat\mu; \theta_i)$ denotes the $i$th activation function, whose parameters are $\theta_i \in \mathbb{R}^{D+1}$.
With the training set $\hat z_l$, a chosen activation function $h(\hat\mu; \theta)$ and a fixed number $L$, the $l \times L$ matrix $H$ of RELM is first calculated as

$$H = \begin{bmatrix} h(\hat\mu_1; \theta_1) & \cdots & h(\hat\mu_1; \theta_L) \\ \vdots & & \vdots \\ h(\hat\mu_l; \theta_1) & \cdots & h(\hat\mu_l; \theta_L) \end{bmatrix},$$

where $\{\theta_1, \ldots, \theta_L\}$ are randomly drawn from a probability distribution on $\mathbb{R}^{D+1}$.
After that, RELM aims at minimizing the following optimization problem:

$$\min_{\beta}\; \frac{1}{l}\,\|H\beta - y\|^2 + \lambda\,\|\beta\|^2,$$

where $y = [y_1, y_2, \ldots, y_l]^T$ is the column vector of the labels and $\lambda > 0$ is the regularization parameter.
Then, the regressor output by the RELM algorithm can be expressed as

$$\hat f(\hat\mu) = h(\hat\mu)^T \hat\beta, \qquad \hat\beta = \big(H^T H + \lambda l\, I_{L \times L}\big)^{-1} H^T y, \tag{4}$$

where $I_{L \times L}$ is the $L \times L$ identity matrix. RELM can be trained fast, and the leave-one-out predictions of RELM on the training set, the $\hat f_i$s, can also be calculated fast by the following formula (Wang et al. 2018):

$$y_i - \hat f_i = \frac{y_i - \hat f(\hat\mu_i)}{1 - hat_{ii}}, \tag{5}$$

where $hat_{ii}$ is the $i$th diagonal entry of the hat matrix $H \big(H^T H + \lambda l\, I_{L \times L}\big)^{-1} H^T$. This is the reason for using RELM as the underlying algorithm of LOO-CCPS-RELM.
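Formulas (4) and (5) admit a compact NumPy sketch. The helper name `relm_fit_loo` is ours, and the argument `lam` stands for the product $\lambda l$ treated as a single quantity:

```python
import numpy as np

def relm_fit_loo(H, y, lam):
    """Ridge solution of formula (4), beta = (H^T H + lam I)^{-1} H^T y, plus
    the fast leave-one-out residuals y_i - fhat_i of formula (5)."""
    L = H.shape[1]
    G = H.T @ H + lam * np.eye(L)
    beta = np.linalg.solve(G, H.T @ y)
    # hat_ii: diagonal of the hat matrix H (H^T H + lam I)^{-1} H^T
    hat_diag = np.einsum('ij,ji->i', H, np.linalg.solve(G, H.T))
    loo_resid = (y - H @ beta) / (1.0 - hat_diag)
    return beta, loo_resid
```

The residuals agree exactly with retraining the ridge regressor $l$ times with one example held out each time, which is what makes the leave-one-out construction below affordable.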

Computationally efficient conformal predictive systems
As the original learning framework of CPS proposed in Vovk et al. (2019) has a computational issue inherited from conformal prediction, two variants were proposed in Vovk et al. (2018b, 2020a) to speed up the learning process: SCPS and CCPS. As we also focus on fast CPSs in this paper, we only introduce SCPS and CCPS in this section.

Split conformal predictive system
Just like conformal prediction, every CPS needs a conformity measure $A(S, \hat z)$ to calculate the conformity scores of observations. Here $A(S, \hat z)$ is a function of a data set $S$ and an observation $\hat z$ that measures the degree of agreement between the observations in $S$ and the datum $\hat z$. However, different from conformal prediction, the conformity measure of a CPS must satisfy conditions beyond measurability. As stated in Vovk et al. (2018b), in the context of SCPSs, the conformity measure $A(S, \hat z)$ must be a balanced isotonic function. Generally, with an underlying regression algorithm $r$, a balanced isotonic function $A(S, \hat z)$ can be defined as

$$A(S, \hat z) = y - \hat r(\hat\mu), \tag{6}$$

where $\hat r$ is the regression model learned from $S$ using the regression algorithm $r$ and $\hat z = (\hat\mu, y)$. This design of conformity measure was used in Vovk et al. (2018a, b) to build CPSs, and we also use it as the conformity measure of our CPSs in this paper.
Suppose the training set $\hat z_l$ is split into two parts: the proper training set $\hat z_m = \{(\hat\mu_j, y_j),\, j = 1, 2, \ldots, m\}$ and the calibration set $\{(\hat\mu_j, y_j),\, j = m+1, \ldots, l\}$.
For each possible label $y \in \mathbb{R}$, the SCPS calculates $l - m + 1$ conformity scores,

$$\alpha_i = A\big(\hat z_m, (\hat\mu_i, y_i)\big), \quad i = m+1, m+2, \ldots, l, \qquad \alpha_0^y = A\big(\hat z_m, (\hat\mu_0, y)\big),$$

where $\hat\mu_0$ is a test input and $y$ is a postulated label for $\hat\mu_0$. Then the function $Q$ can be obtained as (Vovk et al. 2018b)

$$Q_t(y) = \frac{\big|\{i : \alpha_i < \alpha_0^y\}\big| + t\,\big(\big|\{i : \alpha_i = \alpha_0^y\}\big| + 1\big)}{l - m + 1},$$

where $t$ is drawn from the uniform distribution on $[0, 1]$. The theory in Vovk et al. (2018b) shows that the function $Q_t(y)$ above is a randomized predictive system, which theoretically has the property of validity. Referring to Vovk et al. (2018b, 2020b), if $A(S, \hat z)$ is a strictly increasing continuous function of $y$, then $C_i$ can be defined by the condition $A(\hat z_m, (\hat\mu_0, C_i)) = \alpha_i$. In the case where formula (6) is used as the conformity measure, we have

$$C_i = \hat r(\hat\mu_0) + \big(y_i - \hat r(\hat\mu_i)\big), \quad i = m+1, \ldots, l.$$

By sorting the $C_i$s in increasing order we have $C_{(1)} \le \cdots \le C_{(l-m)}$, and setting $C_{(0)} = -\infty$ and $C_{(l-m+1)} = +\infty$, the function $Q$ can be written as the following formula (Vovk et al. 2018b):

$$Q_t(y) = \frac{i + t}{l - m + 1} \quad \text{if } y \in \big(C_{(i)}, C_{(i+1)}\big) \text{ for } i \in \{0, \ldots, l-m\}. \tag{11}$$

Formula (11) cannot be used directly to represent the CDF of $y_0$, as it depends on $t$. To remove the impact of $t$, it is recommended in Vovk et al. (2020a) to use a modification of formula (11) in applications, such as

$$Q(y) = \begin{cases} \dfrac{i}{l - m} & \text{if } y \in \big(C_{(i)}, C_{(i+1)}\big) \text{ for } i \in \{0, \ldots, l-m\}, \\[4pt] \dfrac{i}{l - m} & \text{if } y = C_{(i)} \text{ and } y \ne C_{(i+1)} \text{ for } i \in \{1, \ldots, l-m\}. \end{cases} \tag{12}$$

That is, formula (12) is used as the CDF predicted for $y_0$ in this paper; it is the empirical cumulative distribution function of $\{C_{(i)},\, i = 1, \ldots, l-m\}$. From the approach above, we can see that an SCPS only uses the calibration set to obtain the $l - m$ conformity scores, which is less informationally efficient than using the full training data. This is the very reason why CCPSs were also proposed.
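Because the $t$-free CDF of formula (12) is simply the empirical CDF of the scores $C_i$, an SCPS prediction can be sketched very compactly. The function name `scps_cdf` is ours; the sketch handles ties by right-continuity and ignores the refinement at points $y = C_{(i)}$:

```python
import numpy as np

def scps_cdf(calib_pred, calib_y, test_pred):
    """SCPS with conformity measure A(S, z) = y - rhat(mu): the predictive
    CDF of the test label is the empirical CDF of the calibration scores
    C_i = rhat(mu_0) + (y_i - rhat(mu_i))."""
    C = np.sort(test_pred + (calib_y - calib_pred))
    # right-continuous empirical CDF of the C_i, as in formula (12)
    def Q(y):
        return np.searchsorted(C, y, side='right') / len(C)
    return Q
```

Here `calib_pred` and `calib_y` are the model predictions and labels on the calibration set, and `test_pred` is the point prediction for the test input.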

Cross-conformal predictive system
A CCPS borrows the idea of cross-validation and first partitions the training set into $k$ non-empty folds. Let $o_i$ denote the set of the ordinals in the $i$th fold and $\hat z_{l \setminus o_i}$ denote the training data set leaving the $i$th fold out. Then, for each $i \in \{1, \ldots, k\}$, an SCPS with conformity measure $A(S, \hat z)$ is used to obtain the conformity scores, with $\hat z_{l \setminus o_i}$ being the proper training set and $\{\hat z_j,\, j \in o_i\}$ being the calibration set. Thus, the related conformity scores can be written as

$$\alpha_{j,i} = A\big(\hat z_{l \setminus o_i}, \hat z_j\big), \qquad \alpha_{0,i}^y = A\big(\hat z_{l \setminus o_i}, (\hat\mu_0, y)\big), \quad j \in o_i.$$

Finally, the function $Q_t(y)$ of the CCPS is obtained as

$$Q_t(y) = \frac{\sum_{i=1}^{k} \Big( \big|\{j \in o_i : \alpha_{j,i} < \alpha_{0,i}^y\}\big| + t\,\big|\{j \in o_i : \alpha_{j,i} = \alpha_{0,i}^y\}\big| \Big) + t}{l + 1}.$$

Just like for SCPSs, if $A(S, \hat z)$ is a strictly increasing continuous function of $y$, then $C_{j,i}$ can be defined by the condition $A(\hat z_{l \setminus o_i}, (\hat\mu_0, C_{j,i})) = \alpha_{j,i}$. Let $C_{(1)} \le \cdots \le C_{(l)}$ be all the $C_{j,i}$s sorted in increasing order, with $C_{(0)} = -\infty$ and $C_{(l+1)} = +\infty$. Then the function $Q_t(y)$ of the CCPS is determined by the following formula (Vovk et al. 2018b):

$$Q_t(y) = \frac{i + t}{l + 1} \quad \text{if } y \in \big(C_{(i)}, C_{(i+1)}\big) \text{ for } i \in \{0, \ldots, l\}. \tag{14}$$

To remove the impact of $t$ in formula (14), it is recommended in Vovk et al. (2020a) to use a modification of formula (14) as follows:

$$Q(y) = \frac{i}{l} \quad \text{if } y \in \big(C_{(i)}, C_{(i+1)}\big) \text{ for } i \in \{0, \ldots, l\}. \tag{15}$$

That is, formula (15) is used as the CDF predicted for $y_0$; it represents the empirical cumulative distribution function of $\{C_{(i)},\, i = 1, \ldots, l\}$. The Leave-One-Out CCPS is obtained by setting $k = l$.
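The pooling of scores across folds can be sketched as follows. This is a minimal sketch under conformity measure (6); `fit` and `predict` are placeholders for any underlying regression algorithm, and the function name `ccps_scores` is ours:

```python
import numpy as np

def ccps_scores(X, y, x0, fit, predict, k=5):
    """Pool the C_{j,i} scores of a CCPS: each fold serves once as the
    calibration set for a model trained on the remaining folds; the
    predictive CDF of the test label is the empirical CDF of the pool."""
    folds = np.array_split(np.arange(len(y)), k)
    C = []
    for idx in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[idx] = False
        model = fit(X[mask], y[mask])         # train leaving fold i out
        p0 = predict(model, x0[None, :])[0]   # point prediction for test input
        C.extend(p0 + (y[idx] - predict(model, X[idx])))
    return np.sort(np.asarray(C))             # support of the empirical CDF (15)
```

With $k = l$ the same code yields the Leave-One-Out CCPS, at the price of $l$ retrainings unless a leave-one-out shortcut such as formula (5) is available.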

LOO-CCPS-RELM
With $A(S, \hat z)$ as in formula (6) and RELM as the underlying regression algorithm $r$, we only obtain a Leave-One-Out CCPS with RELM, which has to train RELM $l$ times to calculate the $C_i$s. A further step towards LOO-CCPS-RELM is to utilize a slight modification of the Leave-One-Out CCPS with RELM, based on the observation that RELM is a uniformly stable algorithm (Bousquet & Elisseeff 2002), whose output regressors are similar to each other before and after one training datum is removed. With this modification, RELM needs to be trained only once in LOO-CCPS-RELM; the details are summarized in Algorithm 1, where we treat $\lambda l$ in formula (4) as a single meta-parameter.
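Under this modification, the $l$-fold retraining collapses to one RELM fit plus the leave-one-out shortcut of formula (5). A minimal sketch (our naming, not a transcription of the paper's Algorithm 1; `H` is the hidden-layer matrix, `h0` the hidden-layer features of the test embedding, `lam` the meta-parameter $\lambda l$):

```python
import numpy as np

def loo_ccps_relm_cdf(H, y, h0, lam):
    """Sketch of LOO-CCPS-RELM with a single RELM fit: the fast leave-one-out
    residuals of formula (5) replace the l retrainings, and the predictive
    CDF is the empirical CDF of C_i = fhat(mu_0) + (y_i - fhat_i)."""
    l, L = H.shape
    G = H.T @ H + lam * np.eye(L)
    beta = np.linalg.solve(G, H.T @ y)                # formula (4)
    hat_diag = np.einsum('ij,ji->i', H, np.linalg.solve(G, H.T))
    loo_resid = (y - H @ beta) / (1.0 - hat_diag)     # y_i - fhat_i, formula (5)
    C = np.sort(h0 @ beta + loo_resid)                # support points of the CDF
    return lambda t: np.searchsorted(C, t, side='right') / l
```

The returned function is a right-continuous step CDF in the spirit of formula (15), obtained from one matrix factorization rather than $l$ model fits.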
The theoretical analysis in the asymptotic setting and empirical explorations of LOO-CCPS-RELM can be found from our previous work in Wang et al. (2020).

The proposed conformal predictive system for distribution regression
Recall that the training set is $z_l$ and our purpose is to build a CPS that outputs a prediction CDF for the test input $x_0$. After the review in Sect. 2, we can formulate our proposed CPS in the following two steps. First, the inputs $x_i$s are mapped to the $\hat\mu_i$s by the approximated kernel mean embedding of formula (3). Then, LOO-CCPS-RELM is applied to the embedded data set $\hat z_l$ to obtain the function $Q$ for predicting the CDF of the label of $x_0$. We use random Fourier features to approximate the kernel of the kernel mean embedding. Also, as in Wang et al. (2020), we use random Fourier features as the activation functions $h(\hat\mu; \theta_i)$ to build RELM. Thus, the whole learning process of our proposed CPS is very fast and can be applied to real-time applications. Algorithm 2 summarizes our proposed CPS for distribution regression with random Fourier features.
Here is an analysis of the complexity of Algorithm 2. As the training process of Algorithm 2 consists of training RELM on the data set $\hat z_l$ and calculating the $y_i - \hat f_i$s with formula (5), for fixed meta-parameters $D$, $L$ and $\lambda$, the complexity of the training process is $O(l)$, which is the same as that of LOO-CCPS-RELM (Wang et al. 2020). Hence, the training of Algorithm 2 can be fast even when $l$ is very large. Also, for fixed meta-parameters, the testing process of Algorithm 2 consists of embedding $x_0$ to $\hat\mu_0$ and calculating $\hat f(\hat\mu_0)$ as in Algorithm 1, whose computational complexity is only $O(1)$, since the number of multiplications does not depend on $l$. Therefore, the computational complexity of Algorithm 2 is low and the algorithm can properly be applied to real-time applications.
As Algorithm 2 outputs a prediction CDF for the test input $x_0$, it is easy to derive a prediction interval from the CDF via the corresponding quantiles. For example, the prediction interval with expected coverage rate $1 - \eta$ can be represented by

$$\big[\, q_{\eta/2}(x_0),\; q_{1-\eta/2}(x_0) \,\big],$$

where $q_{\eta/2}(x_0)$ and $q_{1-\eta/2}(x_0)$ are the $\eta/2$ and $1-\eta/2$ quantiles of $Q_{x_0}(y)$, respectively. Thus, Algorithm 2 can also produce prediction intervals for the test inputs.
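Since the predicted CDF is an empirical CDF over a finite set of support points, reading off the two quantiles is a one-liner. The helper name `interval_from_cdf` is ours:

```python
import numpy as np

def interval_from_cdf(C, eta):
    """Prediction interval with expected coverage 1 - eta, read off the
    sorted support points C of the predictive empirical CDF."""
    return np.quantile(C, eta / 2.0), np.quantile(C, 1.0 - eta / 2.0)
```

For example, `interval_from_cdf(C, 0.1)` returns the central 90% prediction interval for the test label.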
Although Algorithm 2 is very simple and easy to implement, it has the property of validity inherited from CPSs, and the error rate of the prediction intervals derived from the prediction CDFs can be controlled by the significance level $\eta$; both claims are verified empirically in the next section. Referring to Vovk et al. (2019), Algorithm 2 having the property of validity means that the prediction CDFs are compatible with the realizations, which can be verified empirically by the fact that the frequency of the events $Q(y_{0,i}) \le \alpha$ for $i \in \{1, \ldots, n\}$ is near $\alpha$, i.e.,

$$\frac{1}{n}\,\Big|\big\{i \in \{1, \ldots, n\} : Q(y_{0,i}) \le \alpha\big\}\Big| \approx \alpha \quad \text{for all } \alpha \in (0, 1), \tag{16}$$

where $n$ is the number of test examples and $y_{0,i}$ is the label of the $i$th test example.

Experimental result and analysis
Also, another important property we care about is whether the error rate of the prediction intervals derived from the CDFs of Algorithm 2 can be controlled by the significance level $\eta$, which can be verified by checking that the frequency of the $y_{0,i}$s falling outside the prediction intervals $[q_{\eta/2}(x_{0,i}),\, q_{1-\eta/2}(x_{0,i})]$ is near or less than $\eta$.
In this section, we first explore Algorithm 2 on synthetic data sets to test whether it has the property of validity in the sense of formula (16) and whether the error rate of its prediction intervals can be controlled. After that, we build probabilistic forecast models using Algorithm 2 to forecast temperature and precipitation based on ensemble forecasts of numerical models and compare them with other widely-used Bayesian models.
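The validity check of formula (16) amounts to verifying that the probability integral transform (PIT) values $Q(y_{0,i})$ are approximately uniform on $[0, 1]$. A minimal sketch of the check (the helper name `calibration_curve` is ours):

```python
import numpy as np

def calibration_curve(pit_values, alphas):
    """Empirical validity check in the sense of formula (16): for each alpha,
    the frequency of Q(y_0i) <= alpha over the test set should be near alpha,
    i.e., the PIT values Q(y_0i) should be approximately Uniform[0, 1]."""
    pit = np.asarray(pit_values, dtype=float)
    return np.array([(pit <= a).mean() for a in alphas])
```

Plotting the returned frequencies against the $\alpha$s gives the diagonal-comparison curves used in the figures below: a valid system produces a curve close to the diagonal.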
For all of the following experiments, since we use random features and the Law of Large Numbers to approximate the underlying transformation made by the Gaussian kernel, we set $D = 1000$, referring to the previous study in Lopez-Paz et al. (2015); we also observed no significant improvement with more random features in our application cases. Following our previous work on LOO-CCPS-RELM, we set $L = 1000$ and select the meta-parameter $\lambda l$ from $\{10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}\}$ by the least leave-one-out squared error on the training set. All features and labels of the data sets in this section were first normalized to [0,1] with min-max normalization before being fed to the algorithms. The experimental results were collected with ten-fold cross-validation, i.e., the algorithms were trained on nine folds and tested on the remaining fold, repeated ten times, to obtain the testing results. Algorithm 2 was implemented in the Python language (Van Rossum & Drake 1995) with its NumPy library (Oliphant 2006).

Explorations on synthetic data sets
Synthetic data sets were generated to demonstrate the applicability of Algorithm 2. Referring to Póczos et al. (2013), each input distribution follows a beta(a, 3) distribution with 500 sample points, where the parameter $a$ is uniformly chosen from the interval [3, 20], and the corresponding label is the skewness of the distribution plus a noise variable $\epsilon$, which can be written as

$$y_i = \mathrm{skew}\big(\mathrm{beta}(a_i, 3)\big) + \epsilon. \tag{17}$$

We collected six data sets in the above way, each containing 1000 examples. The first data set is denoted by beta_0.00 with $\epsilon = 0$, and the other five data sets are denoted by beta_0.04, beta_0.08, beta_0.12, beta_0.16 and beta_0.20, whose $\epsilon$s are uniformly drawn from [-0.04, 0.04], [-0.08, 0.08], [-0.12, 0.12], [-0.16, 0.16] and [-0.20, 0.20], respectively. We first check whether Algorithm 2 has the property of validity in the sense of formula (16). Algorithm 2 was trained and tested on the six data sets individually and the experimental results are shown in Fig. 1, which records the relation between the frequency of $Q(y_{0,i}) \le \alpha$ and $\alpha$. The curves are all close to the diagonal, which means that Algorithm 2 is valid and insensitive to the variance of the noise variable. Then, the error rates of the prediction intervals of Algorithm 2 were calculated on all of the data sets. Figure 2 shows that the error rates can be controlled by $\eta$ for all of the data sets, which demonstrates that the prediction intervals output by Algorithm 2 are reliable and robust to the variance of the noise variable. Figure 3 shows the average interval sizes of Algorithm 2 for different $\eta$s and different data sets. It can be seen that the average interval size is related to $\eta$ and to the variance of the noise variable. As the prediction error rates are robust to the variance of the noise variable, a larger noise variance leads to a larger average interval size. Also, as the error rate is controlled by $\eta$, a larger $\eta$ leads to a smaller average interval size.
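The data-generating process of formula (17) can be sketched as follows. This is our reconstruction of the setup described above (the function name and the closed-form beta-skewness expression are ours; the noise interval half-width is passed as `noise`):

```python
import numpy as np

def make_beta_bags(n_sets=1000, n_samples=500, noise=0.0, seed=0):
    """Synthetic distribution-regression data: each input is n_samples draws
    from beta(a, 3) with a ~ Uniform[3, 20]; the label is the skewness of
    beta(a, 3) plus uniform noise on [-noise, noise], as in formula (17)."""
    rng = np.random.default_rng(seed)
    bags, labels = [], []
    for _ in range(n_sets):
        a, b = rng.uniform(3.0, 20.0), 3.0
        bags.append(rng.beta(a, b, size=(n_samples, 1)))
        # closed-form skewness of the beta(a, b) distribution
        skew = 2.0 * (b - a) * np.sqrt(a + b + 1.0) / ((a + b + 2.0) * np.sqrt(a * b))
        eps = rng.uniform(-noise, noise) if noise > 0 else 0.0
        labels.append(skew + eps)
    return bags, np.asarray(labels)
```

Setting `noise` to 0, 0.04, ..., 0.20 reproduces the six data sets beta_0.00 through beta_0.20.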
As $\eta$ increases, the average interval size is forced to shrink, increasing the error rate, which shows the strict control over the error rate by tuning $\eta$. Thus, from Fig. 2 and Fig. 3, we can see that there is a balance between the error rate and the informational efficiency of the prediction intervals. In high-risk applications, where informational efficiency must be sacrificed to make sure that the error rate of the interval predictions is low enough, one can choose $\eta$ from $[0.05, 0.2]$ to balance the error rate and the informational efficiency.

Fig. 1 Tests of the validity of Algorithm 2 on the six data sets. The curves are all close to the diagonal, which indicates that Algorithm 2 is valid in the sense of formula (16) on all six data sets.

Fig. 2 The relation curves of $\eta$ and the error rates of the prediction intervals derived from the CDFs output by Algorithm 2. All curves are close to the diagonal, which demonstrates that the error rates can be controlled for all six data sets.

The proposed algorithm for statistical postprocessing of ensemble forecasts
Ensemble forecasts from numerical models for probabilistic weather forecasting are not valid, since they tend to be biased and underdispersed. This has inspired much research on statistical postprocessing of ensemble forecasts. The task is to build a regression model whose inputs are ensemble forecasts of some meteorological element and whose labels are the corresponding observations. This is a typical distribution regression problem, since the inputs can be seen as estimated empirical distributions of the meteorological element.
In this section, we apply Algorithm 2 to the task of postprocessing ensemble forecasts for temperature and precipitation. The two data sets are both from the ensemblepp package (Messner 2017) of the R language (R Core Team 2018). For postprocessing ensemble forecasts for temperature, we use the temp data set, which contains 18-30 h minimum temperature ensemble forecasts and observations at Innsbruck. For precipitation, we use the rain data set, which contains accumulated 18-30 h precipitation ensemble forecasts and observations, also at Innsbruck. Both data sets have 2749 samples, covering the period from 2000-01-02 to 2016-01-01. As the distribution of precipitation is highly skewed, it is common in the literature to transform precipitation data before applying postprocessing methods. Referring to Vannitsem et al. (2018), we apply the square root to all forecasts and observations of the precipitation data prior to min-max normalization. As the data were collected in chronological order, they were partitioned into ten sequential parts for the ten-fold cross-validation experiments.
Two kinds of widely-used postprocessing algorithms are compared with Algorithm 2. The first kind is Bayesian Model Averaging (BMA) and the second kind is Ensemble Model Output Statistics (EMOS). For temperature, as the assumed distribution of temperature is the normal distribution, the corresponding algorithms are named BMA-n (Raftery et al. 2005) and EMOS-n. For precipitation, BMA-g (Sloughter et al. 2007), EMOS-csg (Scheuerer & Hamill 2015) and EMOS-gev (Scheuerer 2014) are used as comparative algorithms, where the assumed distributions of precipitation are the gamma distribution, the censored and shifted gamma distribution, and the censored generalized extreme value distribution, respectively. The packages ensembleBMA and ensembleMOS (Yuen et al. 2018) of the R language were used to build the BMA-based and EMOS-based algorithms, respectively.
For temperature, BMA-n, EMOS-n and Algorithm 2 were tested on the temp data set. First, we tested whether the algorithms are valid in the sense of formula (16). We varied $\alpha$ from 0 to 1 to see if the frequency of $Q(y_{0,i}) \le \alpha$ is near $\alpha$. Figure 4 shows the curves obtained by BMA-n, EMOS-n and Algorithm 2. It can be seen that Algorithm 2 is valid, as its curve is very close to the diagonal, while the other two algorithms are not valid at all. This is reasonable, since Algorithm 2 inherits the property of validity from CPSs. Second, the error rates of the prediction intervals output by the three algorithms were calculated. We also varied $\eta$ from 0 to 1 to see whether the error rates are controlled by $\eta$, i.e., whether the frequency of the real labels falling outside the prediction intervals is near or under $\eta$ for each $\eta \in (0, 1)$. Figure 5 shows the error rates of the prediction intervals. From Fig. 5, we can see that the error rates of the prediction intervals of Algorithm 2 are controlled by $\eta$, while those of the other algorithms are not. Thus, Fig. 4 and Fig. 5 confirm that Algorithm 2 is compatible with the realizations while BMA-n and EMOS-n are not. Also, the average interval sizes were obtained to see whether the intervals predicted by Algorithm 2 are informationally efficient. Figure 6 shows how the average interval sizes change as $\eta$ varies from 0 to 1 for the different algorithms. It is shown that the average interval size of EMOS-n is the lowest for every $\eta$.

Fig. 3 The relation curves of $\eta$ and the average interval sizes. A larger variance of the labels leads to a larger average interval size, as the error rates are strictly controlled by $\eta$.

Fig. 4 Tests of whether Algorithm 2, BMA-n and EMOS-n are valid for probabilistic temperature forecasting. The curve of Algorithm 2 is close to the diagonal while those of the other two algorithms are not, which means that Algorithm 2 is valid and BMA-n and EMOS-n are not valid in the sense of formula (16).
For $\eta \in [0.1, 1)$, the average interval size of Algorithm 2 is lower than that of BMA-n, while for $\eta \in (0, 0.1)$, the average interval size of Algorithm 2 is nearly the same as that of BMA-n. Since we focus on high-risk applications, we care more about the performance for $\eta \in [0.05, 0.2]$. As the error rates of the prediction intervals of EMOS-n cannot be controlled for $\eta \in [0.05, 0.2]$, its prediction intervals in this range are meaningless. We can therefore conclude that the prediction intervals of Algorithm 2 are informationally efficient, since their average size is shorter than that of BMA-n for $\eta \in [0.05, 0.2]$. We summarize the experimental results in Table 1 to make them more concise. From Table 1, we can conclude that the CDFs output by Algorithm 2 are compatible with the realizations, as Algorithm 2 is valid in the sense of formula (16) and the error rates of its prediction intervals can be controlled. Besides, the prediction intervals of Algorithm 2 are informationally efficient.
For precipitation, similar experiments were conducted with the rain data set to compare Algorithm 2 with BMA-g, EMOS-csg and EMOS-gev. As precipitation is always nonnegative, we force the negative samples in the empirical distribution predicted by Algorithm 2 to be 0. The meanings of Fig. 7, Fig. 8 and Fig. 9 are similar to those of Fig. 4, Fig. 5 and Fig. 6, respectively. From Fig. 7, it can be seen that Algorithm 2 is valid while the other three algorithms are not. Figure 8 shows that the error rates of the prediction intervals of all four algorithms are controlled by g, which indicates that the prediction intervals of all four algorithms are reliable. Finally, Fig. 9 demonstrates that the average interval sizes of Algorithm 2 are shorter than those of BMA-g and EMOS-gev for g ∈ (0, 1) and similar to those of EMOS-csg for g ∈ (0, 0.2), which confirms that the prediction intervals output by Algorithm 2 are informationally efficient for forecasting precipitation. The experimental results are summarized in Table 2, from which we conclude that the CDF output by Algorithm 2 is compatible with the realizations and that its prediction intervals are informationally efficient.

Fig. 5 The relation curves of g and the error rates of the prediction intervals derived from the CDFs output by Algorithm 2, BMA-n and EMOS-n. The prediction intervals of Algorithm 2 are reliable for all g ∈ (0, 1), since its curve is near or below the diagonal. The prediction intervals of BMA-n and EMOS-n are not reliable for some g ∈ (0, 1).

Fig. 6 The relation curves of g and the average interval sizes of Algorithm 2, BMA-n and EMOS-n. On the condition that the prediction intervals are reliable, Algorithm 2 is informationally efficient for g ∈ [0.05, 0.2].
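The nonnegativity adjustment described above amounts to censoring the empirical predictive distribution at zero: negative samples are clipped, which places a point mass at zero, matching the frequent dry (zero-precipitation) outcomes. A one-line sketch of this step, under our own naming:

```python
import numpy as np

def censor_at_zero(samples):
    # Clip negative samples to 0; the clipped mass becomes a point
    # mass at zero in the empirical predictive distribution, so the
    # predicted CDF jumps at 0 as precipitation forecasts require.
    return np.maximum(samples, 0.0)
```

Note that clipping (rather than discarding) the negative samples keeps the sample size, and hence the quantile resolution of the empirical CDF, unchanged.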
In summary, Algorithm 2 is valid in the sense of formula (16) for ensemble forecasts of temperature and precipitation, while the comparative algorithms are not. Nevertheless, the prediction intervals derived from the predictive distributions of some of these algorithms can still be reliable, i.e., their error rates are near or below the significance levels, which makes them useful in practice for interval prediction. For significance levels between 0.05 and 0.2, the average interval sizes of Algorithm 2 are shorter than those of BMA-n, BMA-g and EMOS-gev and similar to those of EMOS-csg, which confirms the informational efficiency and effectiveness of Algorithm 2. Moreover, the comparative algorithms are all trained by maximum likelihood, whereas Algorithm 2 only needs matrix multiplications with complexity O(l) for fixed meta-parameters, so our method is easy to implement and learns fast enough for real-time applications.

Conclusion
This paper builds a conformal predictive system for distribution regression. A two-stage process is employed: the input distribution is first embedded into an approximated reproducing kernel Hilbert space, and then the LOO-CCPS-RELM is used to build the conformal predictive system. The experiments confirm the validity of the prediction CDFs and the reliability of the prediction intervals of the proposed algorithm. Comparisons with other widely used Bayesian methods for the postprocessing of ensemble forecasts verify the effectiveness of our algorithm for probabilistic forecasting of temperature and precipitation. Our approach is easy to implement, fast and valid, which makes it promising for real-time and high-risk applications related to distribution regression.

Fig. 7 Tests of whether Algorithm 2, BMA-g, EMOS-csg and EMOS-gev are valid for probabilistic forecasting of precipitation. The curve of Algorithm 2 is close to the diagonal while those of the other three algorithms are not, meaning that Algorithm 2 is valid and BMA-g, EMOS-csg and EMOS-gev are not valid in the sense of formula (16).

Fig. 8 The relation curves of g and the error rates of the prediction intervals derived from the CDFs output by Algorithm 2, BMA-g, EMOS-csg and EMOS-gev. The prediction intervals of all algorithms are reliable, since their curves are all near or below the diagonal.

Fig. 9 The relation curves of g and the average interval sizes of Algorithm 2, BMA-g, EMOS-csg and EMOS-gev. On the condition that the prediction intervals are reliable, Algorithm 2 and EMOS-csg are both informationally efficient for g ∈ [0.05, 0.2].