Mixed logit model based on nonlinear random utility functions: a transfer passenger demand prediction method on overnight D-trains

In recent years, with the development of high-speed railway in China, the operating mileage and passenger transport capacity have increased rapidly in transportation industry. Due to the high density of trains in the daytime, we usually set up skylights at night (0:00–6:00 am) on high-speed railway for comprehensive maintenance. However, this arrangement contradicts with the operation demand of D-series overnight high-speed trains (overnight D-trains for short). In order to adjust the operation plan of overnight D-trains with skylights coordinately, it is necessary to predict the passenger demand of newly added overnight D-trains. Therefore, in this paper, a mixed logit model based on nonlinear random utility functions for different transport modes is proposed, in order to predict transfer passenger demand. According to Maximum Simulated Likelihood Method, the likelihood function of this mixed logit model is proposed to maximize the overall utility value of different passenger groups while Metropolis–Hastings algorithm is adopted to iteratively solve the probabilities of discrete random variables in utility functions. After that, the unknown distributions of parameters are estimated and the optimal solution of this model is provided by traditional algorithms, basic heuristic algorithms and improved heuristic algorithms including improved fireworks-simulated annealing algorithm proposed in this paper, respectively. Finally, a real-world instance with related data of Beijing–Shanghai corridor is implemented to demonstrate the performance and effectiveness of the proposed approaches.


Introduction
With the continuous improvement in railway network, the connectivity across the regions and travel demand of passenger are growing with an unprecedented speed. However, due to the heavily congested passenger flow in peak hours, sometimes passenger demand cannot be satisfied even with the maximum departure frequency. To release traffic pressure and solve the transportation problems, such as the mismatch between transportation resource allocation and passenger demand, overnight D-trains have now been constructed and operated between two cities (Beijing-Guangzhou, Shanghai-Shenzhen, Beijing-Shanghai, etc.). Additionally, the skylight time of the railway is closely B Shuang Ren sren@bjtu.edu.cn 1 School of Computer and Information Technology, Beijing Jiaotong University, 100044 Beijing, China related to practical operation plans of overnight D-trains, including the carrying capacity, departure quantity and departure frequency, which in essence determine the number of passengers allowed to board.
In the beginning of overnight D-trains operation, the reasonable prediction of transfer passenger demand is important for the formulation of the train operation scheme and the maximization of benefits as well. The novel optimal combination utility value with both transfer modes and transport modes of different passenger groups in the prediction model can make the prediction results more accurate and more practical to a great extent. This research intends to explore this issue explicitly.

Literature review
As a key component of passenger transport organization, the prediction of passenger demand has received significant attention from both researchers and practitioners in the trans-portation field. In general, the current prediction methods can be divided into the following categories.
(a) The related factors methods establish prediction models between the passenger flow and the influence factors, such as competitive relationship among various transport modes, social development level and national income. Some experts generally regard passenger demand as a 'derivative demand' implied in various productions and activities, such as Jonas et al. (2004), Wardman (2006), Ahn et al. (2017), ,  and Gelhausen et al. (2018). With the extensive development of the society and economy, closer and closer ties between different regions increase passenger demand to some extent. (b) The time series analysis methods establish passenger flow prediction models by considering time as the independent variable, such as Wang et al. (2012) and Yin et al. (2017). There are also some models combined with time series methods, for example ① seasonal models, Tian et al. (2011), Guang et al. (2017), ② grey models, Li and Wang (2007), Huang and Feng (2014), ③ neural networks, Chen and Grant-Muller (2001), Dia (2001), Stella et al. (2006), Eleni (2007), Eleni et al. (2010), Wei and Chen (2012), Li et al. (2014), Glišović et al. (2016), Ivanov and Osetrov (2018), ④ modal demand models, Wang and Sun (2017) and so on. (c) Spatio-temporal prediction methods are most popular in recent years because passenger demand is not only influenced by dynamical of time, but also limited by the scale of transport network, which means the prediction has spatial and temporal dependence. There are also some models listed in Table 1.
In addition, Castro-Neto et al. (2009a, b) and Roos et al. (2018) think that accurate prediction under atypical conditions (vehicular crashes, inclement weather and holidays) is also crucial to traffic management.
In all prediction methods, the logit discrete choice model is one of the most widely used models to study transfer passenger demand prediction or market shares of different transportation (e.g., train, airplane). There are some improved logit models shown as follows.
(a) Multinomial logit (MNL) model and mixed multinomial logit (MMNL) model, such as Hess and Polak (2005), Jou et al. (2011) and Lee et al. (2016). MNL model is most common in practice with the advantages of simplicity, reliability and easy implementation. An important property of MNL model is Independence from Irrelevant Alternatives (IIA), which means the ratio between the choice probabilities of any two alternatives is unaffected by the systematic utilities of other alternatives. However, MNL model also has some inherent theoretic defects and then leads to the need for more refined models. MMNL model offers significant advantages over MNL model by allowing for random taste variation across decision makers. But the biggest drawback of MMNL model is that integrals of choice probabilities do not have a closedform expression and need to be approximated through simulation. (b) Nested logit (NL) model, generalized nested logit (GNL) model, cross-nested logit (CNL) model and random parameter nested logit (RPNL) model, such as Jung and Yoo (2014), Cheng and Yang (2015) and Yu et al. (2018). In order to overcome the restriction of MNL model, NL model is first proposed as an extension of MNL model. NL model partitions the choice set into several separated sub-nests and puts similar alternatives in the same subnest, so that the correlation among alternatives within each sub-nest can be captured. However, when alternatives cannot be partitioned into well separated nests, NL model is not appropriate. CNL model is a direct extension of NL model, where each alternative may belong to more than one nest and the correlation across nests can be estimated. (c) Mixed logit (MXL) model is suitable for handling random preference problems, including correlation among alternatives and random coefficients. It allows the unobserved factors to follow any distribution. See Sect. 2.2.1 for details.
Most of existing researches of the logit models are based on utility functions, in order to maximize the utility values of passengers. For comparative convenience, we list the detailed characteristics of some closely related references in Table 2, including transport modes, variables in utility functions and formulated models.

Focus of this study
Nowadays, there is little research on the market share of transport modes after newly added overnight D-trains. Obviously, running overnight D-trains is not only convenient for long distance travel of passengers, but also improves the capacity and volume of the railway. Therefore, it is urgent to accurately predict market share and transfer passenger demand of newly added overnight D-trains. In the literature mentioned above, two problems remain unsolved.
(a) The nonlinear relationships between the utility values and the influence factors about different income passengers' choices about different transport modes are neglected in most of transfer passenger demand prediction approaches, i.e., logit models in a linear way.
(b) The focus of recent research in Table 2 is how to choose the variables in the models reasonably and comprehensively by studying the influence factors of different transport modes. The number of variables involved in the research is relatively small, so some traditional software, such as SPSS or ALOGIT, and traditional methods, such as Newton Method, can usually be used to estimate unknown parameters. Nevertheless, the growing variables make the time complexity by using Newton Method too large to non-convergence of the results. Therefore, Han et al. (2020) adopt some heuristic algorithms to solve the problem and the experiment demonstrates that heuristic algorithms outperform the traditional methods.
In summary, our primary contributions are as follows.
(a) It is the fact that the function relationship between the utility value of passengers and the influence factors about the travel intention of passengers is nonlinear. In this paper, we first consider the travel intention of passengers to choose different combinations of transfer modes and transport modes, and then establish different nonlinear utility functions, mainly including the passenger income function, the ratio function of travel expenses to travel time value and the arrival time function for different types of transport modes. (b) The randomness of passengers' choices about different transfer modes and different transport modes may lead to the uncertainty of both total travel expenses and total travel time. Thus, the model proposed in this paper considers total travel expenses and total travel time as discrete random variables in the objective function, so as to get close to the actual decision-making optimizations and maximize the total utility value of all involved passengers in the real situation. (c) In this paper, we adopt some representative basic heuristic algorithms, including Ant Colony Algorithm, Simulated Annealing Algorithm and Fireworks Algorithm and then propose an improved combinatorial heuristic algorithm, i.e., Improved fireworks-simulated annealing algorithm, to estimate the unknown distributions of parameters of utility functions and solve the optimal solution of the model, respectively. Finally, the comparison results of their running time, optimal solutions and convergence speed show that the improved algorithm we proposed can significantly improve the baseline and achieve state-of-the-art performance.
The technical route of the specific research in this paper is shown in Fig. 1 (Best viewed in color). The rest of this paper is organized as follows. In Sect. 2, a mixed logit model based on nonlinear random utility functions is formulated to predict

Mathematical formulation
To characterize this problem of interest in a mathematical way, this section will explicitly discuss the formulation process for a mathematical model for transfer passenger demand prediction. First, a mixed logit model based on nonlinear random utility functions is established. Then, to reduce the complexity of solving this model, we adopt Maximum Simulated Likelihood Method to restructure a corresponding likelihood function for the objective function.

Problem description
To formulate the problem, three types of transport modes (i.e., air, highway and railway) are taken into consideration for passengers to choose during long-distance travel in this study. Therefore, the travel process (i.e., OD) of passengers from Origin (O) to Destination (D) can be divided into three stages (see Fig. 1 for details): ① from Origin (O) to airport/train station/bus station, ② by an airplane/train/intercity bus (with different seat grades), ③ from airport/train station/bus station to Destination (D). Therefore, many combination results will be obtained from choices about different transfer modes and transport modes, which is the basis for setting the proposal probabilities of random variables in subsequent sections. In order to simplify or idealize some realistic conditions, throughout this paper the model is developed at the level of the income groups instead of the individual. According to the 'per capita disposable income of urban residents' of the National Bureau of Statistics in China, we divided all passengers into three groups according to different monthly incomes, i.e., low-income passenger group, medium-income passenger group and high-income passenger group. In fact, passenger groups with different income levels have different consumption habits. Specifically, low-income passengers tend to be more sensitive to total travel expenses, so they probably do not choose transport modes with high-quality, fast and high-price. Medium-income passengers often seek a more comfortable transport mode with acceptable price and good service quality. Finally, for high-income passengers who focus on speed and comfort, they often choose  The core of this paper i.e., mathematical model, is in the blue box transport mode which is more time-saving, comfortable with relatively better service quality. Inspired by 'Utility-maximizing rule' 1 in Utility Theory of Lopez et al. (2011), we put forward an assumption that passengers will evaluate the utility value of different available transport modes and always choose one transport mode with maximum utility. Under this condition, the complex relations between different income levels and the utility values of different transport modes are as follows. ① About air: Passenger income is proportional to utility value, i.e., the higher the passenger income is, the more the utility value will be and vice versa. ② About railway: When passenger income is in a certain range, utility value is the highest, and when income is lower or higher, utility value will decrease. ③ About highway: Passenger income is inversely proportional to utility value, i.e., the lower the passenger income is, the more the utility value will be and vice versa.

Mathematical model
In line with the research of Han et al. (2020), safety, rapidity, economy, convenience and comfort are the main factors affecting passenger choices about different travel modes. Specifically, the most important factors affecting travel intention of passengers are ① safety, which cannot be quantified and has little difference for each transport mode, ② rapidity, which can be quantified as travel time, and ③ economy, which can be quantified as travel expenses. In addition to these objective factors, the income levels of passengers and arrival time also indirectly determine their travel intention to a large extent. Therefore, in order to make the model much closer to the actual case, besides two objective factors including travel expenses and travel time, we consider passenger income and arrival time while constructing utility functions. See Sect. 2.2.2 for details.

Mixed logit model
The mixed logit (MXL) model ( McFadden and Train (2000), Hensher and Greene (2003) and Train (2003)) extends the standard logit model and overcomes the limitations in MNL model and NL model by allowing one or more parameters in the model to be randomly distributed. It means that different decision makers may have different preferences and the IIA property in MNL model no longer holds in this model. In this paper, we adopt a mixed logit model to solve market shares and predict passenger demand. First the standard logit model mainly includes the following definitions.
(a) Model parameters A set of individuals (passenger groups Q = {q|q = 1, 2, 3} where q = 1 is low-income passenger group, q = 2 is medium-income passenger group and q = 3 is high-income passenger group), a set of available alternatives (transport modes M = {m|m ∈ N + }), a set of measured attributes X ) of the individuals and their alternatives.
(b) Decision rules The individual q will choose the alternative m which maximizes individual utility. Therefore, the utility value is based on comparison and evaluation of each decision maker's attractiveness to alternatives with different characteristics, which can be expressed as functions of different attributes. (c) It is impossible to possess complete information about all attributes that determine one decision, so errors may arise for specification or observational reasons. To take the unobserved measurement error into account, the true attribute X * or its function f (X * ) is effectively replaced by X + ε or f (X * ) + ε, respectively. In general, for individuals who have the same set of alternatives and face the same constraints, it can be assumed that the unknown random disturbance term ε is independent and obeys the identical distribution with the same mean and variance.
In line with utility-maximizing rule, we suppose that a passenger group q always chooses one transport mode m with the maximum utility value. Specifically, a utility U mq can be represented by two components, i.e., observed representative component, V mq which is a function of observable attributes X , and an unknown random disturbance term, ε mq which represents unobserved attributes/measurement errors/observation errors. Therefore, the utility value U mq = V mq + ε mq is random and associated with a probability, i.e., or more explicitly, i.e., (2) It means the probability of a passenger group q depends on a transport mode m that the utility of m is greater than that of any other choices l ∈ M. From above joint probability distribution of disturbance term {ε mq , m ∈ M, q ∈ Q}, a specific random utility model can be obtained. According to (c), each distribution of ε mq is assumed to be the Gumbel extreme value distribution and has the following cumulative distribution function, i.e., with mean zero and variance δ 2 . So the distribution function about the difference between two random errors ε mlq = ε mq − ε lq is In standard logit model, the probability of passenger group q to choose transport mode m can now be expressed as follows, The MXL model considers that the unknown parameter vector to be estimated is a random vector instead of a fixed value, which obeys a certain distribution due to different preferences of passenger groups. Therefore, the probability of passenger group q to choose transport mode m should be the probability expected value of parameter vector θ traversing all possible values and can be expressed as follows: The above probability function of MXL model can be regarded as the weighted average of the probability of standard logit model and the weight is determined by the distribution density function f (θ |ψ). In this function, parameters in θ are random variables, which may obey normal distribution, uniform distribution, etc., and ψ are the unknown characteristic parameters to be estimated in density function, such as the mean and variance of normal distribution.

Nonlinear random utility functions
In previous parts, we explore the function relationships between each attribute (including travel expense, travel time, passenger income and arrival time) and utility value, respectively.

(a) Travel expense and travel time
Based on the fact that the likelihoods of the choices about different transfer modes and transport modes are different for passengers with different income levels, the probability distribution regularities of discrete random variables are summarized as follows. ① About low-income passenger group: The lower the total travel expenses are, the more the probabilities will be, while travel time has little effect on them. ② About medium-income passenger group: All probabilities are similar and do not change with total travel expenses and total travel time. ③ About high-income passenger group: The less the total travel time is, the more the probabilities will be, while travel expenses have little effect on them.
Here, two types of discrete random variables are taken into consideration. The first type is expense denoted byẽ. It aims to generate the total travel expenses including transfer cost and price ticket for different passenger groups' choices about different transfer modes and transport modes. The second type is corresponding time denoted byt, which indicates the total travel time including transfer time, waiting time and in-vehicle time (or running time). By analyzing the combination results of different transfer modes and transport modes, these two attributes are related, and when a passenger chooses one certain combination with transport mode m, total travel expensesẽ m and total travel timet m can be determined accordingly.
Mathematically, the probability distribution of discrete random variableX is as follows: and E(X ) = x 1 · p 1 + x 2 · p 2 + · · · + x i · p i + · · · + x n · p n is called the mean or mathematical expectation and reflects the average level ofX . Therefore, in this paper, we adopt the mathematical expectations of two discrete random variables, E(ẽ m ) and E(t m ), to calculate the corresponding utility functions. Because the units of time and expenses are not uniform and belong to different dimensions, the utility value cannot be calculated in one function. So we introduce time value to replace travel time in order to unify their dimensions. The unit (hour) time value of passenger group q is denoted by v q = s q 30·24 , where s q is monthly income of passenger group q and the denominator is the number of hours converted per month. Inspired by the research of Wardman (1998) demonstrating that the marginal utility of time increases with income, the time value of high-income passenger group is higher than that of low-income passenger group and medium-income passenger group to a certain extent.
Additionally, the utility value about rapidity, economy and comfort of transport modes is mainly measured by travel time reflected by fatigue recovery time and travel expenses with different seat grades. So the utility of the choices about different transport modes is inversely proportional to travel time and directly proportional to travel expenses. In line with the work of Shi et al. (2007), the function relationship between them is shown in Fig. 2. On the one hand, the less the travel time is, the higher the utility value is. But with the increase in time, the increment of utility value brought by saving unit time will decrease. On the other hand, the higher the travel expense (seat grade) is, the higher the utility value is. However, with the increase in travel expense, the increment of utility value brought by more unit expense will decrease. So in order to describe this relationship more accurately, we establish the following function where ϕ (R) m (ẽ m ,t m , α m1 , α m2 ) denotes the ratio function of According to the complex relationships mentioned in Sect. 2.1, the passenger income functions should be established as curves which can map variables between 0 and 1 to standardize utility functions. Particularly, here we adopt a sigmoid function and its variants to study this attribute due to the following characteristics of this function. ① This function and its inverse function are smooth and monotonic. It is a bounded and differentiable mathematical function having a characteristic S-shaped growth curve which is constrained by a pair of horizontal asymptotes as x → ±∞, shown in the solid line of Fig. 3 and defined by the following formula where (0, 0.5) is a symmetry point, i.e., ② This function has a non-negative first derivative at each point. Its bell-shaped derivative function can be expressed by the dashed line of Fig. 3 and defined by the following formula, i.e., ③ At the same time, this function can map a real number to the interval of (0,1) that can deal with the normalization problem of this attribute.
Consequently, many natural processes use sigmoid function or its deformation to establish the model. So does this paper. First, the establishment of passenger income function about air (i.e., airplane) exhibits a progression from small beginning at the range of low-income level that accelerates and approaches a climax with the increase in passenger income. Therefore, we adopt the deformation of a sigmoid function to denote this function. Secondly, passenger income function about highway (i.e., intercity bus) is contrary to that about air as stated above. This function is based on an inverted S-curve, i.e., an inverse function of sigmoid function. Hence, we adopt the deformation of an inverse sigmoid function to denote this function. Finally, passenger income function about railway (i.e., various types of trains) is between the above two functions, i.e., the combination of the first and second half in passenger income function about air and highway, respectively. In summary, we establish all passenger income functions as follows whose graphs are shown in Fig.  4. (a) Air: (c) Highway: (c) Arrival time In order to analyze passengers' preference and utility for arrival time, Lei et al. (2018) obtain passengers' choices about the ideal arrival time by conducting an intention survey. The results illustrate that there are two peaks in passenger choices of ideal arrival time, as shown in Fig. 5 named as bimodal distribution. The vertical axis represents probability of passengers who choose corresponding arrival time. Therefore, we choose probability density function with bimodal shape to describe this function relationship between arrival time and utility enlightened by Jiang (1997) Based on the above analysis, we establish different nonlinear random utility functions for different transport modes. The utility value of passenger group q (q ∈ Q) to choose transport mode m (m ∈ M) is expressed by the following function, i.e., Moreover, the coefficient parameter vector where ω 1 denotes the weight about ratio of travel expenses to travel time value, ω 2 and ω 3 denote the weight of passenger income and arrival time, respectively, α, β and γ are position and scale parameters in each formula. The elements in parameter vector θ obey different distributions. Then, in next subsection, the above nonlinear random utility function will be substituted into the mixed logit model established in this paper. To reduce the difficulty of solving this complex model, we will adopt Maximum Simulated Likelihood Method to construct a corresponding likelihood function.

Maximum likelihood estimation
The purpose of this study is to maximize the overall utility of all passenger groups in this model. We need to estimate the unknown distributions of elements in parameter vector. Because the probability function of MXL model cannot get the result by analytical method, it needs to use simulation method to calculate the probability and the optimal parameter value. In this paper, Maximum Simulated Likelihood Method is adopted to solve this problem and the specific steps are as follows.
Step 1 Calculate the simulation probabilityP mq .
Under the premise of given characteristic parameters ψ, a parameter vector θ is randomly chosen from the density function and recorded as θ w , w = 1 in the first extraction. Common random sampling methods are pseudo-random sequence method and quasi-random sequence method. Then, based on Equation 5, we can calculate the value of P mq (θ w ), and then repeat iteration for W times to calculate the mean of all P mq (θ w ) as the simulation probability, i.e.,P mq = w∈W P mq (θ w )/W .
Step 2 Construct maximum likelihood operator.
The likelihood function is concave and can be constructed as follows under a suitable condition. Therefore, we can obtain a unique solution, i.e., the maximum value of the likelihood function.
(a) First, we construct simulation likelihood function as follows for a random sample of size M, which can be viewed as the product of the choice probabilities associated with Q sets of independent observations. Here, the first subset q=1 choose alternatives m from 1 to m 1 , the next one q=2 choose alternatives m from m 1 + 1 to m 1 + m 2 , and so on.
(b) Then, we introduce 0-1 variables y mq , m ∈ M, q ∈ Q, as choice indicators of different passenger groups, i.e., y mq = 1 passenger group q choose transport mode m, 0 otherwise, and m∈M y mq = 1, q ∈ Q.
Because each passenger group can only choose one transport mode and all passenger groups are independent, the expression of φ * can be simplified by y mq as follows: (c) Because ln φ * is the monotone increasing function of φ * , ln φ * and φ * have the same extreme point. In order to solve this mixed logit model conveniently, we choose corresponding log-likelihood function (i.e., maximum likelihood operator) as objective function of MXL model.
Step 3 Solve the characteristic parameters. The value of coefficient vector θ is changed iteratively until the maximum value of the simulated maximum likelihood operator is obtained, and the unknown char-acteristic parameter ψ of the density function is obtained by some heuristic algorithms in Sect. 3.2. In this paper, the optimization objective is to maximize log-likelihood function including nonlinear random utility functions about travel expenses, time values, passenger income and arrival time. Then, we set the elements in parameter vector θ to obey normal distributions with different means and variances. In this way, the mixed logit model based on the nonlinear random utility functions is shown as follows.

Solution approaches
In this section, we develop solution approaches of the model we established in this paper. Specifically, we first approve Metropolis-Hastings (M-H) Algorithm to solve the probabilities of different passenger groups' choices about different transfer modes and transport modes, according to proposal probabilities of discrete random variables. Then, we adopt three basic heuristic algorithms (Ant Colony Algorithm, Simulated Annealing Algorithm and Fireworks Algorithm) and two improved heuristic algorithms to solve the unknown distributions of elements in parameter vector to be estimated and obtain satisfactory solutions of objective function in short time, respectively. Finally, we compare the performance of these algorithms according to a real numerical experiment in Sect. 4.

Metropolis-Hastings algorithm
Given the distribution regularities in Sect. 2.1, we adopt Bayesian to deal with the probabilities of discrete random variables with different passenger income levels, namely prior probabilities in Bayesian statistics, and the probabilities are mainly determined by expert experiences and historical facts (sample data or observational data). Bayesian is a method to obtain objective posterior distribution on the basis of subjective prior distribution, and the corresponding Bayesian formula is shown as follows: . (18) where p(ψ i ) and p(ψ i |x) denote the prior and posterior probabilities/distributions of ψ i , respectively, p(x) and p(x|ψ i ) denote the marginal distribution and likelihood function of x, respectively.
Usually we need to sample from posterior probabilities of Bayesian. Since marginal distribution is a fixed value, posterior distribution is the proportional to product of prior distribution and likelihood function. The likelihood of marginal distribution may not be considered in the sampling of posterior distribution. But in generally, posterior probabilities are difficult to solve, mainly because we can't map to a random number that obeys Uniform Distribution. Therefore, the most feasible methods are Metropolis Algorithm and Metropolis-Hastings (M-H) Algorithm to solve the posterior probabilities. Typically, prior probabilities in Bayesian are the same as proposal probabilities in these two algorithms. Usually as the proposal probability q( j|i) doesn't satisfy detail balance condition 2 in Metropolis Algorithm, we choose M-H Algorithm to solve this problem.
The M-H Algorithm is developed to construct a transition matrix P of a time-reversible Markov chain {X t } with a given desired stationary distribution π . At the current state i, the next state j is proposed with a proposal probability q(i, j), which is a state transition probability of an arbitrary irreducible Markov chain on the state space N , where q(i, j) > 0 if and only if i = j, and q(i, j) = 0 for all i = j. Let Q {q(i, j)} i, j∈N be a proposal (transition) matrix. The state transition from state i to state j is accepted with an acceptance probability α( j|i) as follows and rejected with probability 1-α( j|i).
Then, we construct the Markov chains by multiplying an acceptance probability, i.e., So we summarize the procedure of M-H Algorithm as follows.
Step 1 Predefine parameters, including proposal probability q of random variables, acceptance probability α and maximum iteration times iter_max.
Step 2 Do for i = 1, 2, ..., iter_max, Step 2.1 Generate a current state i and a new state j from the proposal distribution.
Step 2.4 If u is a convergence to stationary distribution, i.e., u < α, we retain this current state i.
Step 3 Calculate posterior probability, which is equal to the number of preserved states divided by total iteration times.
As addressed above, the mathematical expectations of discrete random variables, i.e., E(ẽ m ) and E(t m ), which can be calculate by the corresponding probabilities using M-H Algorithm, are substituted into the objective function which need to be optimized.

Heuristic algorithms
Motivated by the success of Han et al. (2020), we learn that heuristic algorithms not only cost less time than traditional algorithms (e.g., Newton method), but also outperform traditional algorithms in optimal solution. The specific reasons we summarized are as follows.
Newton method has strict requirements and restrictions for functions. First, the second derivative of the objective function is continuous and differentiable. Second, the computational complexity is large because the inverse matrix or the pseudo-inverse matrix of Hessian matrix needs to be solved in each step. Moreover, gradient explosion or gradient disappearance occurs easily in the iteration process. Finally, when the initial point is chosen improperly, it often leads to local convergence or non-convergence of the function.
To tackle these challenges, various heuristic algorithms are successively proposed and increasingly used in recent research. The main characteristics of some heuristic algorithms can be summed up from four aspects, i.e., positive feedback mechanism, distributed computation, greedy heuristic search and strong robustness. Positive feedback mechanism can account for rapid discovery of good solutions, distributed computation can avoid premature convergence, and the greedy heuristic search can find acceptable solutions in early stages of the search process.
Based on the principle of bionics, heuristic algorithms mainly abstract some phenomena in nature into algorithms to deal with the corresponding problems. From using them to approach the optimal solution as much as possible, a relative optimal solution is obtained. Usually, heuristic algorithms are easy to find the global optimal solution instead of falling into the local optimum. Moreover, the optimal solution of a search problem can be obtained in short time. For NP problems, a better solution can also be obtained in polynomial time.
In this section, five heuristic algorithms including Ant Colony Algorithm (an imitative animal algorithm), Simulated Annealing Algorithm (an imitative physics algorithm) and Fireworks Algorithm (an imitative chemistry algorithm), Improved simulated annealing algorithm (proposed by Han et al. (2020)) and Improved fireworks-simulated annealing Algorithm (proposed in this paper) are used to solve the unknown distributions f (θ |ψ) of elements in parameter vector to be estimated, and the optimal solution of this model, i.e., φ(θ). Experiment in Sect. 4.1 demonstrates that our algorithm achieves state-of-the-art performance in these benchmarks.

Ant Colony Algorithm
Ant Colony Algorithm (ACA), first proposed by Dorigo (1992), is designed by imitating the cooperative manner of an ant colony and the characteristics of ants foraging behaviors, and then abstracting this manner into mathematical description. In biology, the foraging behaviors of an ant colony have the following characteristics.
(a) While building the paths from their nest to food, ants will deposit and sniff a chemical substance, which can mark the paths and provide ants with the ability of communication with each other, called pheromones. (b) Generally speaking, ants essentially move at random, but they always choose one path with higher concentration of pheromones and release a certain amount of pheromones to enhance the concentration of pheromones on this path. Therefore, the higher the concentration of pheromones is, the shorter the distance of corresponding path will be. (c) With the continuous actions of ant colony, the shorter paths are more frequently visited and become more attractive for the subsequent ants. By contrast, the longer paths are less attractive because the pheromones trail will evaporate with the passing of time. Finally, the shortest way from nest to food is found. Nowadays, this algorithm is widely used in optimization problems. The basic idea of applying it is that a feasible solution is expressed by a walking path of an ant and all paths of the whole ant colony constitute the solution space. Finally, the whole ant colony will concentrate on one path corresponding to the optimal solution. So by analyzing the foraging process of ant colony, we estimate the unknown distributions of elements in parameter vector and solve the optimal solution of model by ACA. The detailed solution process is summarized by mathematical method as follows.
Step 1 Predefine parameters, including ant colony size (the total number of ants) ant_max, the pheromones volatilization coefficient ρ, total pheromones released by ants for one iteration Q (constant), a constant of transfer probability p 0 , maximum number of iterations iter_max. Initial pheromones τ t (0) = φ(θ).
Step 2 Do for i = 0, 1, 2, . . ., iter_max, Step 2.1 Do for t = 1, 2, . . ., ant_max, Step 2.1.1 Each ant is randomly placed in different positions, and the next position of ant t, namely next feasible solution, is determined according to transfer probability p t , i.e., Step 2.1.2 Do following judgments about the distance between ant t and the position with the highest concentration of pheromones (i.e., the current maximum of the function): (a) Local Search The closer the distance is, the transfer probability p t will be smaller, and variable value θ will tend to be fine-tuning, i.e., θ(i + 1) = θ(i) + rn · λ where rn is a 0-1 random number, and λ = 1/(t + 1) is the expected level of θ and it gradually decreases as the iteration progresses. (b) Global Search The farther the distance is, the transfer probability p t will be larger, and the algorithm will tend to search for the optimal value in a wider range, i.e., θ(i + 1) = θ(i) + rn · (ub − lb)/2, where ub is upper bound and lb is lower bound of θ . According to the above searches, iterative formula of parameters to be estimated is Step 3 Calculate the pheromones of each path, and update the concentration of pheromones by the iteration formula as follows, i.e., τ t (i At the same time, the optimal solution of the current iteration is recorded.

Simulated Annealing Algorithm
Simulated Annealing Algorithm (SAA), first proposed by Metropolis (1953), is initially inspired by the change rules of internal molecular state and internal energy of solids in the process from high temperature to low temperature. This algorithm takes the temperature of the solid as the control parameter, and with the decrease in temperature, the internal energy of the solid (i.e., the objective function value) decreases gradually until it reaches the global minimum. It is a greedy algorithm; meanwhile, the random factors are introduced in the search process. If it accepts a solution worse than the current one with a certain probability, it is possible to jump out of the local optimal solution and reach the global optimal solution. We estimate the unknown distributions of elements in parameter vector and solve the optimal solution of model by SAA and the detailed solution process is summarized as follows.
Step 1 Predefine parameters, including an initial temperature T 0 , a termination temperature T min , an initial state θ 0 , maximum number of disturbance dis_max and maximum number of iterations iter_max. Let current temperature be T 0 and current state be θ 0 .
Step 2.2 Exit the loop until T i < T min .
Step 3 Set T i+1 = λ · T i where the cooling coefficient λ (0 < λ < 1) is used to control the speed of cooling. The larger the value of λ is, the slower temperature will drop.

Fireworks Algorithm
Fireworks Algorithm (FWA), first proposed by Tan and Zhu (2010), is inspired by the splendid fireworks in the sky. In this algorithm, two explosion (search) processes are employed and mechanisms for keeping diversity of sparks are also well designed. However, the explosions of fireworks are distinguished from good to bad. For a good explosion, the generated sparks are dense and numerous, and vice versa.
For mimicking the process of setting off fireworks, a rough framework of this algorithm is depicted as follows. When a firework is set off, a shower of sparks will fill the local space around the firework. The explosion process of a firework can be viewed as a search in the local space around a specific point where the firework is set off through the sparks generated in the explosion. When we need to find a point x satisfying f (x) = y, we can continually set off 'fireworks' in the local space until one 'spark' is fairly near the point x. For each generation of explosion, we first choose n locations, where n fireworks are set off. Then, after explosion, the locations of sparks are obtained and evaluated. When the optimal location is found, the algorithm stops. Otherwise, n other locations are chosen from the current sparks for the next generation of explosion.
In this paper, we adopt a simplified version of Fireworks Algorithm, called Bare Bones Fireworks Algorithm (BBFWA) proposed by Li and Tan (2018). It adopts only three essential operators of Fireworks Algorithm. ① The explosion operator generates numerous explosion sparks around the locations of the fireworks. ② The mutation operator mutates the locations of the fireworks to generate mutation sparks. ③ The selection operator selects the fireworks of the new generation from the sparks in the last generation. The mechanism of this algorithm is easy to understand, and the time complexity of this algorithm is linear. It can be easily implemented on a variety of platforms, which is convenient for real-world applications. Experimental results on a standard benchmark indicate that its performance is serviceable and stable.
We estimate the unknown distributions of elements in parameter vector and solve the optimal solution of model by BBFWA whose detailed solution process is summarized as follows.
Step 1 The relevant parameters need to be initialized, including the lower and upper boundaries of the search space lb and ub, the location of the firework f ∼ U (lb, ub), the locations of explosion sparks s, the explosion amplitude A = ub − lb, the maximum iteration times iter_max, an amplification coefficient C a > 1 and a reduction coefficient C r < 1.
Step 2 Do for i = 1, 2, . . ., iter_max, Step . Note that if a spark is located outside the boundaries, it can be replaced by a randomly chosen one in the feasible space.
Step 3.2 If the best spark is a better solution than the firework, it will take the place of the firework. That is, if maxφ(s i ) < φ( f ) then f ← argmaxφ(s i ) and A ← C a · A. Otherwise, A ← C r · A and the current firework will be kept.
Step 4 Exit the loop until that termination criterion is met.

Improved simulated annealing algorithm
The cooling coefficient λ ∈ (0, 1) is an important parameter affecting the convergence of SAA from the procedure of this algorithm. When this coefficient is too large, the solution accuracy is high, but the algorithm runs for a long time. On the other hand, when this coefficient is too small, the search accuracy is low. Therefore, to deal with the problem, improved simulated annealing algorithm of cooling coefficient (ISAA-CC) is first proposed by Han et al. (2020) to improve the efficiency of the basic algorithm and the quality of the optimal solution. They improve the cooling strategy in the basic algorithm as follows.
The value of λ is set as large as possible at the beginning of the algorithm, so that the algorithm has strong global search ability. With iterations of the algorithm, the value of λ decreases, in order that the algorithm can search for the optimal solution better. As a result, a function relationship between the cooling coefficient λ and the iteration times as shown in Fig. 6 will be established as follows:

Improved fireworks-simulated annealing algorithm
Because of the fast convergence speed and running speed of Fireworks Algorithm, and the optimal solution of Improved simulated annealing algorithm, we propose an improved algorithm combining the advantages of both algorithms in this paper, named Improved fireworks-simulated annealing algorithm (IFW-SAA). The BBFWA is not globally convergent because there is no stochastic mutation operator. This situation can be fixed easily by setting the reduction coefficient C r and the amplification coefficient C a or adopting a suitable mutation operator if necessary. According to ISAA-CC in Sect. 3.2.4, we introduce the cooling strategy into Fireworks Algorithm. A function relationship between the reduction coefficient C r ∈ (0, 1) and the iteration times is shown as Eq. 22. A similar function relationship between the amplification coefficient C a ∈ (1, 2) and the iteration times is shown as Eq. 23. The values of C r and C a are set as large as possible at the beginning of the algorithm, so that the algorithm has strong global search ability. With the iterations of the algorithm, the values decrease, in order that the algorithm can search for the optimal solution better. (22) In Sect. 4.2, we use these heuristic algorithms to solve the problem, respectively, and then compare the performance of them. The results show that heuristic algorithms proposed in this paper are better than others in terms of the running time and optimal solution.

Numerical experiment
Beijing-Shanghai high-speed railway connects two major economic zones in China, i.e., the Bohai Sea Rim and the Yangtze River Delta, and runs through three municipalities, i.e., Beijing, Tianjin and Shanghai. This railway is the longest high-speed line ever constructed in a single phase all over the world with the largest investment scale and the highest technology content. Because Beijing-Shanghai high-speed railway is very typical in China, it is urgent for us to predict the passenger demand and judge the feasibility of operation plan about overnight D-trains in the eastern line to be opened. Therefore, in this section, we give an in-depth analysis and predict transfer passenger demand of newly added overnight D-trains in Beijing-Shanghai corridor. This real-world case is implemented to illustrate the effectiveness and robustness of the proposed model and algorithms. Note that we can also use the similar method in other corridors, such as Beijing-Guangzhou and Shanghai-Shenzhen.
In this experiment, the set of transport modes is M = {m|m = 1, 2, 3, 4, 5}. For newly added transport mode, i.e., D-series overnight high-speed train (overnight D-train for short) where m = 5, the most competitive transport modes are airplane where m = 1, intercity bus where m = 2, T-series express train (T-train for short) where m = 3, and G-series high-speed train (G-train for short) where m = 4, respectively. Meanwhile, according to the 'per capita disposable income of urban residents' of the National Bureau of Statistics in China, we divide all passengers into three passenger groups with different income levels, i.e., low-income passenger group (Low-for short) whose monthly income is less than CNY3,000, medium-income passenger group (Medium-for short) whose monthly income is between CNY3,000 and CNY8,000 and high-income passenger group (High-for short) whose monthly income is larger than CNY8,000.
The algorithms in this paper are all coded and solved by Python 2.7 Version in Pycharm on Window 7 computer with 3.4 Gb processor. All data used in numerical experiments are shown in Appendix 1 including expenses and time of different transport modes in Beijing-Shanghai corridor.

The computation of Metropolis-Hastings algorithm
In this paper, according to different proposal probabilities of discrete random variables, M-H Algorithm is used to solve the probabilities of different passenger groups' choices about different transfer modes when travelling in different transport modes. We set the iteration times and the number of samples to 1000, and the results are shown in Fig. 7. The five radar charts are about airplane, intercity bus, T-train, G-train and newly added overnight D-train, successively. Figure 7 shows that the results obtained by M-H Algorithm conform to the distribution regularities in Sect. 2.1 and keep consistent with the actual situation. They do not make much difference from the initial proposal distribution. Specifically, five irregular closed-loops of medium-income passengers in radar charts show that the probabilities of different combinations of transport modes and transfer modes to be chosen is approximately equal. Generally speaking, less time corresponds to higher expense about transfer modes, i.e., bus, metro or taxi, and more comfortable seats correspond to higher prices about transport modes, i.e., airplane, intercity bus, T-train, G-train or overnight D-train. Therefore, the corresponding probabilities about each irregular closed-loops of low-income passengers and those of high-income passengers in each radar chart are just the opposite. The fundamental reason for this result is that travel time is the primary factor for high-income passengers, while travel expense is the primary factor for low-income passengers.

The computation of five heuristic algorithms
Five heuristic algorithms are adopted to solve the mixed logit model based on nonlinear random utility functions in this paper. The distributions of parameter vector, i.e., f (θ |ψ), and the optimal solution of the model, i.e., φ(θ), are obtained.  By running five heuristic algorithms, the estimations of characteristic parameters in parameter vector, i.e., mean errors and standard deviations, are shown in Table 3. Moreover, the probabilities that different (Low-, medium-and high-) income passenger groups choose different transport modes (airplane, intercity bus, T-train, G-train and newly added overnight D-train) and market shares of different transport modes are shown in Table 4. There are the contrasts of five heuristic algorithms shown in Table 5 and Fig. 8 in this numerical experiment. Figure  8 shows the market shares of different passenger groups' choices about different transport modes. In comparison of five heuristic algorithms, the market shares of overnight Dtrain are very close and the maximum difference is not more than 0.0328. The running time in Table 5 is unified into the time spent in about 10,000 iterations. It can be concluded that the optimal solutions of five heuristic algorithms are also very close and the maximum difference is not more than 0.123. Moreover, the improved algorithm proposed in this paper has the shortest running time and the optimal objective function value.
And the relationships between the optimum solutions and the iteration times (about 10,000 iterations) obtained by five heuristic algorithms above are shown in Fig. 9. In summary, the aim of the model we established is to maximize the log-likelihood function and then to maximize utility of all passenger groups. Figure 9 shows that, as the iteration times increase, both ACA and SAA fluctuate greatly during the search process and have the slowest convergence speed that converge to the optimal solution more than 9000th iterations. Due to the advantage of BBFWA, the improved algorithm we proposed in this paper has the fastest convergence speed, and it converges to the optimal solution until about 500th iterations. The second is BBFWA and it converges to the optimal solution until about 5300th iterations. Moreover, ISAA-CC converges to the optimal solution until about 7600th iterations.

Transfer passenger demand prediction
From Fig. 8, we can see that the rankings of market shares solved by all algorithms have no big differences. Among them, G-train is one of the most popular transport modes in Beijing-Shanghai corridor at any income levels because of its high speed, its appropriate departure time and high running frequency. Meanwhile, we concluded in this paper the     (a) Long-distance and high-speed operation at night offers more comfortable experience than travelling by trains in the daytime, and saves the night accommodation costs for passengers. (b) Overnight D-trains run at a maximum speed per hour of 250 kilometers and are faster than overnight K-trains and overnight T-trains which run at a maximum speed per hour of 160 kilometers. (c) The operation range of overnight D-trains has been increased from about 1500 kilometers to about 2500 kilometers, so it is more suitable for passengers who choose long-distance and fast railway travel, especially those for business and travelling. Finally, if we know the real-world dataset of total passenger demand between Beijing and Shanghai in a certain day or a certain year, based on the above numerical results, the transfer passenger demand of new transport modes, such as newly added overnight D-trains in this paper, can be solved.

Conclusion and future research
The passenger demand prediction is a common method for managing the passenger flow in order to reduce the mismatch between transportation resources allocation and passenger demand. To research transfer passenger demand of newly added overnight D-trains and characterize the problem in a mathematical way, a mixed logit model based on nonlinear random utility functions is formulated in this paper. According to Maximum Simulated Likelihood Method, the likelihood function of the model is formulated to maximize the overall utility of passenger groups. In particular, two types of discrete random variables are introduced in utility functions, i.e., total travel expenses and total travel time, and we adopt M-H Algorithm to iteratively solve the probabilities. While we can use traditional algorithms to solve the model established in this paper, the calculation complexity is high due to relatively more variables and parameters and it becomes a significant problem in the solution process. In view of this fact, the model is solved by five heuristic algorithms, respectively, and these algorithms are compared in running time and optimum solutions. Furthermore by performing a real-world instance in Beijing-Shanghai corridor, we prove that the improved heuristic algorithm we proposed is superior to other algorithms not only in running time but also in optimal solution.
Further research can focus on the following two aspects. (a) This study only considers transfer passenger demand at Beijing station and Shanghai station. In reality, the passenger demand is also related to stations along the road. Thus, with a series of passenger demand data, how to generate the robust and reliable passenger demand prediction model is a significant topic in the further research. (b) Here we only adopt three typical basic heuristic algorithms to solve the problem. More effective and improved heuristic algorithms will also be studied in our future research.

A. Parameter settings in numerical experiment
As shown in Fig. 10, we set Beijing municipal government as origin and Shanghai municipal government as destination, and then, the total expenses and time of different transport modes in Beijing-Shanghai corridor are shown in Table 6. Among them, the data of travel expenses and travel time for different transport modes are from Qunar.com and 12306.com, and the data of transfer expenses and time for different transfer modes are from Baidu Map. Due to the large number of certain transport modes, the range of airplane arrival time is from 10:00 AM to 6:00 PM, the range of G-train arrival time is from 10:00 AM to 6:00 PM, and the range of D-train arrival time is from 7:00 AM to 10:00 PM the next day.
And then total travel expenses and total travel time of different combination results, and proposal probabilities (i.e., prior probabilities subjectively based on Sect. 2.1) of different (low-, medium-and high-) income passenger groups are shown in Table 7.