A trust region based local Bayesian optimization without exhaustive optimization of the acquisition function

Bayesian optimization (BO) is an effective technique for solving expensive black-box problems. Despite its remarkable success, BO has obvious drawbacks. First, the time complexity of Gaussian process inference is O(n³), where n is the number of samples, so the running time of BO increases rapidly with the number of evaluations. Second, because the acquisition function is non-convex and multimodal, optimizing it to a good solution is costly. To address these problems, we develop a local Bayesian optimization algorithm based on the trust region idea (TRLBO). In TRLBO, two trust regions with dynamically changing sizes are used to enhance the algorithm's exploitation ability while retaining its exploration ability. Specifically, one trust region is used to reduce the number of samples in the Gaussian process; the other is used to restrict the solution space of the candidates. Furthermore, theoretical results are provided to illustrate the efficiency of the proposed algorithm. Experimental results on both benchmark functions and real-world problems show that TRLBO compares favorably with state-of-the-art algorithms.


Introduction
Many problems encountered in scientific research and engineering are expensive and non-convex, and their mathematical formulations and gradient information are usually unavailable. For example, one common problem in machine learning is to find the optimal parameters of a neural network so that it performs well on a certain task. The problem is posed as an optimization problem, and its mathematical formulation is difficult to construct. In addition, evaluating one set of parameter values is very time-consuming, so traditional optimization algorithms are not suitable for this type of problem. In comparison, Bayesian optimization (BO) is specifically designed for black-box function optimization and is very effective. In recent years, BO has been successfully applied to interactive user interfaces, robotics, environmental monitoring, information retrieval, combinatorial optimization, automatic machine learning, sensor networks, adaptive Monte Carlo, experimental design, reinforcement learning, and so on (Shahriari et al. 2016).
The success of Bayesian optimization is attributed to two components: the probabilistic surrogate that models the objective function, and the acquisition function that balances exploration and exploitation. The Gaussian process is currently the default choice of surrogate, and the time complexity of Gaussian process inference is O(n³), where n is the number of samples. Therefore, Bayesian optimization is mainly used to solve low-dimensional problems. However, recent years have witnessed rapid growth in the computational power of GPUs. A more efficient algorithm that reduces the time complexity to O(n²) has been developed by leveraging the computational power of GPUs (Gardner et al. 2018). This development makes it possible for BO to tackle high-dimensional problems.
However, there are still challenges to overcome before BO can be successfully extended to high-dimensional problems. First, due to the curse of dimensionality, the search space grows exponentially as the dimensionality of the objective function increases. Moreover, Brochu et al. (2007) found that the tendency toward exploration is more salient when BO handles higher-dimensional problems, which leads to poor performance. This is the major barrier that limits the applicability of BO. In the literature, many attempts have been made to extend BO to high-dimensional problems. Assuming that the objective function is D-dimensional and that there are only d active variables, where d ≪ D, Chen et al. (2012) used a hierarchical diagonal sampling method to perform both variable selection and objective optimization. Rolland et al. (2018) decomposed a high-dimensional objective function into multiple low-dimensional functions and used a graph to represent the dependencies between the low-dimensional functions. An efficient information transfer algorithm was developed to optimize the acquisition function, and Gibbs sampling was employed to learn the structure of the dependency graph. Following the work of Rolland, Mutný and Krause (2018) proposed a quadrature Fourier features method to approximate the squared exponential covariance function. It is proven that the approximation error decreases exponentially as the number of features increases.
Besides the works mentioned above, there exist other approaches to extending BO to high-dimensional spaces (Snoek et al. 2015a; Lakshminarayanan et al. 2016; Binois et al. 2015, 2020; Gardner et al. 2017; Kandasamy et al. 2015; Nayebi et al. 2019; Wang et al. 2016, 2018a). Most of these methods assume that the objective function has some additive structure. Different additive structures are then adopted to train different Gaussian processes, so drawing a large number of samples becomes very time-consuming. Moreover, another obstacle in applying BO to high-dimensional problems is that the acquisition function is non-convex and difficult to optimize. Some researchers tried to circumvent this obstacle by making use of stochastic feature approximation (Snoek et al. 2015b). In contrast, McIntire et al. (2016) used a sparse Gaussian process as the probabilistic surrogate. More recently, Wang et al. (2018a) proposed an ensemble Bayesian optimization algorithm (EBO) that addresses the problems of drawing a large number of samples, handling the increasing dimensionality of objective functions, and balancing the diversity and accuracy of samples. Wang et al. (2018a) also revealed the relationship between EBO and evolutionary algorithms (EAs), which brings innovation to algorithm analysis. Eriksson et al. (2019) developed a trust region local Bayesian optimization algorithm (TuRBO). The numerical results reported in Eriksson et al. (2019) show that TuRBO reaches state-of-the-art performance when tackling high-dimensional problems, in which case a large number of samples is required. This paper follows this promising research avenue and makes further progress. The differences between our approach and the existing work are as follows.
(1) During the iterations of TuRBO, all the sample-observation data is used to train the Gaussian process. In our approach, we extract part of the data to train the Gaussian process. (2) Thompson sampling is adopted in TuRBO as the acquisition function. In our approach, the Gaussian process upper confidence bound (GP-UCB) is used instead. (3) In TuRBO, one trust region is constructed to control the size of the samples' solution space. In our approach, the solution space size and the number of samples in the Gaussian process are controlled by two trust regions. (4) We use a simple yet efficient way to extend GP-UCB so that samples can be drawn in batches.
The rest of the paper is organized as follows. Section 2 gives a brief introduction to the Bayesian optimization algorithm, as well as the trust region method. The proposed TRLBO algorithm is described in detail in Sect. 3. Section 4 is devoted to the algorithmic analysis of TRLBO. Numerical experiments are conducted in Sect. 5. Finally, Sect. 6 discusses the remaining issues and concludes the paper.

Bayesian optimization
Consider an optimization problem: min f(x), x ∈ Ω ⊂ ℝ^D, where Ω is a compact subset of ℝ^D and f : Ω → ℝ is a deterministic black-box function whose mathematical formulation, gradient information, and properties (e.g., convexity) are unknown. Moreover, evaluating the function is very expensive. Without making any assumptions about the objective function, the only way to find the optimum is brute-force search. It may happen that only small changes are made to the decision variables, yet the objective function value changes dramatically, e.g., the Dirichlet function. In practice, we often introduce several assumptions to make the optimization problem more tractable. One common assumption is Lipschitz continuity: if for any x ∈ P, y ∈ P, there exists a constant L such that ‖f(x) − f(y)‖ ≤ L‖x − y‖, then f is termed Lipschitz continuous in the region P. DIRECT (Jones et al. 1993) is an algorithm built upon this assumption. DIRECT repeatedly divides the solution space and abandons subspaces that cannot contain the optimum. In this way, the search space is gradually reduced, and the algorithm can eventually find the optimum. However, in the process of reducing the search space, numerous function evaluations are required. Therefore, the algorithm is not suitable for the above scenario. Different from DIRECT, BO is a stochastic optimization technique. Although it cannot guarantee finding the global optimum in each run, it is able to find a good solution with a relatively small number of function evaluations. A more detailed description of BO can be found in Shahriari et al. (2016) and Brochu et al. (2010). BO is composed of a probabilistic surrogate model and an acquisition function. In the following, we briefly introduce these two components.
The probabilistic surrogate model is used for modeling the objective function. One of the most common surrogate models is the Gaussian process, which is also used in our study. A Gaussian process GP(μ(x), k(x, x′)) is a non-parametric model determined by its mean function μ(x) and covariance function k(x, x′). We assume that the objective function is sampled from a Gaussian process with zero mean, GP(0, k(x, x′)), and that the observations are y_t = f(x_t) + ε_t, where the ε_t are independent identically distributed noise variables drawn from a Gaussian distribution, namely ε_t ∼ N(0, σ²). According to the properties of the Gaussian distribution, it is easy to deduce that the posterior distribution is also Gaussian, with mean and covariance formulated as:

μ_n(x) = k_n(x)ᵀ (K_n + σ²I)⁻¹ y_n,
k_n(x, x′) = k(x, x′) − k_n(x)ᵀ (K_n + σ²I)⁻¹ k_n(x′),

where k_n(x) = [k(x, x_1), …, k(x, x_n)]ᵀ, K_n = [k(x_i, x_j)]ᵢ,ⱼ, and y_n = [y_1, …, y_n]ᵀ. Please refer to Rasmussen and Williams (2009) for more information about the Gaussian process. Modeling an objective function with a Gaussian process involves a matrix inversion, which has a time complexity of O(n³). This precludes the application of BO to scenarios where large numbers of samples are required. In the Gaussian process, the covariance function determines the smoothness of the sampled functions, as well as the type of functions it can fit. To increase the applicability of the Gaussian process, a common practice is to add hyperparameters to the covariance function.
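As a concrete illustration of the posterior equations above, the following minimal sketch computes the GP posterior mean and variance with a squared-exponential kernel. The kernel choice, unit prior variance, and noise level are assumptions for the example, not the paper's implementation; the O(n³) cost shows up in the n × n linear solve.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, X_star, noise=1e-6, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at test points X_star.

    mu(x*)      = k(x*, X) [K + sigma^2 I]^{-1} y
    sigma^2(x*) = k(x*, x*) - k(x*, X) [K + sigma^2 I]^{-1} k(X, x*)
    """
    K = sq_exp_kernel(X, X, length_scale) + noise * np.eye(len(X))
    K_s = sq_exp_kernel(X_star, X, length_scale)
    alpha = np.linalg.solve(K, y)              # the O(n^3) step
    mu = K_s @ alpha
    v = np.linalg.solve(K, K_s.T)
    var = 1.0 - np.einsum("ij,ji->i", K_s, v)  # prior variance is 1 here
    return mu, var
```

At an observed point the posterior mean reproduces the observation and the variance collapses toward the noise level, while far from the data the variance reverts to the prior.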
During the training process, the hyperparameters can be tuned to bring the surrogate model closer to the real objective function. One of the most important hyperparameters is the length-scale parameter. It is a vector that shares the same dimensionality as the input of the objective function, and it determines the relative importance of the input variables. In our study, the length-scale parameter is also used to determine the trust region that bounds the solution space of the acquisition function. Some commonly used covariance functions are the exponential covariance function, the squared exponential covariance function, and the Matérn covariance function.
The acquisition function is adopted to guide the selection of promising points for evaluation. Different types of acquisition functions are available, e.g., the probability of improvement (PI) (Kushner 1964), expected improvement (EI) (Mockus 1977), Thompson sampling (Thompson 1933), entropy search (Villemonteix et al. 2009), and the upper confidence bound (Kocsis and Szepesvari 2006). Although its formulation is very simple, GP-UCB can effectively strike a balance between exploration and exploitation. For the minimization setting considered in this paper, GP-UCB is formulated as follows:

α_ucb(x) = μ_n(x) − √β_n σ_n(x),

where μ_n(x) and σ_n(x) denote the mean and standard deviation of the posterior distribution, respectively. The hyperparameter β_n is used to balance exploration and exploitation. Intuitively, the algorithm tends to explore when β_n is large; conversely, the algorithm tends to exploit when β_n is small.
In Bayesian optimization, we optimize the acquisition function to generate new evaluation points. Such optimization methods include DIRECT, grid search, particle swarm optimization (Beldjilali et al. 2020; Laím et al. 2022; Pozna et al. 2022), etc. In real-world applications, we hope that multiple promising points can be evaluated simultaneously in each iteration so as to fully use the available computational power. Many studies have been conducted along this avenue (Contal et al. 2013; Desautels et al. 2012; Ginsbourger et al. 2010; Snoek et al. 2012; Marmin et al. 2015; Shah and Ghahramani 2015; Wang et al. 2020). In this paper, to endow GP-UCB with the capability of batch sampling, we discard the exhaustive optimization of the acquisition function and adopt a random search instead. For the hyperparameter β_n of GP-UCB, Srinivas et al. derived its exact expression and proved its effectiveness from a theoretical perspective. However, that result is based on the premise that GP-UCB is optimized by an exact approach, which differs from the situation we face. To set β_n to a value that effectively balances exploration and exploitation, we normalize μ_n(x) and σ_n(x). The details are presented in Sect. 3.

Trust region algorithm
Generally speaking, it is very difficult to fit a non-convex function globally. However, if we concentrate on a local region of the objective function at a time, it is easy to fit that region with a linear or quadratic function. The basic idea of the trust region algorithm is to use an approximation model to fit the objective function locally. The accuracy of the approximation model in the local region can be trusted, and therefore the local region is termed the "trust region". The trust region is usually a sphere or polytope centered at the current best solution, and its size is dynamically adjusted in each iteration. In short, the trust region algorithm uses a merit function to evaluate the fitting performance of the approximation model. If the approximation model fits the objective function well, the trust region grows; conversely, it shrinks. Instead of directly optimizing the objective function, the principle of the trust region algorithm is to optimize a simple local approximation model of the objective function. Quadratic approximation is a widely used approximation model. Using quadratic approximation, the original problem is reduced to the following constrained optimization problem:

min_s  f_k + g_kᵀ s + (1/2) sᵀ B_k s,   s.t. ‖s‖ ≤ Δ_k,
where f_k, g_k, and B_k denote the objective function value, the gradient, and the Hessian approximation matrix at the point x_k, respectively. The notation Δ_k represents the radius of the trust region. The objective function of the transformed problem is a convex quadratic function (when B_k is positive semidefinite) and is easy to optimize. Assuming that s_k is the solution of the transformed problem, we use the following merit function r_k to determine the point x_{k+1} and the trust region radius Δ_{k+1} of the next iteration:

r_k = (f(x_k) − f(x_k + s_k)) / (m_k(0) − m_k(s_k)),

where m_k denotes the quadratic model above, so that r_k is the ratio of the actual decrease to the decrease predicted by the model.
The trust region algorithm repeats the above process until it meets a predefined termination criterion. The algorithm has good convergence properties. For a detailed description of the trust region algorithm, please refer to Conn et al. (1997). As in TuRBO (Eriksson et al. 2019), we use the Gaussian process to fit the objective function locally. However, instead of using all the data to train the Gaussian process, we use only the part of the data that lies within the trust region.
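The generic accept/reject step described above can be sketched as follows. The acceptance threshold (0.25), growth threshold (0.75), and scaling factors are conventional textbook choices, not values taken from this paper:

```python
def trust_region_step(f, x, s, pred_decrease, delta,
                      eta=0.25, shrink=0.5, grow=2.0, delta_max=10.0):
    """One accept/reject step of a generic trust-region method.

    r = (actual decrease) / (predicted decrease); the model is trusted
    when r is close to 1, so the region grows; otherwise it shrinks.
    """
    actual_decrease = f(x) - f(x + s)
    r = actual_decrease / pred_decrease
    if r < eta:                   # poor model fit: reject step, shrink region
        return x, shrink * delta
    x_new = x + s                 # acceptable fit: accept the step
    if r > 0.75:                  # very good fit: enlarge the region
        delta = min(grow * delta, delta_max)
    return x_new, delta
```

For f(x) = x², a step s = −1 from x = 2 has an exact quadratic model (predicted decrease 3), so r = 1 and the step is accepted while the radius doubles.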

Trust region based local Bayesian optimization (TRLBO)
In Eriksson et al. (2019), the authors successfully introduced the trust region into Bayesian optimization and achieved good experimental results, but our analysis found that there is still room for improvement. First, the main idea of the trust region is that the local fit of the objective function can be trusted with high confidence. We therefore abandon training the Gaussian process with all the data in each iteration, and instead train it only on the sample points within the trust region; this corresponds to our local Gaussian process. In addition, the acquisition function is mainly used to guide the selection of the next evaluation point, and it reflects how well the local Gaussian process fits the objective function, so the Gaussian process upper confidence bound is the most natural choice. Finally, in order to reduce the cost of optimizing the acquisition function, we choose random sampling to optimize it; as the algorithm progresses and the trust region continues to shrink, the optimization error caused by random sampling becomes smaller and smaller.
In this section, the proposed TRLBO is described in detail. As in BO, our algorithm is composed of a surrogate model and an acquisition function. In TRLBO, a local Gaussian process serves as the surrogate model, while a local GP-UCB serves as the acquisition function. Each component is associated with a trust region, and the sizes of the two trust regions are controlled by the same scale parameter. The two regions differ in shape: the one associated with the local Gaussian process is a ball, while the one associated with the local GP-UCB is a hyper-rectangle. The details are presented in the following subsections.

Local Gaussian process
The surrogate model plays an important role in Bayesian optimization. A suitable surrogate model can effectively fit the objective function and contribute to the good performance of the algorithm. Conversely, the algorithm will perform poorly if the surrogate model differs significantly from the objective function. The Gaussian process is used for modeling functions, and it has a nice property: the joint, marginal, and posterior distributions of any finite set of variables are all Gaussian. Therefore, the Gaussian process is the most common surrogate model in BO and has achieved good results in real-world applications. The performance of BO is largely attributed to a global Gaussian process that accurately fits the objective function. When handling low-dimensional problems, a small number of samples is sufficient to train the global Gaussian process. However, when handling high-dimensional problems, a larger number of samples is required, which lowers the efficiency of the algorithm. Inspired by the trust region algorithm, we abandon the attempt to fit the entire objective function and instead fit a local region; this approach is similar to TuRBO (Eriksson et al. 2019). The major difference is that TuRBO uses the entire data set to train the Gaussian process, and restricts the solution space to a trust region only when optimizing the acquisition function. In contrast, we extract part of the data to train the local Gaussian process. Specifically, denote the observed data as D_n = {(x_t, y_t)}_{t=1}^n, the scale parameter of the trust region as ℓ, and the current best solution as x_opt, where the superscript n represents the nth iteration of the algorithm. We use the following formula to extract the point set D′_n from D_n to train the local Gaussian process:

D′_n = {(x_t, y_t) ∈ D_n : ‖x_t − x_opt‖ ≤ ℓδ}.
In the formula, δ = max_i λ_i, i ∈ {1, 2, …, d}, where λ_i is the ith length-scale hyperparameter of the Gaussian process and d is the dimension of the objective function input. The parameter δ is used to guarantee that the trust region associated with the local Gaussian process covers the trust region associated with the acquisition function. The scale parameter ℓ of the trust region gradually decreases as the number of iterations grows. Therefore, during the training process, the number of samples used by the local Gaussian process is much smaller than that used by a global Gaussian process. In this way, the running time is greatly reduced. In addition, according to our experiments, the local Gaussian process has similar or even better fitting quality than the global Gaussian process.
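The extraction step can be sketched as follows, assuming (as in the formula above) that the local trust region is a ball of radius ℓ · max_i λ_i around x_opt; the function names are illustrative:

```python
import numpy as np

def extract_local_data(X, y, x_opt, ell, lam):
    """Keep only the samples inside the ball trust region around x_opt.

    X   : (n, d) observed inputs;  y : (n,) observations
    ell : current trust-region scale parameter
    lam : (d,) length-scale hyperparameters of the Gaussian process
    """
    radius = ell * np.max(lam)      # delta = max_i lambda_i
    mask = np.linalg.norm(X - x_opt, axis=1) <= radius
    return X[mask], y[mask]
```

Training on the extracted subset instead of all n samples is what cuts the cubic inference cost of the Gaussian process.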

Local GP-UCB
In the previous subsection, by employing the trust region method, we applied the local Gaussian process to the local fitting of the objective function. To force the algorithm to pay more attention to local exploitation, we also need to restrict the solution space of the newly generated samples. In other words, we hope that the next batch of samples is generated near the current best solution. This is the effect of the trust region associated with the acquisition function. The approach used in Eriksson et al. (2019) is adopted here to determine this trust region. Let ℓ denote the scale parameter of the trust region, λ denote the length-scale hyperparameter vector of the local Gaussian process, and x_opt denote the current best solution. Then, the trust region is a hyper-rectangle centered at x_opt with volume ℓ^d. For each dimension i, the side length of the hyper-rectangle is computed as

L^(i) = λ_i ℓ / (∏_{j=1}^d λ_j)^{1/d},

where d is the dimension of the input variables of the objective function. We have thus restricted the solution space of the new samples to the hyper-rectangle centered at the current best solution; let Ω denote this solution space. We now describe the method used to generate the samples for evaluation. The GP-UCB formulated in (2.3) is used in our algorithm as the acquisition function. The original Bayesian optimization algorithm equipped with GP-UCB generates only one candidate in each iteration. We use a simple method to extend the algorithm so that multiple candidate points can be generated at a time. Specifically, we generate a point set {x_i}_{i=1}^N by uniformly sampling points from the solution space Ω of the acquisition function α_ucb(x; D′_n). Then, we substitute each point into the mean μ_n(x) and standard deviation σ_n(x) of the posterior distribution, normalize them, and obtain the acquisition function value

α_ucb(x_i) = μ̄_n(x_i) − β σ̄_n(x_i),

where μ̄_n and σ̄_n denote the normalized mean and standard deviation.
In order to generate m (m > 1) candidate points at a time, we sort {α_ucb(x_i)}_{i=1}^N in ascending order and select the first m acquisition function values. The points (represented by the input variables x_i) corresponding to these m values are collected to form the point set {x_i}_{i=1}^m used for the next evaluation. By uniformly sampling points in the restricted solution space of the acquisition function, we not only avoid optimizing the complex acquisition function, but also generate a set of samples without much computational cost. The theoretical analysis of the method is given in Sect. 4.
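The batch-selection procedure above can be sketched as follows. The value of β, the min-max normalization, and passing the posterior mean and standard deviation as callables are simplifying assumptions for the example:

```python
import numpy as np

def propose_batch(mu_fn, sigma_fn, x_opt, side_lengths, m=10,
                  n_samples=1000, beta=1.0, rng=None):
    """Uniformly sample candidates in the hyper-rectangle around x_opt,
    score them with a confidence-bound acquisition for minimization,
    and return the m best."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x_opt)
    cand = x_opt + (rng.random((n_samples, d)) - 0.5) * side_lengths
    mu, sigma = mu_fn(cand), sigma_fn(cand)
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    acq = norm(mu) - beta * norm(sigma)  # small mean / large variance preferred
    best = np.argsort(acq)[:m]           # ascending order, take the first m
    return cand[best]
```

With a bowl-shaped posterior mean and constant uncertainty, the selected batch clusters around the current best solution, which is exactly the intended exploitation behavior.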

Outline of TRLBO
In the previous two subsections, we introduced trust regions into the probabilistic surrogate model and the acquisition function of BO. However, there is still one important step to specify before completing our algorithm, namely the adjustment strategy of the trust regions. It is worth noting that the trust regions associated with the local Gaussian process and the acquisition function are controlled by the same scale parameter ℓ ∈ [ℓ_min, ℓ_max]. The primitive trust region algorithm adjusts the trust region according to the merit function formulated in (2.6), which represents the ratio between the actual decrement and the decrement predicted by the model. In contrast, we use a stochastic model to locally approximate the objective function. Therefore, the adjustment strategy in Eriksson et al. (2019) is adopted in TRLBO. In each iteration, if one of the samples is better than the current best solution, we record a successful trial; otherwise, a failed trial is recorded. We define two threshold values τ_fail and τ_succ to trigger the process of adjusting the size of the trust regions. If the number of successive failed trials reaches τ_fail, we shrink the trust region by half: ℓ ← ℓ/2. Similarly, if the number of successive successful trials reaches τ_succ, we enlarge the trust region by a factor of two: ℓ ← min(2ℓ, ℓ_max). The min operator guarantees that ℓ does not exceed the upper limit. Finally, the procedure of our algorithm is summarized as follows.
1: Generate the initial sample points, evaluate them, and initialize the data set D and the trust region length ℓ. 2: Train the local Gaussian process (Algorithm 1). 3: Generate m evaluation points (Algorithm 2), calculate their objective function values {y_i}_{i=1}^m, and add the new observations to the data set: D ← D ∪ {(x_i, y_i)}_{i=1}^m. 4: Adjust the trust region length ℓ according to the optimization results of Step 3; if ℓ ≤ ℓ_min, terminate the algorithm; otherwise, go to Step 2.
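The adjustment rule in Step 4 can be sketched as follows; the counter thresholds τ_succ and τ_fail and the bounds on ℓ are placeholder values, not the paper's settings:

```python
def adjust_trust_region(ell, n_succ, n_fail, tau_succ=3, tau_fail=5,
                        ell_min=2 ** -7, ell_max=1.6):
    """Halve ell after tau_fail successive failures; double it (capped at
    ell_max) after tau_succ successive successes. The counter that fired
    is reset. The last return value signals termination (ell <= ell_min)."""
    if n_fail >= tau_fail:
        ell, n_fail = ell / 2, 0
    elif n_succ >= tau_succ:
        ell, n_succ = min(2 * ell, ell_max), 0
    return ell, n_succ, n_fail, ell <= ell_min
```

The driver loop of TRLBO would call this once per batch, incrementing n_succ when a new best solution is found and n_fail otherwise.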

Exploration vs. exploitation
The trust region algorithm is a local optimization algorithm. Although it cannot guarantee finding the global optimum, it has a global convergence property. We introduce the main idea of the trust region algorithm into BO. From a theoretical point of view, our algorithm possesses the local optimization property. In each iteration of our algorithm, it is easy to infer from the trust region adjustment strategy that the trust region length gradually decreases. This property forces our algorithm to pay more attention to local exploitation. Overall, we have deduced several theoretical results. For the sake of illustration, we first introduce some relevant notation. We map the domain of the objective function to the space Ω = [0, 1]^d. The current best solution of the algorithm is denoted by x_opt, and the scale parameter of the trust region is denoted by ℓ ≤ 1. λ is the length-scale hyperparameter vector of the Gaussian process, and D is the observed data set. Moreover, we define the following quantities; in particular, Ω_gp denotes the set of observed points that lie inside the ball trust region used for training the local Gaussian process.
Proposition 1 Let E_y denote the event Ω_gp ≠ ∅ and E_x denote the event x_i ∈ Ω_gp; we bound the probability of E_x conditioned on E_y as follows.

Proof Note that the trust region associated with the local Gaussian process is a sphere centered at the current best solution x_opt with radius ℓ. Its volume is computed as V = π^{d/2} ℓ^d / Γ(1 + d/2). The trust region associated with the acquisition function is a hyper-rectangle with side lengths L^(i), where d is the dimension of the objective function. Note also that the domain of the objective function is [0, 1]^d; when L^(i) ≥ 1, we take L^(i) = 1. Let A = {1, 2, …, d}. We discuss two different cases.
(1) For every i ∈ A, L^(i) < 1. In this case, the volume of the hyper-rectangle is ℓ^d, and we assume that the observed point set Ω_gp used for training the local Gaussian process is not empty (there is no need to train if it is empty). Furthermore, for simplicity, we assume that the sample points not in Ω_gp are uniformly distributed in the space Ω (the real case is more complex). The probability that there exists an observed point in the hyper-rectangle but not in Ω_gp is then governed by the volume ℓ^d, and it is obvious that this probability tends to 0 as ℓ → 0.

(2) When the assumption of case (1) leads to a contradiction, we can assume that there exists a non-empty set B ⊂ {1, 2, …, d} such that L^(i) = 1 for every i ∈ B. At this point, the volume of the hyper-rectangle can be calculated as the product of the remaining side lengths, ∏_{i∉B} L^(i).

It is easy to see that this volume also tends to 0 as ℓ → 0, which completes the proof.
According to the corollary, as the scale parameter of the trust region decreases, our algorithm behaves similarly to a primitive Bayesian optimization algorithm whose solution space shrinks iteration by iteration. This property forces our algorithm to pay more attention to local exploitation. Conversely, when the scale parameter of the trust region is not so small, the solution space does not entirely overlap with the trust region used for determining the training data of the local Gaussian process. This guarantees the exploration ability of the algorithm.

Effect of using random search to optimize the acquisition function
From a theoretical point of view, it is very difficult to find the global minimum (maximum) of a non-convex function. Therefore, in practice, we change our goal to finding a suboptimal solution. Some common algorithms for optimizing the acquisition function are DIRECT, the multi-start quasi-Newton hill-climbing approach, CMA-ES, and multi-start local search. In our algorithm, we use random search to optimize the acquisition function. In the following, we analyze the efficiency of this approach, starting with some basic definitions and illustrations.
Definition 1 Let x_min = argmin f(x), x ∈ Ω, where Ω is a compact subset of ℝ^n. A point x is termed an ϵ-optimal solution if it satisfies the following condition: there exists a hyper-cube A ⊂ Ω with side length ϵ such that x ∈ A and x_min ∈ A; that is, x_min and x lie in the same grid cell. Now we regard the scale parameter of the trust region as a random variable L, whose realization is ℓ. Let d denote the dimension of the input variables of the objective function. We divide the solution space of the new samples into k = ⌈ℓ^d / ϵ^d⌉ grid cells of size ϵ^d. When ϵ is sufficiently small, the dividing error can be omitted. Moreover, when using random search to optimize the acquisition function, the number of samples is set to 100d; we use m to represent this number. Supposing E_y is the event that an ϵ-optimal solution is obtained by the sampling method, we have

P(E_y | L = ℓ) = 1 − (1 − ϵ^d/ℓ^d)^m.

This probability is very small, but P(E_y | L = ℓ) increases as ℓ → ϵ. Next, we analyze the expected number of iterations the algorithm requires to find a local optimal solution or to successively find solutions better than the current best one. We first assume that once ℓ is reduced to a threshold value, say ϵ, the algorithm has found a local optimal solution. Otherwise, if the current solution is not a local optimum, then when ℓ is sufficiently small, the probability that a single random sample finds a better solution is p. With 100d sample points, the probability of finding a better solution becomes q = 1 − (1 − p)^{100d}. It can be inferred from the expression of q that when p = 0.01 and d = 10, q exceeds 0.99. Therefore, if no better solution is found, we can regard the current solution as a local optimum.
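Under the grid model above, the chance that at least one of m uniform samples lands in the cell containing x_min is 1 − (1 − ϵ^d/ℓ^d)^m. A quick Monte Carlo sketch, with toy values of ϵ, ℓ, and d chosen for illustration, confirms the closed form and shows how small the probability is in high dimension:

```python
import random

def hit_probability(eps, ell, d, m, trials=20000, seed=1):
    """Estimate P(some sample falls in the target cell) by simulation.
    Each of the m samples independently hits the cell w.p. (eps/ell)**d."""
    p_cell = (eps / ell) ** d
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < p_cell for _ in range(m))
        for _ in range(trials)
    )
    return hits / trials

# closed form for comparison
closed = lambda eps, ell, d, m: 1 - (1 - (eps / ell) ** d) ** m
```

As ℓ shrinks toward ϵ, p_cell approaches 1, which is the mechanism by which the shrinking trust region makes random search increasingly effective.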
Note that each change of the scale parameter of the trust region either halves or doubles it, so ℓ takes values of the form 2^i ϵ. Therefore, when ℓ changes within the interval [ϵ, 1], its choices are very limited. We can regard it as a random variable taking values in a finite state space S = {0, 1, 2, …, n}; we denote this random variable as L_i. State i (i ∈ S) indicates the case ℓ = 2^i ϵ. When ϵ is sufficiently small, n is approximated as n = ⌈log₂(1/ϵ)⌉. For the convenience of analysis, we now model L_i as a random walk, and let P_{i,i+1} = p denote the probability of moving one step to the right. Starting from state j, the probability that L_i reaches state n before reaching state 0 is α(j). According to the properties of random walks, we have

α(j) = p α(j + 1) + (1 − p) α(j − 1).

It can be simplified to

α(j + 1) − α(j) = ((1 − p)/p) (α(j) − α(j − 1)).
Given that α(0) = 0, we have α(j + 1) − α(j) = ((1 − p)/p)^j α(1). Adding the above n − 1 equations and substituting α(n) = 1, we can deduce that, when p ≠ 1/2,

α(j) = (1 − ((1 − p)/p)^j) / (1 − ((1 − p)/p)^n),

and when p = 1/2, α(j) = j/n. Let B denote the number of steps required for L_i to start from state j and finally reach state 0 or n. We then compute the expected value of B, namely E[B | L₀ = j].
Let X_i ∈ {−1, 1}, i ≥ 1, denote the outcome of the ith change of L. For B we have the following. (1) When p ≠ 1/2:
According to Wald's equation, E[∑_{i=1}^B X_i] = E[B] · E[X_1]. Since the walk starting from state j ends at state n with probability α(j) and at state 0 otherwise, E[∑_{i=1}^B X_i] = α(j)n − j, and E[X_1] = 2p − 1. Substituting these into the equation, the expected value of B is given by

E[B | L₀ = j] = (α(j)n − j) / (2p − 1).

(2) When p = 1/2:
Let G(j) = E[B | L₀ = j]. It is obvious that G(0) = G(n) = 0, and considering one step of movement, we have

G(j) = 1 + (1/2) G(j + 1) + (1/2) G(j − 1),

which is an inhomogeneous linear difference equation. Solving the equation gives G(j) = j(n − j). Note that the maximum value of j(n − j) is n²/4 and n = ⌈log₂(1/ϵ)⌉. Even if ϵ takes a very small value, the value of n is acceptable. Refer to Ross (1985) and Lawler (2006) for more detailed descriptions of random walks.
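The random-walk results above, the absorption probability α(j) and, for p = 1/2, the expected absorption time E[B | L₀ = j] = j(n − j), can be checked with a small simulation; the values of n, j, and the trial count are illustrative:

```python
import random

def walk(j, n, p, rng):
    """Run the walk from state j until absorption at 0 or n.
    Returns (reached_n, number_of_steps)."""
    steps = 0
    while 0 < j < n:
        j += 1 if rng.random() < p else -1
        steps += 1
    return j == n, steps

def simulate(j, n, p, trials=20000, seed=7):
    """Monte Carlo estimates of alpha(j) and E[B | L0 = j]."""
    rng = random.Random(seed)
    wins, total_steps = 0, 0
    for _ in range(trials):
        hit_n, steps = walk(j, n, p, rng)
        wins += hit_n
        total_steps += steps
    return wins / trials, total_steps / trials
```

For p = 1/2 the estimates match α(j) = j/n and E[B] = j(n − j); for p ≠ 1/2 they match the closed form with the ratio (1 − p)/p.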
From the above analysis, it can be seen that even in the worst case, the number of iterations our algorithm needs to find a local optimum is acceptable. In addition, according to the experimental results, given the same number of iterations, the running time of our algorithm is shorter than that of algorithms performing exact optimization of the acquisition function.

Numerical experiment
In the experiments, three commonly used synthetic functions (i.e., Levy, Griewank, and Ackley) and two real-world problems (i.e., a 14D robot pushing problem and a 60D rover trajectory planning problem) are used to test the performance of the algorithms. For the synthetic functions, three different dimensionalities (i.e., 10, 20, and 50) are tested. The main part of our algorithm is built upon the framework of TuRBO (Eriksson et al. 2019). Besides the parameters of the algorithmic framework, we introduce two new parameters: the acquisition function parameter β of GP-UCB and the radius r associated with the local Gaussian process. Their values are set to √d and ℓ‖λ‖_∞ respectively, where d is the dimension of the input variables of the objective function and λ is the length-scale hyperparameter of the local Gaussian process. In addition, we generate new sample points in a batch manner with the batch size fixed at 10. The number of initial sample points is 20. See Eriksson et al. (2019) for more detailed descriptions of the other parameters. To evaluate the performance of our algorithm, we compare it with TuRBO1, TuRBOm, CMA-ES, and RS (random search). TuRBO1, proposed by Eriksson et al.
(2019), is a Bayesian optimization algorithm with a single trust region. The trust region is used to restrict the solution space of the acquisition function. TuRBOm extends TuRBO1 by increasing the number of trust regions to m. In our experiment, m = 5. CMA-ES is one of the most successful evolution strategy algorithms. It has exhibited good performance in real-world applications, especially when handling medium-scale complex optimization problems. CMA-ES is a random search algorithm. Its key idea is to use a multivariate Gaussian distribution in the search space to guide the search direction, and to adjust the mean and covariance matrix of the distribution in an adaptive manner. In the experiment, we use the CMA-ES implementation provided in the pycma library. Random search algorithms focus solely on exploration. They do not require any information about the objective function, have a wide range of applications, and perform very well on some types of problems. In the experiment, we uniformly sample batches of points in the solution space, and the batch size is set to 10. We run each algorithm 30 times on each test function. All source code is available at https://github.com/agier9/TRLBO.
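The random search baseline described above admits a very short implementation; a sketch assuming box constraints, with the batch size of 10 used in the experiments (the function and bound names are illustrative):

```python
import random

def random_search(f, bounds, n_batches, batch_size=10, seed=0):
    """Uniformly sample batches of points in the box-constrained
    solution space and track the best objective value found.
    `bounds` is a list of (low, high) pairs, one per dimension."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(n_batches):
        batch = [
            [rng.uniform(lo, hi) for lo, hi in bounds]
            for _ in range(batch_size)
        ]
        for x in batch:
            y = f(x)
            if y < best_y:
                best_x, best_y = x, y
    return best_x, best_y

# Toy usage on a 2D sphere function (minimum 0 at the origin).
x, y = random_search(lambda v: sum(t * t for t in v),
                     bounds=[(-5.0, 5.0)] * 2, n_batches=100)
print(y)  # best value found after 1000 uniform samples
```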
Synthetic functions When handling the synthetic functions with 10, 20, and 50 dimensions, the numbers of function evaluations are set to 1000, 2000, and 5000, respectively. From the convergence graphs (Figs. 1, 2, 3), TRLBO, TuRBO1, and TuRBOm perform much better than CMA-ES and RS. In addition, according to the convergence curve of RS, its performance is not satisfying; this also shows that the tendency towards exploration when BO is handling high-dimensional problems is the main reason for its poor performance. Figures 4, 5, 6 show the optima found by the algorithms over the 30 runs, and Figs. 7, 8, 9 present the corresponding box plots.
Robot pushing The problem is about controlling a robot to push two objects to their target positions. The moving trajectory of the robot is determined by a function with 14 parameters. The problem was first proposed by Wang et al. (2017). Here, we set the number of objective function evaluations to 10K. It can be observed from the convergence graph that TRLBO, TuRBO1, and TuRBOm are superior to CMA-ES and random search. It is surprising that the performance of CMA-ES is not as good as that of random search. Figures 10, 11, 12 give the corresponding results.
Rover trajectory planning This is a trajectory optimization problem in 2D, meant to emulate a rover navigation task. The objective function is non-smooth and discontinuous. More detailed descriptions of the problem can be found in Wang et al. (2018b). In our experiment, the number of objective function evaluations is set to 20K. From the convergence graph depicted in Fig. 5 (the top subgraph), it can be seen that TRLBO, TuRBO1, TuRBOm, and CMA-ES have similar performance, followed by random search. Figures 13, 14, 15 show the optima found by the algorithms over 30 runs, and the right subgraphs in Figs. 13, 14, 15 present the corresponding box plots.
Moreover, we record the best, worst, and mean results of the algorithms on all test problems over 30 runs. From the numerical results, when solving the majority of the synthetic functions, TRLBO is the best among all the compared algorithms. The exception is CMA-ES, which performs extremely well on the Griewank function: given a sufficient number of function evaluations, the solutions found by CMA-ES are near the global optimum. For the 10D, 20D, and 50D problems, the number of function evaluations is set to 1000, 2000, and 5000, respectively. The performance of CMA-ES on low-dimensional problems is not as good as on high-dimensional problems. In addition, when solving the robot pushing problem, TRLBO performs slightly better than TuRBO1 and TuRBOm. For the rover trajectory planning problem, TuRBO1 and TuRBOm perform the best, followed by TRLBO and CMA-ES. Random search is the worst among all the algorithms. The median performance of TRLBO is probably due to the non-smooth, discontinuous property of the objective function. The experimental results are listed in Tables 1, 2, 3, 4, 5, where the best results are marked in bold.
Judging from the convergence graphs, our algorithm has the best convergence speed in most experiments, especially when the objective function is continuous. This is mainly because we choose to fit the objective function locally: as long as the trust region size is appropriate, our local Gaussian process can fit the objective function well, so as the trust region size changes dynamically during the run, our algorithm maintains a strong ability to search for the optimal solution.
Of course, another factor that prevents the algorithm from falling into a local optimum is that we use random sampling to optimize the acquisition function, which gives the algorithm a certain degree of exploration ability.
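This random-sampling treatment of the acquisition function can be sketched as follows. The GP-UCB-style combination of the normalized mean and standard deviation (the `beta` term) is an assumption about how the normalized quantities are scored, and the posterior `mean`/`std` callables are placeholders for the local GP:

```python
import random

def propose_candidates(mean, std, bounds, m=10, n=500, beta=4.0, seed=0):
    """Approximately optimize the acquisition function by uniform
    sampling instead of exact optimization: draw n points from the
    box `bounds`, min-max normalize the posterior mean and standard
    deviation over the sample, score each point with a hypothetical
    GP-UCB-style value mean - sqrt(beta) * std (minimization), and
    return the m best-scoring points."""
    rng = random.Random(seed)
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    mu = [mean(x) for x in xs]
    sigma = [std(x) for x in xs]

    def normalize(vals):
        # Min-max normalization; guard against a zero span.
        lo, hi = min(vals), max(vals)
        span = hi - lo if hi > lo else 1.0
        return [(v - lo) / span for v in vals]

    mu_n, sigma_n = normalize(mu), normalize(sigma)
    scores = [a - beta ** 0.5 * b for a, b in zip(mu_n, sigma_n)]
    order = sorted(range(n), key=lambda i: scores[i])  # ascending
    return [xs[i] for i in order[:m]]
```

Because only n forward evaluations of the posterior are needed, the cost per batch is fixed, in contrast to an inner gradient-based optimization of the acquisition function.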

Conclusion
Bayesian optimization is an effective black-box optimization algorithm. It shows good performance when optimizing low-dimensional objective functions. However, extending BO to tackle high-dimensional problems is not an easy task. In this paper, we proposed a local Bayesian optimization algorithm based on the idea of the trust region. The algorithm incorporates a local Gaussian process and eliminates the need for exact optimization of the acquisition function. The experimental results on a number of test problems show that our algorithm has similar or even better performance than other Bayesian optimization algorithms. Due to limited computational resources, we did not test our algorithm on higher-dimensional problems. In the future, we intend to adopt multiple local Gaussian processes to fit different parts of the search space of the objective function. From the optimization point of view, the information contained in local data is not necessarily less than that in global data.

Algorithm 1: Local Gaussian process algorithm
Input: the observed data set; ℓ: the scale parameter of the trust region; the scaling factor; the current best point
1: Let n be the number of points in the observed data set; let the valid point set be empty
2: for i = 1 to n do
3:   if the i-th observed point lies within the trust region centered at the current best point, add it to the valid point set
4: end for
5: Use the updated valid point set to train the Gaussian process
Output: the posterior of the multivariate Gaussian distribution

We use max-min normalization to normalize the posterior mean and variance.

Algorithm 2: Candidate point generation algorithm
Input: the parameter controlling exploration; m: the batch size; n: the uniform sampling size
1: Generate n samples by uniformly sampling from the solution space Ω; A = ∅, B = ∅
2: for i = 1 to n do
3:   A = A ∪ {μ(x_i)}, B = B ∪ {σ(x_i)}, where μ(x_i) and σ(x_i) are the posterior mean and variance at the i-th sample
4: end for
5: Let L_A, U_A and L_B, U_B denote the minimal and maximum values of sets A and B, respectively; C = ∅
6: for i = 1 to n do: compute the normalized acquisition function value of the i-th sample and add it to C; end for
7: Sort C in ascending order and select the first m acquisition function values
Output: the m points corresponding to the m selected acquisition function values
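Algorithm 1's construction of the valid point set amounts to a simple distance filter; a minimal sketch, where the infinity-norm distance test and the helper names are assumptions, since the extracted pseudocode omits the loop body:

```python
def valid_points(data, x_best, radius):
    """Keep only the observations whose inputs lie within an
    infinity-norm ball of the given radius around the current best
    point; the retained subset is then used to train the local GP."""
    return [
        (x, y) for x, y in data
        if max(abs(a - b) for a, b in zip(x, x_best)) <= radius
    ]

# Toy data: the third point falls outside the trust region.
data = [([0.0, 0.0], 1.0), ([0.5, 0.2], 0.4), ([3.0, 0.1], 2.0)]
subset = valid_points(data, x_best=[0.0, 0.0], radius=1.0)
print(len(subset))  # 2
```

Restricting the training set this way is what keeps the O(n³) Gaussian process inference tractable as the total number of observations grows.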