Polynomial Convergence Rates of Piecewise Deterministic Markov Processes

We consider piecewise-deterministic Markov processes (PDMP), such as the Bouncy Particle sampler, on target densities with polynomial tails. Using direct drift condition methods, we provide bounds on the polynomial order of the processes' convergence to stationarity, on both one-dimensional and high-dimensional state spaces, in both total variation distance and f-norm.


Introduction
Markov chain Monte Carlo (MCMC) algorithms have become an indispensable part of statistical computation; see e.g. Brooks et al. (2011) and the many references therein. Piecewise-deterministic Markov processes (PDMP), such as the Bouncy Particle sampler (Bouchard-Côté et al. 2018) and the Zig-Zag algorithm, have emerged as a non-reversible alternative to traditional Metropolis-based MCMC. They are of great theoretical interest and also of some practical relevance; see e.g. Bierkens (2021) and the references therein. An important question about PDMP is their rate of convergence, i.e. how quickly they converge to their target stationary distribution. For sufficiently light-tailed targets, geometric ergodicity has been established under certain conditions (Deligiannidis et al. 2019). However, if the target distribution has tails which are heavier than exponential, then geometric ergodicity cannot hold.
In this paper, we instead focus on polynomial convergence rates of certain PDMP. That topic was previously approached using the concept of hypocoercivity in Andrieu et al. (2021a, b), but here we proceed using direct drift condition methods. We specifically consider the Bouncy Particle sampler (Bouchard-Côté et al. 2018), for a given target density π in ℝ^d. This PDMP has, at each time, a location x ∈ ℝ^d and a velocity v ∈ ℝ^d with |v| = 1. It proceeds primarily by deterministically moving x through ℝ^d at the fixed velocity v. It also reflects v along π's contour lines at hazard rate λ(x, v) = (−v · ∇ log π(x))₊. In addition, it refreshes at some specified hazard rate (which could depend on the current position x), at which point it replaces the velocity v by an independent draw from the uniform distribution Ψ on the unit sphere in ℝ^d. This process is known (Bouchard-Côté et al. 2018) to be irreducible with stationary density π, and to converge to π exponentially quickly for sufficiently light-tailed target densities π. This paper examines the polynomial convergence rate of this PDMP for target densities which are heavy-tailed. We first consider one-dimensional heavy-tailed targets (for which polynomial convergence rates of the Zig-Zag process were also considered in Vasdekis and Roberts 2021). For targets with tails comparable to a t-distribution with r degrees of freedom, we derive sharp bounds on polynomial convergence (Theorem 4). In particular, we prove that the polynomial convergence order in total variation distance is precisely r, in the sense that lim_{t→∞} t^a ‖P^t(x, ·) − π(·)‖_TV equals 0 for a < r and infinity for a > r. We also prove convergence in the V^{(1−η)p}-norm (see Section 3) at polynomial order approaching (1 − p)r, for any p ∈ [0, 1). We then consider high-dimensional PDMP, and compute their infinitesimal generator applied to an appropriate drift function (Theorem 5).
We specialise this generator computation to target densities with polynomial tails proportional to (1 + |x|²)^{−(r+d)∕2} (Corollary 7), and use this to derive specific bounds on their polynomial convergence rate (Theorem 8) in both total variation distance and f-norm. Our theorem shows that for r > (√(2π) − 1) d, the process converges in total variation distance at polynomial order approaching (r + d)√(2∕(πd)) − 1. This paper is organised as follows. In Section 2, we present some computer simulations to illustrate the convergence of PDMP to stationarity. In Section 3, we review general polynomial convergence rate bounds for continuous-time processes as in Fort and Roberts (2005), and present some corollaries adapting those results to our needs. In Section 4, we consider one-dimensional PDMP, and prove an exact characterisation (Theorem 4) of the polynomial convergence order in that case. In Section 5, we prove a general result (Theorem 5) which derives the infinitesimal generator of PDMP acting on certain choices of drift function, which we then apply to target densities with polynomial tails (Corollary 7). In Section 6, we apply these results to derive specific polynomial rate bounds for high-dimensional PDMP (Theorem 8). Finally, in Section 7, we present an auxiliary computation about expected values with respect to the refresh distribution Ψ, which is used in the proof of Theorem 8.

Computer Simulations of the PDMP
We begin by performing some computer simulations. Suppose the state space X is the one-dimensional real line ℝ, with C¹ target density π(x). In this case, the piecewise-deterministic Markov process (PDMP) enlarges the state space to X × {−1, 1}, and expands π to π(x, v) = (1∕2) π(x) for v ∈ {−1, 1}. It then proceeds by moving with fixed constant velocity v, except reflecting from v to −v with hazard rate λ(x, v) = (−v (log π)′(x))₊.
(We omit refreshes, i.e. take the refresh rate to be zero, since refreshes are not required in one dimension.) Simulating this process requires that we identify the reflection times, which arise in continuous time according to the hazard rate λ(x, v). This could be approximated by advancing time in small discrete increments, but the errors in such approximations are difficult to control. Instead, we proceed as follows. Suppose the process is currently at some state (X_t, V_t) at time t ≥ 0, and we wish to simulate the next T units of time. We first find some value M for which we must have λ(X_s, V_s) ≤ M for all t ≤ s ≤ t + T.
Then, we simulate a Poisson process with constant rate M for the next T time units. We then use "Poisson thinning" to proceed through those time points in order, accepting each with probability λ(x, v)∕M, until the first one is accepted and hence the next reflection time is identified. At that point, we discard the remaining Poisson time points, and continue the simulation anew from the identified reflection time. In this way, the reflection times are simulated accurately, without any discretisation error. (The R script that we used for our simulations is available for inspection at: probability.ca/Rpoly.) We first simulate this process where π(x) ∝ (1 + x²)⁻³ (so π is essentially a t-distribution with parameter r = 5). In this case, λ(x, v) := (−v (log π)′(x))₊ = (1 + r)(x v)₊∕(1 + x²), which is maximised at M := λ(1, 1) = (1 + r)∕2. (So, in this case λ(x, v) has a constant upper bound M, but in general M might depend on X_t and V_t and T.) A typical run of this process is shown in Fig. 1, starting with X₀ = 5 and V₀ = +1. We see that the process moves at constant velocity ±1, with reflections at appropriate random times to preserve stationarity.
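The thinning scheme just described can be sketched in a few lines of code. The following is a minimal Python sketch (not the authors' R script; all function names are ours), simulating the one-dimensional process for the t-type target with r = 5:

```python
import random

R = 5  # pi(x) ∝ (1 + x^2)^(-(1+R)/2): essentially a t-distribution with parameter r = 5

def rate(x, v):
    # reflection hazard: lambda(x, v) = (1+R) (x v)_+ / (1 + x^2)
    return max(0.0, (1 + R) * x * v / (1 + x * x))

M = (1 + R) / 2  # global bound on the hazard, attained at x = v = 1

def simulate(x, v, t_end, rng):
    """Run the 1-d bouncy particle process for t_end time units via Poisson thinning."""
    t = 0.0
    while True:
        dt = rng.expovariate(M)            # next candidate event of a rate-M Poisson process
        if t + dt >= t_end:
            return x + v * (t_end - t), v  # no more candidates: drift to the time horizon
        t += dt
        x += v * dt                        # deterministic motion at velocity v
        if rng.random() < rate(x, v) / M:  # thinning: accept with probability lambda/M
            v = -v                         # reflection

rng = random.Random(42)
samples = [simulate(0.0, 1.0, 20.0, rng)[0] for _ in range(2000)]
print(sum(samples) / len(samples))  # near 0, the stationary mean, by symmetry
```

Because the candidate times come from a genuine rate-M Poisson process and are accepted with probability λ(x, v)∕M, the accepted reflection times have exactly the hazard λ, with no discretisation error.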
To illustrate the convergence of this process X_t to its stationary distribution π, we consider (inspired by total variation distance, see next section) the expected values of functionals g : X → ℝ, specifically the difference between the expected value E[g(X_t)] at time t of our process, compared to the stationary expected value π(g) := E_π[g(X)]. By repeating the simulation a large number of times, we obtain a mean value and 95% confidence interval for E[g(X_t)] at various times t, for three different functionals g₁(x) = x, g₂(x) = 1_{x>0}, and g₃(x) = x². In each case, we compare E[g(X_t)] to the corresponding stationary expectation π(g) (equal to 0 and 1∕2 and 1∕3, respectively), at various times t. The results are shown in Fig. 2. The mean value of each of the three functionals when running the process (blue) converges quickly to its stationary value (red). This provides confirmation that our PDMP process is indeed converging to the correct distribution. But how quickly? In fact, the convergence rate of this process is heavily affected by the tail behaviour of the target distribution π(x). To illustrate this, consider a second example where the target π̄(x) ∝ e^{−x²∕2} corresponds to a standard Gaussian distribution. Then the tails of π(x) are much heavier than those of π̄(x). Its hazard rate is equal to λ̄(x, v) = (x v)₊. In particular, the process for π̄ will return to the origin much more quickly and consistently than for π, leading to much faster convergence. This is illustrated in Fig. 3, which shows ten runs of the process for π̄ (top) and for π (bottom) when started with X₀ = 10 and V₀ = +1, with the processes for π much more variable in their return times. The difference also arises in Fig. 4, which shows ten runs for each target, but this time started with X₀ = 1000 and V₀ = +1, i.e. much farther out in the tails, with the processes for π even more variable in their return times.

[Fig. 2 caption: The mean value (blue) and 95% confidence interval (green dotted) for E[g(X_t)] at various times t, for the three different functionals g₁(x) = x (top), g₂(x) = 1_{x>0} (middle), and g₃(x) = x² (bottom), compared to the corresponding stationary expectation (red), when running a PDMP for the Student's t-distribution.]
Due to the heavy polynomial tails of the Student's t-distribution π(x), the convergence to π cannot be exponentially quick, i.e. the process cannot be "geometrically" ergodic. But it might still be polynomially ergodic. To investigate that question, we next turn our attention to the theory of polynomial convergence rates of Markov processes.

Polynomial Convergence Rates of Markov Processes
Quantitative convergence rates of discrete-time geometrically ergodic Markov chains have a long history; see e.g. Rosenthal (2002) and the many references therein. More recently, focus has turned to polynomial ergodicity, e.g. Fort and Moulines (2000) and Jarner and Roberts (2002). Most of these results are in discrete time, but Fort and Roberts (2005) yields the following continuous-time polynomial convergence bound. To state it, recall that if μ and ν are two probability distributions on X, and f : X → (0, ∞), then the f-norm distance between μ and ν is defined as ‖μ − ν‖_f := sup_{g : |g| ≤ f} |μ(g) − ν(g)|, and the total variation distance between μ and ν is the special case f ≡ 1, i.e. ‖μ − ν‖_TV := sup_{g : |g| ≤ 1} |μ(g) − ν(g)|. Let P^t(x, ·) be the time-t transition probabilities. Also, recall that a continuous-time Markov process {X_t} has an infinitesimal generator A which acts on appropriate functions f : X → ℝ by A f(x) = lim_{t↘0} [E_x f(X_t) − f(x)] ∕ t (for background about generators see e.g. Ethier and Kurtz 1986). Then we have:

Proposition 1 Suppose a continuous-time Markov process on state space X ⊆ ℝ^d has stationary distribution π, and infinitesimal generator A, and there is η ∈ (0, 1) and c > 0 and b₀ < ∞ and a closed petite set C ⊆ X and a drift function V : X → [1, ∞) such that AV(x) ≤ −c V(x)^{1−η} + b₀ 1_C(x) for all x ∈ X. Then for any p ∈ [0, 1) and x ∈ X, lim_{t→∞} t^{(1−p)(1−η)∕η} ‖P^t(x, ·) − π(·)‖_{V^{(1−η)p}} = 0. In particular, taking p = 0, the process converges to stationarity in total variation distance at polynomial order (1 − η)∕η.
Proof This result follows from Corollary 6 of Fort and Roberts (2005) upon setting their α = 1 and c = c and b = b₀, and using that t ≤ 1 + t. (Note that the "b" in their convergence equation is different from the "b" in their drift equation (8), which we here refer to as b₀.) ▪

Corollary 2 Suppose a continuous-time compact-small Markov process on state space X ⊆ ℝ^d has stationary distribution π and infinitesimal generator A, with AV continuous and bounded on compact sets, and there is η ∈ (0, 1) and c > 0 and c₀ > 0 and Δ < ∞ and a continuous norm-like drift function V ≥ c₀ such that AV(x) ≤ −c V(x)^{1−η} whenever V(x) > Δ. Then, again, for any p ∈ [0, 1) and x ∈ X, lim_{t→∞} t^{(1−p)(1−η)∕η} ‖P^t(x, ·) − π(·)‖_{V^{(1−η)p}} = 0.

Proof First of all, by replacing V by V∕c₀ and c by c∕c₀ if necessary, we can assume that c₀ = 1. Then, let C = {x ∈ X : V(x) ≤ Δ}. This C is closed by continuity of V, and is bounded since V is norm-like, so C is compact. Hence, by the compact-small property, C is small, and hence also petite. Then b₀ := sup_{x∈C} AV(x) < ∞ since AV is bounded on compact sets. This result now follows from Proposition 1, by noting that if V(x) > Δ then AV(x) ≤ −c V(x)^{1−η}, while if x ∈ C then AV(x) ≤ b₀ ≤ −c V(x)^{1−η} + (b₀ + c Δ^{1−η}). ▪

Corollary 3 Suppose a continuous-time compact-small Markov process on state space X ⊆ ℝ^d has stationary distribution π and infinitesimal generator A, with AV continuous and bounded on compact sets, and there are α > 1 and c₀, c₁, c₂, c₃ > 0 and γ > 0 and a continuous norm-like drift function V with max(c₀, c₁|x|^γ) ≤ V(x) ≤ c₂|x|^{γα} for all large |x|, such that AV(x) ≤ −c₃|x|^{γ(α−1)} for all large |x|. Then for any p ∈ [0, 1) and x ∈ X, lim_{t→∞} t^{(1−p)(α−1)} ‖P^t(x, ·) − π(·)‖_{V^{(1−1∕α)p}} = 0, and in particular lim_{t→∞} t^{α−1} ‖P^t(x, ·) − π(·)‖_TV = 0.

Proof For all large |x| we have V(x)^{1−(1∕α)} ≤ (c₂|x|^{γα})^{1−(1∕α)} = c₂^{1−(1∕α)} |x|^{γ(α−1)}, so for all large |x|, AV(x) ≤ −c₃|x|^{γ(α−1)} ≤ −c V(x)^{1−(1∕α)}, where c = c₃∕(c₂)^{1−(1∕α)}. Hence, we can apply Corollary 2 with η = 1∕α ∈ (0, 1). The result then follows since (1 − η)∕η = (1 − 1∕α)∕(1∕α) = α − 1. ▪

Remark Although we focus here on the polynomial order of the convergence rates, using the above general polynomial bound results, it is also possible to use a similar approach to obtain actual quantitative (computable) bounds on the distance to stationarity of PDMP, similar in spirit to Rosenthal (2002) and the references therein, by using the related results of Fort and Moulines (2003).
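To build some intuition for why a drift exponent strictly below 1 yields polynomial (rather than exponential) behaviour: along the deterministic comparison ODE dV∕dt = −c V^{1−η}, we have V(t)^η = V(0)^η − cηt, so the level Δ is reached from V₀ only after time (V₀^η − Δ^η)∕(cη), which grows polynomially in V₀. The following short Python check of this elementary closed form is an illustration only, not part of the results above (the function names are ours):

```python
def hit_time_numeric(v0, delta, c, eta, dt=1e-4):
    # Euler-integrate dV/dt = -c * V^(1-eta) until V drops below delta
    v, t = v0, 0.0
    while v > delta:
        v -= dt * c * v ** (1.0 - eta)
        t += dt
    return t

def hit_time_exact(v0, delta, c, eta):
    # closed form from V(t)^eta = V(0)^eta - c*eta*t
    return (v0 ** eta - delta ** eta) / (c * eta)

print(hit_time_numeric(100.0, 1.0, 0.5, 0.25), hit_time_exact(100.0, 1.0, 0.5, 0.25))
```

The hitting time scales like V₀^η, which is the deterministic shadow of the polynomial moment bounds underlying Proposition 1.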

Convergence Rate in One Dimension
Suppose again that the state space X is the one-dimensional real line ℝ, with C¹ target density π(x). Again consider the algorithm which enlarges the state space to X × {−1, 1}, and expands π to π(x, v) = (1∕2) π(x) for v ∈ {−1, 1}, and moves with fixed constant velocity v, except reflecting from v to −v with hazard rate λ(x, v) = (−v (log π)′(x))₊, and with zero refresh rate. We know that if π has heavy tails, then this process cannot converge exponentially quickly. However, it might still converge polynomially quickly. Polynomial convergence rates for the related Zig-Zag process on one-dimensional heavy-tailed targets have been studied in Vasdekis and Roberts (2021). In this section, we present a result which gives precise polynomial convergence rates for the Bouncy Particle sampler, including a generalisation to f-norm convergence.
Consider now the specific example where π(x) ∝ (1 + x²)^{−(1+r)∕2} for some fixed constant r ≥ 1, at least when |x| ≥ Δ ≥ 1 (so π is essentially a Student's t-distribution). Then we have:

Theorem 4 The above one-dimensional PDMP converges to stationarity in total variation distance at polynomial rate equal to r. More precisely, for any x ∈ X, lim_{t→∞} t^a ‖P^t(x, ·) − π(·)‖_TV = 0 for all a < r, while lim_{t→∞} t^a ‖P^t(x, ·) − π(·)‖_TV = ∞ for all a > r. Furthermore, for any p ∈ [0, 1), for appropriate drift function V as defined in the proof, the process converges to stationarity in the V^{(1−η)p}-norm at polynomial order approaching (1 − p)r, i.e. for any a < r, lim_{t→∞} t^{(1−p)a} ‖P^t(x, ·) − π(·)‖_{V^{(1−η)p}} = 0.

Proof First, for the assumed form of π, we have for |x| ≥ Δ that (log π)′(x) = −(1 + r) x∕(1 + x²). It follows that λ(x, v) = (1 + r) x∕(1 + x²) ≥ (1 + r)∕(1 + x) for v = +1 and x ≥ Δ ≥ 1. Next, for some α > 1 and K > 1 to be determined later, let V(x, v) = K (1 + |x|)^α when xv ≥ 0, and V(x, v) = (1 + |x|)^α when xv < 0. Next, note that this process has infinitesimal generator A which acts (cf. Ethier and Kurtz 1986; Deligiannidis et al. 2019) on appropriate C¹ functions f : X × {−1, 1} → ℝ by:

(1) A f(x, v) = v (∂∕∂x) f(x, v) + λ(x, v) [f(x, −v) − f(x, v)].

Hence, for v = +1 and x ≥ Δ, AV(x, +1) = K α (1 + x)^{α−1} − λ(x, +1)(K − 1)(1 + x)^α ≤ [K α − (K − 1)(1 + r)] (1 + x)^{α−1}.

Suppose it holds that

(2) (K − 1)(r + 1) − K α > 0.

For x ≥ Δ and v = −1 we have λ(x, v) = 0, so we compute from (1) that AV(x, −1) = −α (1 + x)^{α−1}. Combining these two calculations, it follows that if K* = min(α, (K − 1)(r + 1) − K α), then assuming (2), we have for x ≥ Δ and either v = +1 or v = −1 that AV(x, v) ≤ −K* (1 + x)^{α−1}. By symmetry, this condition also holds for x ≤ −Δ, i.e. it holds whenever |x| ≥ Δ. Since (1 + |x|)^{α−1} = ((1 + |x|)^α)^{1−(1∕α)} ≥ (V(x, v)∕K)^{1−(1∕α)}, this shows that the assumptions of Corollary 2 hold with η = 1∕α, so (1 − η)∕η = α − 1. Hence, that corollary gives that lim_{t→∞} t^{(1−p)(α−1)} ‖P^t(x, ·) − π(·)‖_{V^{(1−η)p}} = 0, and in particular with p = 0, lim_{t→∞} t^{α−1} ‖P^t(x, ·) − π(·)‖_TV = 0. It remains to ensure that (2) holds. But (2) can be satisfied for any α < 1 + r by using a sufficiently large K. It follows that the polynomial order α − 1 can be made ≥ r − ϵ for any ϵ > 0, i.e. we can take α − 1 = a for any a < r, which gives the claimed upper bounds.
Finally, for the lower bound, note that since the process never moves faster than speed 1, we must have P^t((x, ±1), (t, ∞)) = 0 for x ≤ 0, and similarly that P^t((x, ±1), (−∞, −t)) = 0 for x ≥ 0. Hence, for any x ∈ ℝ, by symmetry, ‖P^t(x, ·) − π(·)‖_TV ≥ π((|x| + t, ∞)) = ∫_{|x|+t}^∞ π(y) dy, which to first order as t → ∞ is proportional to t^{−r}. Hence, lim_{t→∞} t^a ‖P^t(x, ·) − π(·)‖_TV = ∞ for any a > r. This completes the proof. ▪
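The drift computation at the heart of this proof can be sanity-checked numerically. The snippet below uses the drift function in the form reconstructed in the proof (V inflated by K when the process moves outward), and verifies the inequality AV(x, v) ≤ −K*(1 + x)^{α−1} on a grid; the specific parameter values are illustrative assumptions, not taken from the paper:

```python
r, alpha, K = 5.0, 4.0, 4.0   # need alpha < 1 + r and (K-1)(r+1) - K*alpha > 0

def lam(x, v):
    # reflection rate for pi(x) ∝ (1 + x^2)^(-(1+r)/2)
    return max(0.0, (1 + r) * x * v / (1 + x * x))

def V(x, v):
    # drift function, inflated by K when moving outward (x v > 0)
    return (K if x * v > 0 else 1.0) * (1 + abs(x)) ** alpha

def AV(x, v, h=1e-6):
    # generator (1): A V(x,v) = v dV/dx + lambda(x,v) [V(x,-v) - V(x,v)]
    dVdx = (V(x + h, v) - V(x - h, v)) / (2 * h)
    return v * dVdx + lam(x, v) * (V(x, -v) - V(x, v))

Kstar = min(alpha, (K - 1) * (r + 1) - K * alpha)  # = min(4, 2) = 2
for i in range(200):
    x = 1.5 + 0.25 * i
    for v in (1.0, -1.0):
        assert AV(x, v) <= -Kstar * (1 + x) ** (alpha - 1) + 1e-5
print("drift inequality holds on the grid")
```

The check passes for all x ≥ 1.5 on the grid, consistent with the claim that the drift bound holds for all |x| ≥ Δ with Δ ≥ 1.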

Multi-Dimensional Generator Bounds
We now turn to PDMP on X = ℝ^d. At each time, the process has position x and velocity v with |v| = 1. The process primarily moves at fixed constant velocity v. It also reflects along π's contour lines at the hazard rate λ(x, v) = (−v · ∇ log π(x))₊. And it refreshes, by drawing a new v independently from the uniform distribution Ψ on the unit sphere in ℝ^d, at refresh rate s∕|x| for some choice of s > 0 to be determined later, which does not depend on x (but might still depend on d).
(This choice of |x|⁻¹ refresh rate decay helps avoid diffusive behaviour for large |x|, makes the process self-similar in the sense that multiplying it by a constant preserves the trajectories just at a slower speed, and also balances the influence of refreshes with those of the continuous dynamics and reflections as we shall see, thus facilitating our calculations and analysis.) To proceed, consider a drift function of the form V(x, v) = W(C_{x,v}) (1 + |x|^α) for some α > 1, where C_{x,v} = (x · v)∕|x| is the cosine of the angle between x and v, and W(C) > 0 is a bounded function which will be chosen later. We assume that W has right-hand first derivatives (at least), denoted W′(C). Let E := E_Ψ[W(C_{x,U})] be the expected value of W(C_{x,U}) where U ∼ Ψ. Extending (1) to multiple dimensions, and including the refreshes at rate s∕|x|, this process has infinitesimal generator A which acts on appropriate C¹ functions f : X × S^{d−1} → ℝ by A f(x, v) = v · ∇_x f(x, v) + λ(x, v)[f(x, R_x v) − f(x, v)] + (s∕|x|) ∫_{S^{d−1}} [f(x, u) − f(x, v)] Ψ(du), where R_x v = v − 2 (v · n̂(x)) n̂(x) is the reflected velocity, with n̂(x) = ∇ log π(x)∕|∇ log π(x)| (for background see e.g. Ethier and Kurtz 1986; Deligiannidis et al. 2019). Then we have:

Theorem 5 Suppose ∇ log π(x) is parallel to x (as holds when π is spherically symmetric), so that reflections replace C := C_{x,v} by −C. Then the above PDMP has infinitesimal generator satisfying

AV(x, v) = (1 + |x|^α) [ W′(C)(1 − C²)∕|x| + λ(x, v)(W(−C) − W(C)) + (s∕|x|)(E − W(C)) ] + α C W(C) |x|^{α−1},

where E = E_Ψ[W(C_{x,U})] as above.

The proof of Theorem 5 requires a simple gradient lemma:

Lemma 6 For any a ∈ ℝ, ∇_x(|x|^a) = a |x|^{a−2} x.

Proof We compute that ∂∕∂x_i |x|^a = ∂∕∂x_i (Σ_j x_j²)^{a∕2} = (a∕2)(Σ_j x_j²)^{(a∕2)−1}(2 x_i) = a |x|^{a−2} x_i. Hence, by the chain rule, ∇_x(|x|^a) = a |x|^{a−2} x. ▪

Proof of Theorem 5
We wish to compute the generator AV. Write this as AV = A₁V + A₂V + A₃V, where A₁ is the contribution from the continuous dynamics, A₂ is the contribution from reflections, and A₃ is the contribution from refreshing.
We begin with A₁V (the continuous dynamics). We compute that A₁V(x, v) = v · ∇_x [W(C_{x,v})(1 + |x|^α)]. Now, since ∇_x(x · v) = v, Lemma 6 with a = 1 gives ∇_x C_{x,v} = ∇_x[(x · v)∕|x|] = v∕|x| − (x · v) x∕|x|³, so that v · ∇_x C_{x,v} = (1 − C²)∕|x|. Hence, the W factor contributes (1 + |x|^α) W′(C)(1 − C²)∕|x|. And, Lemma 6 with a = α gives ∇_x(|x|^α) = α |x|^{α−2} x, so that v · ∇_x(1 + |x|^α) = α C |x|^{α−1}. Hence, by the product rule for derivatives, A₁V(x, v) = (1 + |x|^α) W′(C)(1 − C²)∕|x| + α C W(C) |x|^{α−1}. We next consider A₂V (reflections). They occur at rate λ(x, v), and (since ∇ log π(x) is parallel to x) they replace C by −C, whence A₂V(x, v) = λ(x, v) [W(−C) − W(C)] (1 + |x|^α). Finally, we consider A₃V(x, v) (refreshing). Refreshes occur at rate s∕|x|, and replace the current velocity v with a fresh i.i.d. draw from the spherically-symmetric distribution Ψ, whence A₃V(x, v) = (s∕|x|) [E − W(C)] (1 + |x|^α). Putting this all together, the claim follows since AV = A₁V + A₂V + A₃V. ▪

We now assume that π has polynomial tails as in a Student's t-distribution, i.e. that

(3) π(x) ∝ (1 + |x|²)^{−(r+d)∕2},

at least for |x| ≥ Δ. Theorem 5 then gives:

Corollary 7 If π is given by (3), then for |x| ≥ Δ the above PDMP has infinitesimal generator satisfying AV(x, v) ≤ M(C) |x|^{α−1} + O(|x|^{α−2}), where M(C) = W′(C)(1 − C²) + α C W(C) + (r + d) C₊ [W(−C) − W(C)] + s [E − W(C)].

Proof Here for |x| > Δ we have log π(x) = −((r + d)∕2) log(1 + |x|²) + const, so ∇ log π(x) = −((r + d)∕2) (2x)∕(1 + |x|²) = −(r + d) x∕(1 + |x|²), and hence λ(x, v) = (r + d)(x · v)₊∕(1 + |x|²) = (r + d) C₊ |x|∕(1 + |x|²). The result then follows from Theorem 5. ▪
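Both Lemma 6 and the directional-derivative identity v · ∇_x C_{x,v} = (1 − C²)∕|x| used in the A₁ computation can be checked by finite differences. The following quick numerical verification is illustrative only (the helper names are ours):

```python
import math, random

def norm(x):
    return math.sqrt(sum(t * t for t in x))

def grad(f, x, h=1e-6):
    # central finite-difference gradient of f at x
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

rng = random.Random(0)
d, a = 3, 2.7
x = [rng.uniform(0.5, 2.0) for _ in range(d)]   # keep x away from the origin
v = [rng.gauss(0.0, 1.0) for _ in range(d)]
v = [t / norm(v) for t in v]                     # unit velocity

# Lemma 6: grad |x|^a = a |x|^(a-2) x
g = grad(lambda y: norm(y) ** a, x)
expect = [a * norm(x) ** (a - 2) * t for t in x]
assert all(abs(gi - ei) < 1e-5 for gi, ei in zip(g, expect))

# identity behind the A1 term: v . grad C = (1 - C^2)/|x|, with C = (x.v)/|x|
C = sum(xi * vi for xi, vi in zip(x, v)) / norm(x)
gC = grad(lambda y: sum(yi * vi for yi, vi in zip(y, v)) / norm(y), x)
assert abs(sum(vi * gi for vi, gi in zip(v, gC)) - (1 - C * C) / norm(x)) < 1e-5
print("gradient identities verified")
```

Both assertions pass, confirming the two differential identities on which the A₁ calculation rests.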

Multi-Dimensional Convergence Rates
In this section, we prove the following bound on the polynomial convergence rate of high-dimensional PDMP:

Theorem 8 For the above PDMP, for all sufficiently large d ∈ ℕ, with π as in (3) with r > (√(2π) − 1) d, we have for any a < (r + d)√(2∕(πd)) − 1 and any p ∈ [0, 1) that lim_{t→∞} t^{(1−p)a} ‖P^t((x, v), ·) − π(·)‖_{V^{(1−η)p}} = 0, with η = 1∕α, for appropriate choice of refresh parameter s and drift function V as defined in the proof.

In particular, taking p = 0, lim_{t→∞} t^a ‖P^t((x, v), ·) − π(·)‖_TV = 0. That is, the process converges to stationarity in total variation distance at polynomial order approaching (r + d)√(2∕(πd)) − 1. On the other hand, for any a > r, lim_{t→∞} t^a ‖P^t((x, v), ·) − π(·)‖_TV = ∞.

Proof To obtain specific convergence rate bounds, we need to choose the function W(C) in the drift function V(x, v) = W(C_{x,v})(1 + |x|^α). After considering many possible choices, including some complicated ones, we eventually settled on the simple piecewise-linear choice

(4) W(C) = 1 + m C 1_{C<0},

for some m ∈ (0, 1), i.e. W(C) = 1 + mC for C < 0 and W(C) = 1 for C ≥ 0. For this W(C), let M(C) be as in Corollary 7. Note that

(5) 1 − m ≤ W(C) ≤ 1.

To proceed, let k = r∕d, so k > √(2π) − 1, whence (k + 1)∕√(2π) > 1. We now consider separately the cases C < 0 and C ≥ 0. For C < 0, it follows from (4) that W(C) = 1 + mC and W′(C) = m and C₊ = 0, so M(C) = m(1 − C²) + α C (1 + mC) + s(E − 1 − mC). Hence M(C) ≤ m + α C (1 + mC) − s m C + s(E − 1). Next we use Lemma 9 below, which states that

(6) E ≤ 1 − m∕√(2πd),

and hence M(C) ≤ m + α C (1 + mC) − s m C − s m∕√(2πd), which is < 0 throughout [−1, 0) for all sufficiently large d, for appropriate choices of m, s, and α. For C ≥ 0, it follows from (4) that W(C) = 1 and W′(C) = 0 and C₊ = C, and also W(−C) = 1 − mC, so M(C) = α C − (r + d) m C² + s(E − 1). Hence, M′(C) = α − 2(r + d) m C. So, on [0, 1], the function M is first increasing and then decreasing, with a maximum where α − 2(r + d) m C = 0, i.e. at C = α∕(2(r + d) m). Hence, again using (6), M(C) ≤ α²∕(4(r + d) m) − s m∕√(2πd). Then, using the bound (5) and that (k + 1) d = r + d, appropriate choices of m and s (with α approaching (r + d)√(2∕(πd))) make this quantity negative as well, so it must be < 0 for all sufficiently large d. Combining the two cases, Corollary 7 gives AV(x, v) ≤ −c₃ |x|^{α−1} for some c₃ > 0 and all sufficiently large |x|, so Corollary 3 applies with γ = 1, yielding convergence at polynomial order approaching α − 1, which approaches (r + d)√(2∕(πd)) − 1. Finally, the lower bound for a > r follows exactly as in the proof of Theorem 4: the speed-1 process started at (x, v) cannot reach beyond radius |x| + t by time t, while π({y : |y| > t}) decays at rate t^{−r}. ▪
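As a numerical sanity check on the case analysis above, the script below evaluates the reconstructed expression M(C) = W′(C)(1 − C²) + αCW(C) + (r + d)C₊[W(−C) − W(C)] + s[E − W(C)], with E computed by integrating W against the Lemma 10 density, and confirms that M stays negative over C ∈ [−1, 1] for one illustrative parameter choice (the parameter values are our own assumptions, not taken from the paper, and are not optimised):

```python
import math

d, r = 100, 200                 # so k = r/d = 2
m, s, alpha = 0.1, 26.0, 3.0    # illustrative drift/refresh parameters (assumed feasible)

def W(c):
    # piecewise-linear weight (4): W(C) = 1 + m C for C < 0, and 1 for C >= 0
    return 1.0 + m * c if c < 0 else 1.0

def Wp(c):
    # right-hand derivative W'(C)
    return m if c < 0 else 0.0

def dens(c):
    # Lemma 10: density of C = cos angle(x, U), U ~ Psi, proportional to (1 - c^2)^((d-3)/2)
    return (1.0 - c * c) ** ((d - 3) / 2.0)

n = 200001
cs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
Z = sum(dens(c) for c in cs)
E = sum(W(c) * dens(c) for c in cs) / Z   # E = E_Psi[W(C)], by numerical integration

def M(c):
    return (Wp(c) * (1 - c * c) + alpha * c * W(c)
            + (r + d) * max(c, 0.0) * (W(-c) - W(c)) + s * (E - W(c)))

worst = max(M(c) for c in cs)
print(E, worst)   # worst is (slightly) negative for these parameters
```

The worst case occurs just below C = 0, matching the case analysis; the margin is small, reflecting how delicately m, s, and α must be balanced.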

An Expectation Computation
To complete the proof of Theorem 8, we require the following computation:

Lemma 9
For W(C) as in (4), consider the expected value E := E_Ψ[W(C_{x,U})], where U ∼ Ψ, where Ψ is the uniform distribution on the unit sphere in ℝ^d for some d > 1, and x ≠ 0 is any fixed vector in ℝ^d, and C_{x,U} is the cosine of the angle between x and U. Then E ≤ 1 − m∕√(2πd).

To prove Lemma 9, we first need another lemma giving the C_{x,U} density function:

Lemma 10 Let U ∼ Ψ as in Lemma 9. Then the quantity C_{x,U} has density function on [−1, 1] proportional to f(c) = (1 − c²)^{(d−3)∕2}.
Proof Let (e₁, …, e_d) be an orthonormal basis of ℝ^d with e₁ = x∕|x|, and write Z = (Z₁, …, Z_d) in this basis where {Z_i} are i.i.d. N(0, 1). Then the unit vector Z∕|Z| has uniform distribution Ψ, so C_{x,U} has the same distribution as Z₁∕|Z| = Z₁∕√(Z₁² + ⋯ + Z_d²). Therefore, C_{x,U}² has the same distribution as Z₁²∕(Z₁² + ⋯ + Z_d²) ∼ Beta(1∕2, (d − 1)∕2), using the general property that if X ∼ χ²(ν) and Y ∼ χ²(μ) are independent, then X∕(X + Y) ∼ Beta(ν∕2, μ∕2). Hence, C_{x,U}² has density function on [0, 1] proportional to h(c) = c^{−1∕2}(1 − c)^{(d−3)∕2}. Then, |C_{x,U}| = √(C_{x,U}²) = g(C_{x,U}²) where g(c) = √c and g⁻¹(c) = c². So, by the change-of-variable formula, |C_{x,U}| has density on [0, 1] proportional to h(c²) · 2c = 2(1 − c²)^{(d−3)∕2}, i.e. proportional to (1 − c²)^{(d−3)∕2}. Finally, since C_{x,U} is symmetric about 0, the density of C_{x,U} on all of [−1, 1] must also be proportional to (1 − c²)^{(d−3)∕2}. ▪
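Lemma 10 can be verified by simulation, using the same Gaussian representation as in the proof. The following is a quick Monte Carlo check (sample sizes and the dimension d = 8 are arbitrary choices of ours), comparing the simulated value of E|C_{x,U}| against numerical integration of |c| against the stated density:

```python
import math, random

d = 8
rng = random.Random(0)

def sample_C():
    # U = Z/|Z| with Z ~ N(0, I_d) is uniform on the sphere; C = first coordinate of U
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return z[0] / math.sqrt(sum(t * t for t in z))

# Monte Carlo estimate of E|C|
N = 100000
mc = sum(abs(sample_C()) for _ in range(N)) / N

# numerical integral of |c| against the Lemma 10 density f(c) ∝ (1 - c^2)^((d-3)/2)
n = 100001
cs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
f = [(1.0 - c * c) ** ((d - 3) / 2.0) for c in cs]
exact = sum(abs(c) * fc for c, fc in zip(cs, f)) / sum(f)

print(mc, exact)  # the two values agree to Monte Carlo accuracy
```

The agreement of the two estimates illustrates both the Gaussian construction of Ψ and the form of the density in Lemma 10.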