Equivalences of Geometric Ergodicity of Markov Chains

This paper gathers together different conditions which are all equivalent to geometric ergodicity of time-homogeneous Markov chains on general state spaces. A total of 34 different conditions are presented (27 for general chains plus 7 just for reversible chains), some old and some new, in terms of such notions as convergence bounds, drift conditions, spectral properties, etc., with different assumptions about the distance metric used, finiteness of function moments, initial distribution, uniformity of bounds, and more. Proofs of the connections between the different conditions are provided, mostly self-contained but using some results from the literature where appropriate.


Introduction
The increasing importance of Markov chain Monte Carlo (MCMC) algorithms (see e.g. [2] and the many references therein) has focused attention on the rate of convergence of (time-homogeneous) Markov chains to their stationary distribution. While it is most useful to have explicit quantitative bounds on the distance to stationarity (see e.g. [27,13] and the references therein), qualitative convergence bounds are often more feasible to obtain. The most commonly-used qualitative convergence property is geometric ergodicity, i.e. exponentially fast convergence to stationarity, which has been widely studied (e.g. [29,18,23]), and indeed has become a de facto method of assessing the value of MCMC algorithms.
In addition to fast convergence, geometric ergodicity also guarantees a Markov chain Central Limit Theorem (CLT), i.e. the convergence of scaled sums of functional values to a fixed normal distribution, for all functionals with finite 2 + δ moments [9, Theorem 18.5.3] (see also [8]), or even just 2nd moments assuming reversibility [22]. Such CLTs are helpful for understanding the errors which arise from Monte Carlo estimation (see e.g. [29,25,12]). However, geometric ergodicity and CLTs do not hold for all Markov chains nor all MCMC algorithms (see e.g. [21] and [23, Theorem 22]).
For certain types of MCMC algorithms, geometric ergodicity is fairly well understood. For example, it is known that an Independence Sampler is geometrically ergodic if and only if its proposal density is bounded below by a constant multiple of the target density [16], and that the popular Random-Walk Metropolis algorithm is geometrically ergodic essentially if and only if its target distribution has exponentially light tails [17,25]. However, for many other complicated Markov chains and MCMC algorithms, geometric ergodicity is not clear.
One promising way of establishing geometric ergodicity is to show that some other properties of Markov chains imply it, or are even equivalent to it. This has been shown, by [29,18,22,26] and others, for properties such as drift conditions, spectral bounds, and more. However, such relationships are scattered throughout the literature, are not always stated in full generality, and are often presented as just one-way implications. In the current work, we present a total of 34 different conditions which are equivalent to geometric ergodicity for Markov chains on general state spaces (27 for general chains plus 7 just for reversible chains; some previously known and some new). We then provide proofs of all of the equivalences (somewhat self-contained, though using known results where needed); see Figure 1.
To illustrate the flavour of the various equivalences, consider the following: • The usual definitions of geometric ergodicity state that the Markov chain's distance to stationarity after n iterations is bounded by a constant times ρ^n for some ρ < 1. But what "distance" should be used: total variation, or V-norm, or L²(π)? And, how does the "constant" depend on the starting state X_0 = x? Must those constants have finite expected value with respect to π? What about finite j-th moments?
• If the initial state X_0 is itself chosen from a non-degenerate initial probability distribution µ, then will the convergence to stationarity still be geometric, at least if µ is, say, in L^p(π)?
• Geometric ergodicity is well-known to be implied by drift conditions of the form PV(x) ≤ λ V(x) + b 1_S(x) for some function V : X → [1, ∞] and λ < 1 and b < ∞ and small set S. But are such drift conditions actually equivalent to geometric ergodicity? And, can the drift function V be taken to have finite stationary mean? Finite j-th moment?
• Geometric ergodicity is also related to the Markov operator P having a spectral gap. But as an operator on what space: L^∞_V? For what function V? Having which finite moments? And should the "gap" be identified by removing the eigenvalue 1 directly, or by subtracting off Π, or by restricting to the zero-mean space L^∞_{V,0}?
• Geometric ergodicity is implied by the Markov operator norm being less than 1. But for which operator: P, or P^m for some m ∈ N? Regarded as an operator on L^∞_V or L^∞_{V,0}? For what choice of V? Having which finite moments?
• If the Markov chain is assumed to be reversible, so that the operator P is self-adjoint on L²(π), then in which of the above conditions can the operator norm be taken in L²(π)?
We shall see that the answer to these questions is, essentially, "all of the above". That is, we shall state many different conditions, which cover essentially all of the above possibilities, and shall prove that they are all equivalent. In our desire to be thorough, we might have gone a bit overboard listing so many different conditions, including some which are just minor variations of each other. However, we believe that additional equivalent conditions can only help: the equivalences with weaker assumptions are easier to establish, while the equivalences with stronger assumptions are most useful for drawing conclusions or analysing further. We know from bitter experience that it can be very frustrating to discover a statement about geometric ergodicity which is almost, but not quite, exactly what we can verify, or exactly what is needed to finish a particular proof. This has led us to adopt a "the more the merrier" attitude regarding different but similar conditions. The reader can, of course, choose to ignore all conditions which are not germane to their work.
As mentioned, many of the equivalences presented herein were already known; see the Remark after Theorem 1 below. Thus, this paper falls somewhere in between an expository/review paper and an original research paper, but we hope it is helpful nonetheless.
Basic definitions necessary to understand the conditions, such as total variation distance, L^∞_V norms, L^p(π) spaces, reversibility, etc., are presented in Section 2. Then, in Section 3, all of the equivalent conditions are introduced (Theorem 1). Sections 4 through 8 are then devoted to proving all of the equivalences; see Figure 1 for a visual guide showing which implications are proved by which of our results. Our proofs are somewhat self-contained, but we do use known results in the literature (especially [18]) where needed. Finally, we close in Section 9 with some future directions and open problems (Q 9.1 through Q 9.7).

Definitions and Background
Throughout this paper, Φ = {X_n}_{n=0}^∞ is a discrete-time, time-homogeneous Markov chain on a general state space X equipped with a σ-algebra F. And, P is the corresponding Markov kernel, so that P(x, A) = P[X_n ∈ A | X_{n−1} = x] for all x ∈ X and A ∈ F and n ∈ N. The kernel P acts to the left on (possibly signed) measures, and to the right on functions, by:

(µP)(A) = ∫_X P(x, A) µ(dx),    (Pf)(x) = ∫_X f(y) P(x, dy).
The higher-order transitions are then defined inductively by P^n(x, A) = ∫_X P^{n−1}(y, A) P(x, dy). We shall assume throughout that P has a stationary distribution, i.e. a probability distribution π on (X, F) which is preserved by P in the sense that πP = π. We define Π := 1_X ⊗ π by Π(x, A) = π(A) for all x ∈ X and A ∈ F. If µ is a probability measure, then (µΠ)(A) = π(A), and µ(P^n − Π) = µP^n − π. Also, by stationarity of π, we have (P − Π)^n = P^n − Π for each n ∈ N.
We shall assume that our Markov chain is φ-irreducible, i.e. there exists a non-zero σ-finite measure φ on (X, F) such that for all x ∈ X and A ⊆ X with φ(A) > 0, there is n ∈ N with P^n(x, A) > 0. We shall also assume that it is aperiodic, i.e. there do not exist d ≥ 2 and disjoint X_1, ..., X_d ⊆ X of positive π measure, such that P(x, X_{i+1}) = 1 for all x ∈ X_i (i = 1, ..., d − 1) and P(x, X_1) = 1 for all x ∈ X_d. It is well-known (e.g. [18,23]) that these conditions guarantee that P^n(x, A) → π(A) as n → ∞ (see also Q 9.1 and Q 9.3 below). Geometric ergodicity then corresponds to the property, which may or may not hold, that this convergence occurs exponentially quickly.
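For a finite state space, this convergence can be checked directly. The following minimal numerical sketch uses a hypothetical 3-state chain (not from the paper) and illustrates P^n(x, ·) → π(·) for every starting state:

```python
import numpy as np

# A small irreducible, aperiodic chain on X = {0, 1, 2}; rows of P are the
# transition distributions P(x, ·).  (Hypothetical example for illustration.)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary π: the left eigenvector of P for eigenvalue 1, normalized.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# Every row of P^n converges to π; here the convergence is exponentially fast.
Pn = np.linalg.matrix_power(P, 50)
print(np.abs(Pn - pi).max())  # essentially zero
```

Here all quantities are exact linear algebra, so the geometric decay of the rows of P^n toward π is visible to machine precision.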
We shall also assume that the state space (X, F) is countably generated, i.e. that there exist A_1, A_2, ... ∈ F such that F = σ(A_1, A_2, ...), i.e. F is the smallest σ-algebra containing all of the A_i. This technical property ensures the existence of small sets [4,10,20] and the measurability of certain functions [22, Appendix] (see also Q 9.2 below).
A subset S ∈ F is called small if π(S) > 0 and there are m ∈ N and a non-zero measure ν on (X, F) such that P^m(x, A) ≥ ν(A) for all x ∈ S and A ∈ F, i.e. if the m-step transition probabilities from within S all have some "overlap". This property is very useful for coupling constructions and for ensuring convergence to stationarity (see e.g. [18,23]).
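On a finite state space the minorization defining a small set can be exhibited explicitly; a minimal sketch with a hypothetical 3-state chain:

```python
import numpy as np

# Rows of P are transition distributions; S = {0, 1} is a candidate small set.
# (Hypothetical example for illustration.)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
S = [0, 1]

# With m = 1, the largest valid ν is the pointwise minimum of the rows from S:
# then P(x, A) ≥ ν(A) for all x in S, and ν is non-zero, so S is small.
nu = P[S].min(axis=0)
print(nu, nu.sum())  # a non-zero measure, so the "overlap" condition holds
```

The total mass of ν (here 0.7) is the one-step overlap used in coupling arguments.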
The total variation distance between two probability measures µ_1 and µ_2 is defined by ‖µ_1 − µ_2‖_TV = sup_{A∈F} |µ_1(A) − µ_2(A)| (see e.g. [23, Proposition 3(b)]). Given a positive function V : X → R, we define the V-norm of a function f : X → R by |f|_V = sup_{x∈X} |f(x)| / V(x) [18, p. 390]. We let L^∞_V be the vector space of all functions f : X → R with |f|_V < ∞, and L^∞_{V,0} be the subspace of those f ∈ L^∞_V with π(f) = 0. Then, we define the V-norm of a Markov kernel P as the induced operator norm ‖P‖_{L^∞_V} = sup{ |Pf|_V : f ∈ L^∞_V, |f|_V ≤ 1 }. For a (possibly signed) measure µ ≪ π, we define ‖µ‖_{L^p(π)} for 1 ≤ p < ∞ by ‖µ‖_{L^p(π)} = ( ∫_X |dµ/dπ|^p dπ )^{1/p}. (If p = 1 and µ ≪ π, then the two definitions coincide.) We let L^p(π) be the collection of all signed measures µ on (X, F) with ‖µ‖_{L^p(π)} < ∞, and define the L^p(π)-norm of a transition kernel P acting on the set L^p(π) by ‖P‖_{L^p(π)} = sup{ ‖µP‖_{L^p(π)} : µ ∈ L^p(π), ‖µ‖_{L^p(π)} ≤ 1 }. (Note in particular that the L^p(π) are collections of signed measures, while L^∞_V and L^∞_{V,0} are collections of functions.) The transition kernel P is reversible with respect to π if π(dx) P(x, dy) = π(dy) P(y, dx) for all x, y ∈ X. This is equivalent to P being a self-adjoint operator on the Hilbert space L²(π), with inner product given by ⟨µ, ν⟩ = ∫_X (dµ/dπ)(dν/dπ) dπ.
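In the discrete case these norms reduce to finite sums and maxima; a minimal sketch with hypothetical values:

```python
import numpy as np

# Total variation distance between discrete µ1, µ2:
# sup_A |µ1(A) - µ2(A)| is attained at A = {x : µ1(x) > µ2(x)},
# which gives the familiar formula (1/2) Σ_x |µ1(x) - µ2(x)|.
def tv(mu1, mu2):
    return 0.5 * np.abs(mu1 - mu2).sum()

# V-norm of a function f: |f|_V = sup_x |f(x)| / V(x).
def v_norm(f, V):
    return np.max(np.abs(f) / V)

mu1 = np.array([0.5, 0.3, 0.2])
mu2 = np.array([0.2, 0.3, 0.5])
print(tv(mu1, mu2))                       # 0.3
print(v_norm(np.array([2.0, -6.0, 1.0]),
             np.array([1.0, 4.0, 1.0])))  # 2.0
```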
Finally, given an operator P on a Banach space (i.e. a complete normed vector space) V, e.g. V = L^∞_V or L²(π), the spectrum of P, denoted by S(P) or S_V(P), is the set of all complex numbers λ such that λI − P is not invertible (see e.g. [28, p. 253]). And, the spectral radius of P is the number r(P) = r_V(P) = sup_{λ∈S_V(P)} |λ|.
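For a finite chain the spectrum is just the set of matrix eigenvalues, and subtracting Π removes the eigenvalue 1; a minimal sketch with a hypothetical 3-state chain:

```python
import numpy as np

# Hypothetical 3-state chain; its spectrum is the set of eigenvalues of P.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
w, V = np.linalg.eig(P.T)                       # stationary π from P^T
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()
Pi = np.outer(np.ones(3), pi)                   # Π(x, ·) = π(·) for every x

r_P   = np.abs(np.linalg.eigvals(P)).max()      # spectral radius of P: always 1
r_gap = np.abs(np.linalg.eigvals(P - Pi)).max() # eigenvalue 1 removed
print(r_P, r_gap)  # 1.0 and a value < 1: a spectral gap
```

For this particular chain the non-unit eigenvalues are 0.3 and 0.2, so r(P − Π) = 0.3 < 1.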

Main Result: Statement of Equivalences
We now provide a list of 27 conditions which are always equivalent to geometric ergodicity of Markov chains, and an additional 7 (for 34 total) which are also equivalent for reversible chains. Some of the conditions are very similar to each other, but are included to allow for maximum flexibility when establishing or using geometric ergodicity in both theoretical investigations and applications. For ease of comprehension, similar conditions are grouped together under common subheadings.
Theorem 1. Let P be the transition kernel of a φ-irreducible, aperiodic Markov chain Φ = {X_n} with stationary probability distribution π on a countably generated measurable state space (X, F). Then the following are equivalent (and all correspond to being "geometrically ergodic"):

Geometric Convergence in TV:
i) Φ is geometrically ergodic starting from π-a.e. x ∈ X with constant geometric rate. This means there is a fixed ρ < 1 such that for π-a.e. x ∈ X there is C_x < ∞ with ‖P^n(x, ·) − π(·)‖_TV ≤ C_x ρ^n for all n ∈ N.

ii) There exists A ∈ F with π(A) > 0 such that Φ is geometrically ergodic starting from each x ∈ A. This means for each x ∈ A, there are ρ_x < 1 and C_x < ∞ with ‖P^n(x, ·) − π(·)‖_TV ≤ C_x ρ_x^n for all n ∈ N.

iii) There exists p ∈ (1, ∞) such that Φ is geometrically ergodic starting from all probability measures in L^p(π). This means there is some p ∈ (1, ∞) such that for each probability measure µ ∈ L^p(π) there are constants ρ_µ < 1 and C_µ < ∞ with ‖µP^n − π‖_TV ≤ C_µ ρ_µ^n for all n ∈ N.

iv) For all p ∈ (1, ∞), Φ is geometrically ergodic starting from all probability measures in L^p(π) with geometric rate depending only on p. This means for each p ∈ (1, ∞), there is ρ_p < 1 such that for each probability measure µ ∈ L^p(π) there is C_µ < ∞ with ‖µP^n − π‖_TV ≤ C_µ ρ_p^n for all n ∈ N.

v) There exists a small set S ∈ F such that Φ is geometrically ergodic uniformly over starting states within S. This means there are constants ρ_S < 1 and C_S < ∞ with sup_{x∈S} ‖P^n(x, ·) − π(·)‖_TV ≤ C_S ρ_S^n for all n ∈ N.

vi) There exists a small set S ∈ F such that Φ is geometrically ergodic starting from the stationary distribution restricted to S. This means there are constants ρ_S < 1 and C_S < ∞ with ‖π_S P^n − π‖_TV ≤ C_S ρ_S^n for all n ∈ N, where π_S is the probability measure defined by π_S(A) = π(S ∩ A) / π(S) for A ∈ F.
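Conditions of this geometric-TV type can be observed numerically on a finite chain, where the rate ρ matches the second-largest eigenvalue modulus; a minimal sketch (hypothetical 3-state chain whose non-unit eigenvalues are 0.3 and 0.2):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

def tv_from_x(n, x):                  # ||P^n(x, ·) - π(·)||_TV
    return 0.5 * np.abs(np.linalg.matrix_power(P, n)[x] - pi).sum()

# Successive ratios approach a geometric rate ρ < 1 (here 0.3), consistent
# with a bound of the form ||P^n(x, ·) - π(·)||_TV ≤ C_x ρ^n.
ratios = [tv_from_x(n + 1, 0) / tv_from_x(n, 0) for n in range(5, 15)]
print(ratios[-1])  # close to 0.3
```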

Geometric Return Time:
vii) There exists a small set S ∈ F and constant κ > 1 such that sup_{x∈S} E_x[κ^{τ_S}] < ∞, where τ_S = inf{n ≥ 1 : X_n ∈ S} is the first return time to S, and E_x is expected value conditional on X_0 = x.
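For a finite chain, the return-time moments in a condition of this form can be computed exactly by a first-step decomposition and a linear solve; a minimal sketch (hypothetical 3-state chain, S = {0}, κ = 1.2, chosen so that I − κP_TT is invertible):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
S, T = [0], [1, 2]          # candidate small set S and its complement T
kappa = 1.2

# First-step decomposition: for x outside S,
#   g(x) = E_x[κ^{τ_S}] = κ [ P(x, S) + Σ_{y in T} P(x, y) g(y) ],
# i.e. the linear system (I - κ P_TT) g = κ P_TS 1.
P_TT = P[np.ix_(T, T)]
P_TS = P[np.ix_(T, S)].sum(axis=1)
g = np.linalg.solve(np.eye(len(T)) - kappa * P_TT, kappa * P_TS)

# For x in S, the first *return* time uses the same one-step decomposition.
h0 = kappa * (P[0, S].sum() + P[0, T] @ g)
print(g, h0)   # all finite, so sup_{x in S} E_x[κ^{τ_S}] < ∞ here
```

The solve is valid because κ is below the reciprocal of the spectral radius of P_TT, so the geometric series defining E_x[κ^{τ_S}] converges.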

Spectral Gap:
xiv) There exist j ∈ N and a π-a.e.-finite measurable function V : X → [1, ∞] with π(V^j) < ∞, such that P has a spectral gap as an operator on L^∞_V, meaning 1 is an eigenvalue of P (which must have multiplicity 1 by Lemma 4.7), and there is ρ < 1 such that S_{L^∞_V}(P) \ {1} ⊆ {λ ∈ C : |λ| ≤ ρ}.

xv) For all j ∈ N, there exists a π-a.e.-finite measurable function V : X → [1, ∞] with π(V^j) < ∞, such that P has a spectral gap as an operator on L^∞_V, meaning 1 is an eigenvalue of P (which must have multiplicity 1 by Lemma 4.7), and there is ρ < 1 such that S_{L^∞_V}(P) \ {1} ⊆ {λ ∈ C : |λ| ≤ ρ}.

Spectral Radius:
xvi) There exist j ∈ N and a π-a.e.-finite measurable function V : X → [1, ∞] with π(V^j) < ∞, such that P − Π has spectral radius less than one as an operator on L^∞_V, i.e. r_{L^∞_V}(P − Π) < 1.

xvii) For all j ∈ N, there exists a π-a.e.-finite measurable function V : X → [1, ∞] with π(V^j) < ∞, such that P − Π has spectral radius less than one as an operator on L^∞_V, i.e. r_{L^∞_V}(P − Π) < 1.

xviii) There exist j ∈ N and a π-a.e.-finite measurable function V : X → [1, ∞] with π(V^j) < ∞, such that P has spectral radius less than one as an operator on L^∞_{V,0}, i.e. r_{L^∞_{V,0}}(P) < 1.

xix) For all j ∈ N, there exists a π-a.e.-finite measurable function V : X → [1, ∞] with π(V^j) < ∞, such that P has spectral radius less than one as an operator on L^∞_{V,0}, i.e. r_{L^∞_{V,0}}(P) < 1.

Conditions Assuming Reversibility:
Furthermore, if Φ is reversible, then the following are also equivalent to the above:

xxviii) Φ is L²(π)-geometrically ergodic starting from any probability measure µ in L²(π) with uniform convergence rate. This means there is ρ < 1 such that for each probability measure µ ∈ L²(π) there is C_µ < ∞ with ‖µP^n − π‖_{L²(π)} ≤ C_µ ρ^n for all n ∈ N.

xxix) There exists ρ < 1 such that for each probability measure µ ∈ L²(π), ‖µP^n − π‖_{L²(π)} ≤ ρ^n ‖µ − π‖_{L²(π)} for all n ∈ N.

xxx) P has a spectral gap as an operator on L²(π), meaning that 1 is an eigenvalue of P (which must have multiplicity 1 by Lemma 4.7), and there is ρ < 1 with S_{L²(π)}(P) \ {1} ⊆ {λ ∈ C : |λ| ≤ ρ}.

xxxi) P − Π has spectral radius less than one as an operator on L²(π), i.e. r_{L²(π)}(P − Π) < 1.

xxxii) P − Π has operator norm less than one as an operator on L²(π), i.e. ‖P − Π‖_{L²(π)} < 1.

xxxiii) P has operator norm less than one as an operator on π^⊥ := {µ ∈ L²(π) : µ(X) = 0}, i.e. ‖P‖_{π^⊥} < 1.

xxxiv) P|_{π^⊥} has spectral radius less than one as an operator on π^⊥, i.e. r_{π^⊥}(P) < 1.
Remark. A number of the above equivalences are already known, as follows. The fact that (vi) implies (i) was shown in [30]. Our Theorem 1 is an attempt to combine and bring together all of these various results, and add others too. (Since initiating this work, we also learned of the recent review [1], which presents certain equivalences for reversible chains in terms of mixing conditions and maximal correlations, which complement some of our conditions (xxviii) through (xxxiv). In addition, the recent volume [5] expands upon much of the material in [18].)

Most of the remainder of this paper is devoted to proving Theorem 1. The proof is divided up into different sections below, in terms of which types of conditions are being considered: Section 4 provides some preliminary lemmas, Section 5 relates to various "Geometric" conditions, Section 6 relates to various conditions involving V functions and L^∞_V bounds, Section 7 relates to various spectral conditions, and Section 8 relates to various conditions for reversible chains. To help the reader (and ourselves) keep track, Figure 1 provides a diagram showing which of our results prove implications between which of the equivalent conditions. Our proofs are somewhat self-contained, but we use known results from the literature (especially [18]) where appropriate. Section 9 then presents some future directions and open problems.

Preliminary Lemmas
We begin with some preliminary lemmas, which are used freely in the sequel, and can be referred to as needed.

Lemma 4.1. Let P be the transition kernel of a φ-irreducible, aperiodic Markov chain with stationary distribution π on a countably generated state space X. Then for any measurable subset A ⊆ X such that π(A) > 0, there exists a small set S such that S ⊆ A.

Proof. This result goes back to [4,10,20], and uses that F is countably generated; see e.g. Theorems 5.2.1 and 5.2.2 in [18].
Lemma 4.2. Let P be the transition kernel of a φ-irreducible, aperiodic Markov chain with stationary distribution π on a countably generated state space X. Then, for each n ∈ N, the function x ↦ ‖P^n(x, ·) − π(·)‖_TV is measurable.

Proof. This follows from [22, Appendix], which proves that for any bounded signed kernel ν(•, A) on a countably generated space such that the function x ↦ ν(x, A) is measurable for each fixed A ∈ F, the function x ↦ sup_{A∈F} ν(x, A) is also measurable.

Lemma 4.5. For all 1 ≤ p < s < ∞, we have L^s(π) ⊆ L^p(π).
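The inclusion in Lemma 4.5 reflects the monotonicity of L^p(π) norms in p (by Jensen's inequality, since π is a probability measure); a quick numerical check with hypothetical values:

```python
import numpy as np

pi = np.array([0.25, 0.5, 0.25])       # a probability measure
dmu_dpi = np.array([2.0, 0.5, 1.0])    # a Radon-Nikodym derivative dµ/dπ

# ||µ||_{L^r(π)} = ( Σ_x π(x) |dµ/dπ(x)|^r )^{1/r}
norm = lambda r: float((pi @ np.abs(dmu_dpi) ** r) ** (1.0 / r))

# The norms are nondecreasing in r, so finiteness of the L^s norm
# implies finiteness of the L^p norm for p < s.
print(norm(1.0), norm(2.0), norm(4.0))
```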
Lemma 4.6. Suppose an operator P on a Banach space V can be decomposed as a direct sum P = P_1 ⊕ P_2, where V = V_1 × V_2 and each P_i is an operator on V_i. Then the spectrum of P is the union of the spectra of the sub-operators P_1 and P_2.
Proof. Since P = P_1 ⊕ P_2, P has the corresponding block-diagonal decomposition. Conversely, if λ ∉ S_V(P), then λI − P has some inverse operator, so in block form we have operators A and B with (λI_1 − P_1)A = A(λI_1 − P_1) = I_1 and (λI_2 − P_2)B = B(λI_2 − P_2) = I_2.

Lemma 4.7. Let P be the transition kernel of a φ-irreducible Markov chain with stationary distribution π(·), and let V : X → [1, ∞] be a π-a.e.-finite measurable function. Then, the following hold:

4) The number 1 is an eigenvalue of P with multiplicity 1, regarding P as an operator on L^p(π) for any 1 ≤ p < ∞. Furthermore, if π(V^j) < ∞ for some j ∈ N, then this also holds regarding P as an operator on L^∞_V and on L^∞_{V,0}.

5) If there are λ < 1 and b < ∞ and a small set S ∈ F with PV(x) ≤ λ V(x) + b 1_S(x) for all x ∈ X, then π(V) < ∞.

Proof. 2) This follows since we always have V(x) ≤ V^j(x) + 1. [In fact, since V ≥ 1, the "+1" is not actually necessary.]

3) Any f ∈ L^∞_V can be written as f = f_0 + c where f_0 ∈ L^∞_{V,0} and c = π(f). Then Pf = Pf_0 + c. It follows that P has the direct sum representation P = P_0 ⊕ I_R, where I_R is the identity operator on R. Hence, by Lemma 4.6, S_{L^∞_V}(P) = S_{L^∞_{V,0}}(P) ∪ {1}, as claimed.

4) Since P is φ-irreducible with stationary probability measure π, it follows that P is "positive" as defined on [18, p. 235]. Moreover, L^∞_V and L^∞_{V,0} are subspaces of L^j(π), and hence the result holds on L^∞_V and L^∞_{V,0} too.
5) The implication "(iii) ⇒ (i)" of [18, Theorem 14.0.1] with the choice f(

Lemma 4.8. Let P be the transition kernel of a reversible Markov chain with stationary distribution π, such that P is a bounded operator on L²(π). Then, the following holds:

1) The operator P − Π is self-adjoint.

Proofs for Geometric Conditions
We now begin proving the actual equivalences of the various conditions in Theorem 1, as per the plan illustrated in Figure 1. We begin with some results related to some of the "geometric" conditions.

Proposition 5.1. (iv) ⇒ (iii).

Proposition 5.2. (iii) ⇒ (vi).
Proof. By Lemma 4.1, there exists a small set S ⊂ X. Since by assumption P is geometrically ergodic starting from all probability measures in L^p(π), it suffices to show that π_S ∈ L^p(π). Now, dπ_S/dπ = 1_S / π(S), so ‖π_S‖_{L^p(π)} ≤ 1/π(S) < ∞.

Proof. This is the result of [19, Theorem 1], which generalizes the countable state space result of [30].
Proof. Immediate upon choosing A = X and ρ_x = ρ for all x ∈ X.
Proof. By assumption (ii), for each x ∈ A there are ρ_x < 1 and C_x < ∞ with D_n(x) := ‖P^n(x, ·) − π(·)‖_TV ≤ C_x ρ_x^n for all x ∈ A and n ∈ N.

Then each D_n is measurable by Lemma 4.2, hence so are the functions r, s, M : A → [0, ∞] defined by r(x) = lim sup_{n→∞} D_n(x)^{1/n}, s(x) = (1 + r(x))/2, and M(x) = sup_{n∈N} D_n(x)/s(x)^n. In particular, for each x ∈ A, we have r(x) ≤ ρ_x < 1, so s(x) < 1. Hence, there is N(x) ∈ N such that for all n > N(x) we have D_n(x)^{1/n} < s(x), i.e. D_n(x)/s(x)^n < 1. Then M(x) ≤ max{ D_1(x)/s(x)^1, D_2(x)/s(x)^2, ..., D_{N(x)}(x)/s(x)^{N(x)}, 1 } < ∞. Now, since s and M are measurable, so are the nested subsets B_k = { x ∈ A : s(x) ≤ 1 − 1/k, M(x) ≤ k }. Since s(x) < 1 and M(x) < ∞ for each x ∈ A, we must have ∪_k B_k = A. Continuity of measures then implies that lim_{k→∞} π(B_k) = π(A) > 0, so there is K ∈ N with π(B_K) > 0. By Lemma 4.1, there exists a small set S ⊆ B_K. Then for x ∈ S, we have x ∈ B_K, so s(x) ≤ 1 − 1/K and M(x) ≤ K. It follows that for x ∈ S and n ∈ N, D_n(x) ≤ M(x) s(x)^n ≤ K (1 − 1/K)^n. This establishes (v) with C_S = K and ρ_S = 1 − 1/K.
Then, |f| ≤ V, and, if (xi) holds, then for each x ∈ X and each n ∈ N, and therefore,

Proof.
where Proof.
Proof. Take p = 2 in (iv). Then it follows from the "(iii) ⇒ (ii)" implication of [22, Theorem 2] (which is proven by contradiction, using reversibility and the spectral measure of P acting on L²(π)) that there is ρ < 1 such that ‖µP^n‖_{L²(π)} ≤ ρ^n ‖µ‖_{L²(π)} for all signed measures µ ∈ L²(π) with µ(X) = 0. Hence, ‖P‖_{π^⊥} ≤ ρ < 1.

Proof. From Lemma 4.9 it follows that

Future Directions and Open Problems

Our Theorem 1 above provides a fairly complete picture of equivalences of geometric ergodicity. However, it does lead to some additional questions which remain, including:

Q 9.1. We have assumed throughout that the chain is φ-irreducible and aperiodic. Those properties are certainly required for, and implied by, geometric ergodicity. But do they need to be assumed explicitly? Many of our equivalent conditions imply them, so that they do not actually need to be mentioned. But some of our conditions do not, e.g. the drift conditions (viii) and (ix). So, which of our equivalences continue to hold without assuming φ-irreducibility and aperiodicity?

Q 9.2. We also assumed that our state space (X, F) is countably generated, which holds for e.g. the Borel subsets of R and of R^d, but not for e.g. the Lebesgue-measurable subsets. It is a very standard assumption (e.g. [18, p. 66]), used to ensure the existence of small sets [4,10,20] and the measurability of certain functions (e.g. [22, Appendix]). But which of our equivalences would continue to hold without it?
Q 9.3. The property of aperiodicity is not necessary for other important properties such as Central Limit Theorems, which involve averages of functional values like M^{-1} ∑_{i=1}^M h(X_i). The weaker notion of variance bounding essentially corresponds to geometric ergodicity without aperiodicity, and still implies CLTs. Many equivalences to variance bounding have been proven for reversible chains; see [24]. But can equivalences similar to our Theorem 1 be derived for the variance bounding property without assuming reversibility?

Q 9.4. Our later conditions (xxviii) through (xxxiv) were only shown to be equivalent for reversible chains. But are there explicit counter-examples to show that they are not equivalent in the absence of reversibility? Or are some of them still equivalent to geometric ergodicity, even without assuming reversibility? (For a start on this, [15, Theorem 1.3] proves that without reversibility the implication (xxxi) ⇒ (i) still holds, but [15, Theorem 1.4] makes use of [7] to show that the converse might fail.)

Q 9.5. Our equivalences are for the fairly strong property of geometric ergodicity. But are there similar equivalences for the even stronger property of uniform ergodicity, i.e. the property that ‖P^n(x, ·) − π(·)‖_TV ≤ C ρ^n for π-a.e. x ∈ X where C does not depend on x? (For a start on this, see [18, Theorem 16.0.2].)

Q 9.6. In the other direction, are there similar equivalences for the weaker property of polynomial ergodicity, i.e. the property that ‖P^n(x, ·) − π(·)‖_TV ≤ C_x n^{−α} for some α > 0? (For some discussion and results related to this property, see e.g. [6,11].)

Q 9.7. And, are there similar equivalences for the even weaker property of simple ergodicity, i.e. the property that just ‖P^n(x, ·) − π(·)‖_TV → 0 as n → ∞ for π-a.e. x ∈ X, without specifying any rate? (For a start on this, see e.g. [18, Theorem 13.0.1].)

We leave these questions as open problems for future work.
Acknowledgements. We thank Jim Hobert, Galin Jones, and Gareth Roberts for encouraging us to write this paper, and thank the anonymous referee for a very careful reading and helpful report.
Note added in proof: It follows from Proposition 16 on page 3607 of Annals of Applied Probability 25(6) (2015) that we can also include the additional equivalent condition:

vii′) There exists a small set S ∈ F and constant κ > 1 such that if V(x) = E_x(κ^{τ_S}) for all x ∈ X, then PV(x) ≤ λ V(x) + b 1_S(x) for all x ∈ X, where λ = κ^{−1} < 1 and b = sup_{x∈S} V(x) < ∞.
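On a finite chain, a construction of this flavour can be checked numerically. The sketch below uses the closely related hitting-time function V(x) = E_x[κ^{σ_S}] (with σ_S the first hitting time of S, so V ≡ 1 on S), a hypothetical 3-state chain, S = {0}, and κ = 1.2; for x outside S one gets PV(x) = κ^{-1} V(x) exactly, and the small set absorbs the excess via b:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
S, T = [0], [1, 2]
kappa, lam = 1.2, 1 / 1.2

# V(x) = E_x[κ^{σ_S}], σ_S the first hitting time of S (so V = 1 on S);
# off S it solves the linear system (I - κ P_TT) V_T = κ P_TS 1.
P_TT = P[np.ix_(T, T)]
P_TS = P[np.ix_(T, S)].sum(axis=1)
V = np.ones(3)
V[T] = np.linalg.solve(np.eye(len(T)) - kappa * P_TT, kappa * P_TS)

# Drift check: PV(x) ≤ λ V(x) + b 1_S(x), with λ = 1/κ and b chosen as
# the excess of PV over λV on S.
PV = P @ V
b = max(PV[x] - lam * V[x] for x in S)
assert np.all(PV <= lam * V + b * np.isin(np.arange(3), S) + 1e-12)
print(V, b)
```

Off S the drift inequality holds with equality by construction, which is exactly why return-time moments translate into geometric drift.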
Notes added after publication: It follows from [18, Theorem 15.4.1] that the following condition is implied by (i), so since it clearly implies (ii) it is also equivalent:

ii′) There exists an absorbing subset H ∈ F with π(H) = 1 such that there are ρ < 1 and C_x < ∞ with ‖P^n(x, ·) − π(·)‖_TV ≤ C_x ρ^n for all x ∈ H and n ∈ N.

is the conductance (Cheeger's constant).