Optimal thinning of MCMC output

The use of heuristics to assess the convergence and compress the output of Markov chain Monte Carlo can be sub‐optimal in terms of the empirical approximations that are produced. Typically a number of the initial states are attributed to ‘burn in’ and removed, while the remainder of the chain is ‘thinned’ if compression is also required. In this paper, we consider the problem of retrospectively selecting a subset of states, of fixed cardinality, from the sample path such that the approximation provided by their empirical distribution is close to optimal. A novel method is proposed, based on greedy minimisation of a kernel Stein discrepancy, that is suitable when the gradient of the log‐target can be evaluated and approximation using a small number of states is required. Theoretical results guarantee consistency of the method and its effectiveness is demonstrated in the challenging context of parameter inference for ordinary differential equations. Software is available in the Stein Thinning package in Python, R and MATLAB.


Introduction
The most popular computational tool for non-conjugate Bayesian inference is Markov chain Monte Carlo (MCMC). Introduced to statistics from the physics literature in Hastings (1970); Geman and Geman (1984); Tanner and Wong (1987); Gelfand and Smith (1990), an enormous amount of research effort has since been expended in the advancement of MCMC methodology. Such is the breadth of this topic that we do not attempt a survey here, but instead refer the reader to Robert and Casella (2013); Green et al. (2015) and the references therein to more advanced material. This paper is motivated by the fact that the approaches used for convergence assessment and to post-process the output of MCMC can strongly affect the estimates that are produced.
Let P be a distribution on a measurable space X and let (X_i)_{i∈N} be a Markov chain that is P-invariant. The Markov chain sample path provides an empirical approximation to P,

    (1/n) ∑_{i=1}^n δ(X_i) ≈ P,   (1)

where δ(x) denotes a point mass centred at x ∈ X. Our discussion supposes that a practitioner is prepared to simulate a Markov chain up to a maximum number of iterations, n, and that simulating further iterations is not practical; a scenario that is often encountered (e.g. see Section 4.3). In this setting it is common (and indeed recommended) to replace (1) with an alternative estimator

    (1/m) ∑_{j=1}^m δ(X_{π(j)})   (2)

that is based on a subset of the total MCMC output. The m indices π(j) ∈ {1, . . . , n} indicate which states are retained, and the identification of a suitable index set π is informed by the following considerations:

Removal of Initial Bias: The distribution of the initial states of the Markov chain may be quite different to P. To mitigate this, it is desirable to identify a "burn-in" period (X_i)_{i=1}^b which is then discarded. The burn-in period b is typically selected using convergence diagnostics (Cowles and Carlin, 1996). These are primarily based on the empirical distribution of simple moment, quantile or density estimates across independent chains, together with a judgement as to whether the ensemble of chains has converged to the distributional target. The main limitation of convergence diagnostics, as far as we are concerned in this work, is that in taking b large enough to make bias negligible, the number n − b of remaining samples may be rather small, so that the statistical efficiency of the estimator in (2) is sub-optimal as an approximation of P. Nonetheless, a considerable portion of Bayesian pedagogy is devoted to the identification of the burn-in period, as facilitated by diagnostic tests that are built into commercial-grade software such as WinBUGS (Lunn et al., 2000), JAGS (Plummer, 2003), R (R Core Team, 2020), and Stan (Carpenter et al., 2017).
Increased Statistical Efficiency: It is often stated that discarding part of the MCMC output leads to a reduction in the statistical efficiency of the estimator (2) compared to (1). This argument, made e.g. in Geyer (1992), applies only when the procedure used to discard part of the MCMC output does not itself depend on the MCMC output and when the length n of the MCMC output is fixed. That estimation efficiency can be improved by discarding a portion of the samples in a way that depends on the samples themselves is in fact well-established (see e.g. Dwivedi et al., 2019).
Compression of MCMC Output: A third motivation for estimators of the form (2) is to control the cost of subsequent computation involving the MCMC output. Examples include approximating the expectation of a function f, where either evaluation of f or storage of its output is associated with a computational cost, and Monte Carlo Maximum Likelihood, where one constructs an approximate likelihood using MCMC, then performs optimisation on this approximate likelihood (Geyer and Thompson, 1992). In such situations one may want to control the cardinality m of the index set π and to use (X_{π(j)})_{j=1}^m as an experimental design on which f is evaluated. The most popular solution is to retain only every t-th state visited by the Markov chain, a procedure known as "thinning" of the MCMC output. See also the more sophisticated approach in Paige et al. (2016).
Taking these considerations into account, the most common approach used to select an index set π is based on the identification of a suitable burn-in period b and/or a suitable thinning frequency t, leading to an approximation of the form

    (1/m) ∑_{j=1}^m δ(X_{b+jt}),   m = ⌊(n − b)/t⌋.   (3)

Here ⌊r⌋ denotes the integer part of r. This corresponds to a set of indices π in (2) that discards the burn-in states and retains only every t-th iteration from the remainder of the MCMC output. It includes the case where no states are removed, when b = 0 and t = 1. Despite their widespread usage, the interplay between the Markov chain sample path and the heuristics used to select b and t is not widely appreciated. In general it is unclear how much bias may be introduced by employing a post-processing heuristic that is itself based on the MCMC output. Indeed, even the basic question of when the post-processed estimator in (3) is consistent when b and t are chosen based on the MCMC output appears not to have been studied.

In this paper we propose a novel method, called Stein Thinning, that selects an index set π, of specified cardinality m, such that the associated discrete approximation in (2) is close to optimal among all approximations supported on the MCMC output. The method is designed to ensure that (2) is a consistent approximation of P. This includes situations where the Markov chain on which it is based is not P-invariant, but we do of course require that the regions of high probability under P are explored. To achieve this we adopt a kernel Stein discrepancy as our optimality criterion. The minimisation of kernel Stein discrepancy is performed using a greedy sequential algorithm, and the main contribution of our theoretical analysis is to study the interplay of the greedy algorithm with the randomness inherent to the MCMC output.
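In code, the index rule underlying the heuristic (3) is straightforward; the following sketch (the function name and example values are our own illustration, not from the paper) computes the retained indices:

```python
import numpy as np

def thinned_indices(n, b, t):
    """Indices retained by the burn-in/thinning heuristic in (3): discard
    the first b states, then keep every t-th state thereafter; the number
    of retained states is m = floor((n - b) / t)."""
    m = (n - b) // t
    return b + t * np.arange(1, m + 1) - 1  # 0-based indices of states b+t, b+2t, ...

# Example: n = 100 states, burn-in b = 20, thinning frequency t = 10,
# retaining states 30, 40, ..., 100 (as 0-based indices 29, 39, ..., 99)
idx = thinned_indices(100, 20, 10)
```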
The proposed Stein Thinning method is simple (see Algorithm 1), applicable to most problems where gradients of the log-posterior density can be computed, and implemented as convenient Python, R and MATLAB packages that require no additional user input other than the number m of states to be selected (see Appendix S1).

Related Work
Our work contributes to an active area of research that attempts to cast post-processing of MCMC as an optimisation problem. Mak and Joseph (2018) proposed a method, called Support Points, which selects a small number of states such that an empirical measure supported on those states minimises an "energy distance" to P. However, computation of the energy distance requires access to P, and minimisation of energy distance requires a challenging non-convex optimisation problem to be solved, meaning that in practice approximations are required. Stein discrepancy provides a computable alternative, which was used in Liu and Lee (2017) to optimally weight an arbitrary set (X_i)_{i=1}^n ⊂ R^d of states in a manner loosely analogous to importance sampling, at a computational cost of O(n^3). The combined effect of applying the approach of Liu and Lee (2017) to MCMC output was analysed in Hodgkinson et al. (2020), who established situations in which the overall procedure will be consistent.
If a compressed representation of the posterior P is required, but one is not wedded to the use of MCMC for generation of candidate states, then several other methods can be used. Joseph et al. (2015, 2019) proposed a criterion to capture how well an empirical measure based on a point set approximates P and applied repeated numerical optimisation over X to arrive at a suitable point set. A similar approach was taken in Chen et al. (2018), where a Stein discrepancy was numerically minimised. The reliance of both of these algorithms on non-convex numerical optimisation over X renders their implementation and analysis difficult. Chen et al. (2019) considered using Markov chains to approximately perform numerical optimisation, allowing a tractable analytic treatment at the expense of a sub-optimal compression of P. An elegant alternative approach is to formulate a convex optimisation problem on the set of probability distributions on X. In this spirit, Liu and Wang (2016); Liu (2017) identified a gradient flow with P as a fixed point that can be approximately simulated using a particle method. At convergence, one obtains a compressed representation of P; however, the theoretical analysis of this approach remains an open and active research topic (see e.g. Duncan et al., 2019).
The present paper differs from the contributions cited, in that (1) our algorithm requires only the output from one run of MCMC, which is a realistic requirement in many situations, and (2) we are able to provide a finite sample size error bound (Theorem 2) and a consistency guarantee (Theorem 3) for Stein Thinning, that cover precisely the algorithm that we implement.

Outline of the Paper
The paper proceeds, in Section 2, to recall the construction of a kernel Stein discrepancy and to present Stein Thinning. Then in Section 3 we establish a finite sample size error bound, as well as a widely-applicable consistency result that does not require the Markov chain to be P -invariant. In Section 4 we present an empirical assessment of Stein Thinning in the context of parameter inference for ordinary differential equation models. Conclusions are contained in Section 5.

Methods
In this section we introduce and analyse Stein Thinning. First, in Section 2.1, we recall the construction of a kernel Stein discrepancy and its theoretical properties. The Stein Thinning method is presented in Section 2.2, whilst Section 2.3 is devoted to implementational detail.
Before we proceed, we introduce a piece of notation that will often be used and recall the mathematical definition of a reproducing kernel: Notation: Let P denote the set of probability distributions P on R^d that admit a positive density p, with ∇ log p Lipschitz on R^d.
Reproducing Kernel: A reproducing kernel Hilbert space (RKHS) of functions on a set X is a Hilbert space, denoted H(k), equipped with a function k : X × X → R, called a kernel, such that ∀x ∈ X we have k(·, x) ∈ H(k) and ∀x ∈ X, h ∈ H(k) we have h(x) = ⟨h, k(·, x)⟩_{H(k)}. In this paper ⟨·, ·⟩_{H(k)} denotes the inner product in H(k) and the induced norm will be denoted ‖·‖_{H(k)}. For further details, see Berlinet and Thomas-Agnan (2004).

Kernel Stein Discrepancy
To construct a criterion for the selection of states from the MCMC output we require a notion of optimal approximation for probability distributions. To this end, recall that an integral probability metric (IPM) (Muller, 1997), based on a set F of measure-determining functions on a measurable space X, is defined as

    D_F(P, Q) := sup_{f ∈ F} | ∫ f dP − ∫ f dQ |.   (4)

The fact that F is measure-determining means that D_F(P, Q) = 0 if and only if P = Q. Standard choices for F, e.g. those that recover the Wasserstein distance as the IPM, cannot be used in the Bayesian context due to the need to compute integrals with respect to P in (4). In the remainder of Section 2.1 we restrict attention to the setting P ∈ P. To circumvent the intractability of (4), the notion of a Stein discrepancy was proposed in Gorham and Mackey (2015). This was based on Stein's method (Stein, 1972), which consists of finding a set G of sufficiently differentiable d-dimensional vector fields and a differential operator A_P, depending on P and acting on elements of G, such that ∫_{R^d} A_P g dP = 0 for all g ∈ G. The proposal of Gorham and Mackey (2015) was to take F = A_P G, the image of G under A_P, in (4), leading to the Stein discrepancy

    D_{A_P G}(P, Q) = sup_{g ∈ G} | ∫ A_P g dQ |.   (5)

Theoretical analysis has provided sufficient conditions for A_P G to be measure-determining (Gorham and Mackey, 2015). In this paper we focus on a particular form of (5) due to Chwialkowski et al. (2016); Gorham and Mackey (2017), called a kernel Stein discrepancy (KSD). In this case, A_P is the Langevin Stein operator A_P g := p^{−1} ∇ · (pg) derived in Gorham and Mackey (2015), where ∇· denotes the divergence operator in R^d, and G is the unit ball in a Cartesian product of RKHS. It follows from this construction that the set A_P G is the unit ball of another RKHS, denoted H(k_P), whose kernel is

    k_P(x, y) := ∇_x · ∇_y k(x, y) + ⟨∇_x k(x, y), ∇_y log p(y)⟩ + ⟨∇_y k(x, y), ∇_x log p(x)⟩ + k(x, y) ⟨∇_x log p(x), ∇_y log p(y)⟩,   (6)

where ⟨·, ·⟩ denotes the standard Euclidean inner product, ∇ denotes the gradient operator, and subscripts have been used to indicate the variables being acted on by the differential operators (Oates et al., 2017).
Thus KSD is recognised as a maximum mean discrepancy in H(k_P) (Song, 2008) and is fully characterised by the kernel k_P; we therefore adopt the shorthand notation D_{k_P}(Q) for D_{A_P G}(P, Q).
In the remainder of this section we recall the main properties of KSD. The first is a condition on the kernel k that guarantees elements of H(k_P) have zero mean with respect to P. In what follows ‖x‖ = ⟨x, x⟩^{1/2} denotes the Euclidean norm on R^d. It will be convenient to abuse operator notation, writing ∇_x ∇_y k for the Hessian matrix of a bivariate function (x, y) ↦ k(x, y).
Proposition 1 (Proposition 1 of Gorham and Mackey (2017)). Let P ∈ P and assume that ∫_{R^d} ‖∇ log p‖ dP < ∞. Let (x, y) ↦ ∇_x ∇_y k(x, y) be continuous and uniformly bounded on R^d × R^d. Then ∫_{R^d} h dP = 0 for all h ∈ H(k_P), where k_P is defined in (6).

The second main property of KSD that we will need is that it can be explicitly computed for an empirical measure Q = (1/n) ∑_{i=1}^n δ(x_i), supported on states x_i ∈ R^d:

Proposition 2 (Proposition 2 of Gorham and Mackey (2017)). Let P ∈ P and let (x, y) ↦ ∇_x ∇_y k(x, y) be continuous on R^d × R^d. Then

    D_{k_P}( (1/n) ∑_{i=1}^n δ(x_i) ) = [ (1/n²) ∑_{i=1}^n ∑_{j=1}^n k_P(x_i, x_j) ]^{1/2},   (7)

where k_P was defined in (6).
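Proposition 2 makes KSD computable from gradient evaluations alone. As a concrete sketch, the following evaluates (7) for a standard Gaussian target (score −x), with the Stein kernel (6) built from the unpreconditioned inverse multi-quadric base kernel k(x, y) = (1 + ‖x − y‖²)^{−1/2}; the function name and the toy target are our own illustration:

```python
import numpy as np

def ksd_imq(x, score):
    """KSD of the empirical measure on the rows of x, per Proposition 2,
    with base kernel k(x, y) = (1 + ||x - y||^2)^(-1/2) and the Langevin
    Stein kernel k_P of (6); `score` returns grad log p row-wise."""
    d = x.shape[1]
    s = score(x)
    diff = x[:, None, :] - x[None, :, :]        # pairwise x_i - x_j, shape (n, n, d)
    r2 = np.sum(diff**2, axis=-1)
    base = (1.0 + r2) ** (-0.5)                 # k(x_i, x_j)
    gx = -diff * ((1.0 + r2) ** (-1.5))[..., None]   # grad_x k
    gy = -gx                                         # grad_y k
    dxdy = d * (1.0 + r2) ** (-1.5) - 3.0 * r2 * (1.0 + r2) ** (-2.5)
    kp = (dxdy
          + np.einsum('ijk,jk->ij', gx, s)      # <grad_x k, score(x_j)>
          + np.einsum('ijk,ik->ij', gy, s)      # <grad_y k, score(x_i)>
          + base * (s @ s.T))                   # k(x_i, x_j) <score(x_i), score(x_j)>
    return np.sqrt(kp.mean())                   # (7): sqrt of (1/n^2) sum_ij k_P

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))               # approximate sample from N(0, I)
ksd_good = ksd_imq(x, lambda z: -z)             # well-matched sample
ksd_bad = ksd_imq(x + 3.0, lambda z: -z)        # shifted sample: larger KSD
```

The shifted sample illustrates the discriminative behaviour underpinning Proposition 3: moving the sample away from the target inflates the discrepancy.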
The third main property is that KSD provides convergence control. Let Q_n ⇒ P denote weak convergence of a sequence (Q_n) of measures to P. Theoretical analysis in Gorham and Mackey (2017) established that KSD controls weak convergence for suitable kernels and distantly dissipative targets; we state the following variant:

Proposition 3 (Theorem 4 in Chen et al. (2019)). Let P ∈ P be distantly dissipative, meaning that lim inf_{r→∞} κ(r) > 0, where

    κ(r) := inf{ −2 ⟨∇ log p(x) − ∇ log p(y), x − y⟩ / ‖x − y‖² : ‖x − y‖ = r }.

Consider the kernel k(x, y) = (c² + ‖Γ^{−1/2}(x − y)‖²)^β for some fixed c > 0, a fixed positive definite matrix Γ and a fixed exponent β ∈ (−1, 0). If D_{k_P}(Q_n) → 0, then Q_n ⇒ P, where k_P is defined in (6).

The properties just described ensure that KSD is a suitable optimality criterion to consider for the post-processing of MCMC output. However, all discrepancies are associated with finite sample size pathologies; see Matsubara et al. (2021, Section 3.4) for a discussion of the pathologies of KSD. Our attention turns next to the development of algorithms for minimisation of KSD.

Greedy Minimisation of KSD
The convergence control afforded by Proposition 3 motivates the design of methods that select points (x_i)_{i=1}^n such that (7) is approximately minimised. Continuous optimisation algorithms were proposed for this task in Chen et al. (2018) and Chen et al. (2019). In Chen et al. (2018), deterministic optimisation techniques were considered for low-dimensional problems, whereas in Chen et al. (2019) a Markov chain was used to provide a more practical optimisation strategy when the state space is high-dimensional. In each case greedy sequential strategies were considered, wherein at iteration n a new state x_n is appended to the current sequence (x_1, . . . , x_{n−1}). Chen et al. (2018) also considered the use of conditional gradient algorithms (so-called Frank-Wolfe, or kernel herding algorithms) but found that greedy algorithms provided better performance across a range of experiments; we therefore focus on greedy algorithms in this manuscript.
The present paper is distinguished from earlier work in that we do not attempt to solve a continuous optimisation problem for selection of the next point x_n ∈ X. Such optimisation problems are fundamentally difficult and can at best be approximately solved. Instead, we exactly solve the discrete optimisation problem of selecting a suitable element x_n from supplied MCMC output. In this sense we expect our findings to be more widely applicable than previous work, since we are simply performing post-processing of MCMC output and there exists a variety of commercial-grade software for MCMC. The method that we propose, called Stein Thinning, is straight-forward to implement, and is stated in Algorithm 1 for a distribution P on a general measurable space X. (The convention ∑_{j'=1}^0 = 0 is employed.)

Data: The output (x_i)_{i=1}^n from an MCMC method, a kernel k_P for which the conclusion of Proposition 3 holds, and a desired cardinality m ∈ N.
Result: The indices π of a sequence (x_{π(j)})_{j=1}^m ⊂ {x_i}_{i=1}^n, where the π(j) are elements of {1, . . . , n}.
for j = 1, . . . , m do
    π(j) ∈ arg min_{i ∈ {1,...,n}} [ k_P(x_i, x_i)/2 + ∑_{j'=1}^{j−1} k_P(x_{π(j')}, x_i) ]
end
Algorithm 1: The proposed method; Stein Thinning.

The algorithm is illustrated on a simple bivariate Gaussian mixture in Figure 1. Observe in this figure that the points selected by Stein Thinning do not belong to the burn-in period (which is visually clear), and that although the MCMC spent a disproportionate amount of time in one of the mixture components, the number of points selected by Stein Thinning is approximately equal across the two components of the target. The accuracy of the approximation produced by Stein Thinning is, nevertheless, gated by the quality of the MCMC output to which it is applied. A detailed empirical assessment is presented in Section 4.
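Given the n × n matrix of k_P evaluations, each greedy step of Algorithm 1 is an argmin over a running objective, which admits a very short implementation. A minimal sketch follows; the toy one-dimensional Gaussian target (score −x) with unpreconditioned IMQ base kernel is our own illustration, not an example from the paper:

```python
import numpy as np

def stein_thinning(kp_mat, m):
    """Algorithm 1: greedily select indices pi(1), ..., pi(m) so that pi(j)
    minimises k_P(x_i, x_i)/2 + sum_{j' < j} k_P(x_pi(j'), x_i)."""
    obj = 0.5 * np.diag(kp_mat)          # running objective for each candidate i
    pi = np.empty(m, dtype=int)
    for j in range(m):
        pi[j] = int(np.argmin(obj))      # ties broken at the smallest index (Remark 1)
        obj += kp_mat[pi[j]]             # account for the newly selected state
    return pi

# Toy k_P matrix: 1-d standard Gaussian target (score -x), IMQ base kernel
x = np.linspace(-4.0, 4.0, 201)
u = x[:, None] - x[None, :]
q = 1.0 + u**2
kp = q**-1.5 - 3.0 * u**2 * q**-2.5 - u**2 * q**-1.5 + q**-0.5 * np.outer(x, x)
idx = stein_thinning(kp, 20)             # the first selection is the mode x = 0
```

Note that each iteration only adds one row of the kernel matrix to the running objective, which is the source of the complexity discussed in Remark 2.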
Remark 1 (Tie-breaking). In the event of a tie, a tie-breaking rule should be used to select the next index. For example, if the minimum in Algorithm 1 is realised by multiple candidate values Π(j) ⊆ {1, . . . , n}, one could adopt a tie-breaking rule that selects the smallest element of Π(j) as the value that is assigned to π(j). The rule that is used has no bearing on our theoretical analysis in Section 3.
Remark 2 (Complexity). The computation associated with iteration j of Algorithm 1 is O(n r_j), where r_j ≤ min(j, n) is the number of distinct indices in {π(1), . . . , π(j − 1)}; the computational complexity of Algorithm 1 is therefore O(n ∑_{j=1}^m r_j). For typical MCMC algorithms the computational complexity is O(n), so the complexity of Stein Thinning is equal to that of MCMC when m is fixed and higher when m is increasing with n, being at most O(nm²).
Remark 3 (Re-sampling). In general the indices in π need not be distinct. That is, Algorithm 1 may prefer to include a duplicate state rather than to include a state which is not useful for representing P. Indeed, if m > n then the sequence (x_{π(j)})_{j=1}^m must contain duplicate entries. Theorem 1 in Section 3 clarifies this behaviour.
Remark 4 (Finite sample error bound). The approximation produced by Stein Thinning satisfies a finite sample error bound: for any f such that f − ∫ f dP ∈ H(k_P),

    | (1/m) ∑_{j=1}^m f(x_{π(j)}) − ∫ f dP | ≤ D_{k_P}( (1/m) ∑_{j=1}^m δ(x_{π(j)}) ) ‖f − ∫ f dP‖_{H(k_P)},

following Hickernell (1998, Equation (1.14), applied to f − ∫ f dP). This can be contrasted with the typically asymptotic analysis of MCMC. The practical estimation of the final term in this bound was discussed in Section 4 of South et al. (2021).

Choice of Kernel
The suitability of KSD to quantify how well Q approximates P is determined by the choice of the kernel k in (6). Several choices are possible and, for P ∈ P, based on Proposition 3 together with extensive empirical assessment, Chen et al. (2019) advocated the pre-conditioned inverse multi-quadric kernel

    k(x, y) := (1 + ‖Γ^{−1/2}(x − y)‖²)^{−1/2},

where, compared to Proposition 3, we have fixed c = 1 (without loss of generality) and β = −1/2. The suitability of these choices for Stein Thinning is verified in Appendix S5.1. The positive definite matrix Γ remains to be specified and it is natural to take a data-driven approach where the MCMC output is used to select Γ. Provided that a fixed number n_0 ∈ N of the states (X_i)_{i=1}^{n_0} from the MCMC output are used in the construction of Γ, the consistency results for Stein Thinning that we establish in Section 3 are not affected. To explore different strategies for the selection of Γ, we focus on the following candidates:
• Median (med): The scaled identity matrix Γ = ℓ²I, where ℓ = ℓ_med := median{‖X_i − X_j‖ : 1 ≤ i < j ≤ n_0} is the median Euclidean distance between states (Garreau et al., 2018). In the rare case that ℓ_med = 0, an exception should be used, such as ℓ = 1, to ensure that a positive definite Γ is used.
• Scaled median (sclmed): The scaled identity matrix Γ = ℓ²I, where ℓ² = ℓ_med²/log(m). This was proposed in Liu and Wang (2016) and can be motivated using the approximation ∑_{j'=1}^m k_P(x_{π(j)}, x_{π(j')}) ≈ m exp(−ℓ^{−2} ℓ_med²) = 1. Note that the dependence on m means that the preceding theoretical analysis does not apply when this heuristic is used.
• Sample covariance (smpcov): The matrix Γ can be taken as a sample covariance matrix provided that this matrix is non-singular.
The experiments in Section 4 shed light on which of these settings is the most effective, but we acknowledge that many other settings could also be considered. In what follows, we set n_0 = min(n, 10³) for the med and sclmed settings, to avoid an O(n²) cost of computing ℓ, and otherwise set n_0 = n, so that the whole of the MCMC output is used to select Γ. Python, R and MATLAB packages are provided and their usage is described in Appendix S1.
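The three candidate settings for Γ can be sketched as follows (function names are our own; the n_0 truncation is applied as described above):

```python
import numpy as np

def gamma_med(x, n0=1000):
    """Median heuristic: Gamma = l^2 I, with l the median pairwise
    Euclidean distance among the first n0 states (l = 1 if the median is 0)."""
    x0 = x[:min(len(x), n0)]
    d = np.sqrt(np.sum((x0[:, None, :] - x0[None, :, :])**2, axis=-1))
    l = np.median(d[np.triu_indices(len(x0), k=1)])   # distinct pairs i < j
    return (l**2 if l > 0 else 1.0) * np.eye(x.shape[1])

def gamma_sclmed(x, m, n0=1000):
    """Scaled median heuristic: l^2 = l_med^2 / log(m)."""
    return gamma_med(x, n0) / np.log(m)

def gamma_smpcov(x):
    """Sample covariance heuristic (assumes the covariance is non-singular)."""
    return np.cov(x, rowvar=False)

# Illustration on a synthetic 3-dimensional "MCMC output"
rng = np.random.default_rng(1)
x = rng.standard_normal((100, 3))
g_med, g_scl, g_cov = gamma_med(x), gamma_sclmed(x, m=50), gamma_smpcov(x)
```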

Theoretical Assessment
The theoretical analysis in this section clarifies the limiting behaviour of Stein Thinning as m, n → ∞. Our first main result concerns the behaviour of Stein Thinning on a fixed sequence (x_i)_{i=1}^n:

Theorem 1. Let X be a measurable space and let P be a probability distribution on X. Let k_P : X × X → R be a reproducing kernel with ∫_X k_P(x, ·) dP = 0 for all x ∈ X. Let (x_i)_{i=1}^n ⊂ X be fixed and consider an index sequence π of length m produced by Algorithm 1. Then we have the bound

    D_{k_P}( (1/m) ∑_{j=1}^m δ(x_{π(j)}) )² ≤ D_{k_P}( ∑_{i=1}^n w_i* δ(x_i) )² + (C/m)(1 + log m),   C := max_{i=1,...,n} k_P(x_i, x_i),

where the weights w* = (w_1*, . . . , w_n*) in the first term satisfy

    w* ∈ arg min_{1_n^⊤ w = 1, w ≥ 0} ∑_{i=1}^n ∑_{i'=1}^n w_i w_{i'} k_P(x_i, x_{i'}),

where 1_n = (1, . . . , 1)^⊤ and w ≥ 0 indicates that w_i ≥ 0 for i = 1, . . . , n.
The proof of Theorem 1 is provided in Appendix S2.1. Its implication is that, given a sequence (x_i)_{i=1}^n, Stein Thinning produces an empirical distribution that converges in KSD to the optimal weighted empirical distribution ∑_{i=1}^n w_i* δ(x_i) based on that sequence. Properties of such optimally weighted empirical measures were studied in Liu and Lee (2017); Hodgkinson et al. (2020), and are not the focus of the present paper, where the case m ≪ n is of principal interest.
The role of Theorem 1 is to study the interaction between the greedy algorithm and a given sequence (x_i)_{i=1}^n, and this bound is central to our proof of Theorem 2, which deals with the case where (x_i)_{i=1}^n is replaced by MCMC output. Figure 2 illustrates the terms involved in Theorem 1. It is clear that a reduction in KSD is achieved by Stein Thinning of the MCMC output.
Remark 5 (Optimal weights). To further improve the empirical approximation, we can consider an optimally-weighted sum ∑_{j=1}^m w_{m,j}* δ(x_{π(j)}). The computation of w_m* is a quadratic programme, subject to a linear and a non-negativity constraint, and the weights can therefore be precisely computed. If the non-negativity constraint is removed and the indices in π are distinct, then the optimal weights admit the closed form

    v_m* := arg min_{∑_j v_j = 1} ∑_{j=1}^m ∑_{j'=1}^m v_j v_{j'} k_P(x_{π(j)}, x_{π(j')}) = K_P^{−1} 1_m / (1_m^⊤ K_P^{−1} 1_m),   [K_P]_{j,j'} := k_P(x_{π(j)}, x_{π(j')}),

as derived in Oates et al. (2017). Figure 2 indicates that the benefit of applying weights w_m* (red curve) to the output of Stein Thinning (black curve) is limited, likely because the x_{π(j)} were selected in a way that avoids redundancy in the point set. A larger improvement is provided by the weights v_m* (blue curve), but in this case the associated empirical measure may not be a probability distribution.

Figure 2: Illustration of Theorem 1. The gray curve represents the unprocessed output from MCMC in the example of Figure 1. The black curve represents Stein Thinning applied to this same output and, in addition, weighted output of Stein Thinning is shown for weights w_m*, subject to ∑_j w_{m,j}* = 1 and w_{m,j}* ≥ 0 (solid red), and weights v_m*, subject only to ∑_j v_{m,j}* = 1 (solid blue). The dashed horizontal lines are the limiting values of the corresponding solid lines as the number m is increased.
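When the non-negativity constraint is dropped, the optimal weights reduce to a single linear solve via the Lagrangian stationarity conditions (computing the non-negative weights w* instead requires a quadratic programming solver). A minimal sketch, demonstrated on a synthetic positive definite kernel matrix of our own construction:

```python
import numpy as np

def optimal_weights(kp_mat):
    """Weights v* minimising v^T K v subject only to sum(v) = 1, where
    K[j, j'] = k_P(x_pi(j), x_pi(j')); stationarity of the Lagrangian
    gives v* = K^-1 1 / (1^T K^-1 1)."""
    ones = np.ones(kp_mat.shape[0])
    sol = np.linalg.solve(kp_mat, ones)
    return sol / (ones @ sol)

# Demo: a synthetic positive definite "kernel matrix"
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
K = A @ A.T + 0.1 * np.eye(5)   # positive definite by construction
v = optimal_weights(K)          # sums to 1; entries may be negative
```

Because the weights may be negative, the resulting signed measure need not be a probability distribution, mirroring the caveat attached to v_m* above.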
Remark 6. The use of a conditional gradient algorithm, instead of a greedy algorithm, in this context amounts to simply removing the term k_P(x_{π(j)}, x_{π(j)}) in Algorithm 1. As discussed in Chen et al. (2018), this term can be thought of as a regulariser that lends stability to the algorithm, avoiding the selection of x_i that are far from the effective support of P.
Remark 7. Theorem 1 is formulated at a high level of generality and can be applied on non-Euclidean domains X . In Barp et al. (2021); Liu and Zhu (2018); Xu and Matsuda (2020); Le et al. (2020) the authors proposed and discussed Stein operators A P for the non-Euclidean context.
Next we consider the properties of Stein Thinning applied to MCMC output. Let V : X → [1, ∞) be a function and, for a function f : X → R and a measure µ on X, let

    ‖f‖_V := sup_{x∈X} |f(x)|/V(x),   ‖µ‖_V := sup_{‖f‖_V ≤ 1} | ∫ f dµ |.

Recall that a ψ-irreducible and aperiodic Markov chain (X_i)_{i∈N} ⊂ X with n-th step transition kernel P^n is V-uniformly ergodic (see Theorem 16.0.1 of Meyn and Tweedie, 2012) if and only if there exist R ∈ [0, ∞), ρ ∈ (0, 1) such that

    ‖P^n(x, ·) − P‖_V ≤ R V(x) ρ^n   (9)

for all initial states x ∈ X and all n ∈ N. The notation E will be used to denote expectation with respect to the law of the Markov chain in the sequel. Theorem 2 establishes a finite sample size error bound for Stein Thinning applied to MCMC output:

Theorem 2. Let X be a measurable space and let P be a probability distribution on X. Let k_P : X × X → R be a reproducing kernel with ∫_X k_P(x, ·) dP = 0 for all x ∈ X. Consider a P-invariant, time-homogeneous Markov chain (X_i)_{i∈N} ⊂ X generated using a V-uniformly ergodic transition kernel, such that (9) is satisfied with V(x) ≥ √(k_P(x, x)) for all x ∈ X. Let π be an index sequence of length m produced by Algorithm 1 applied to the Markov chain output (X_i)_{i=1}^n, with m ≤ n. Then

    E[ D_{k_P}( (1/m) ∑_{j=1}^m δ(X_{π(j)}) )² ] ≤ (C₁/m)(1 + log m) + (C₂/n) log n,   (10)

where C₁ and C₂ are constants that do not depend on m or n. The proof of Theorem 2 is provided in Appendix S2.2.
Remark 8. The upper bound in (10) is asymptotically minimised when (up to log factors) m is proportional to n. In practice we are interested in the case m ≪ n, so we may for example set m = n/1000 if we aim for substantial compression. It is not claimed that the bound in (10) is tight, and indeed the empirical results in Section 4 endorse the use of Stein Thinning in the small m context.
Remark 9. For P ∈ P and k_P in (6), based on a radial kernel k, meaning that k(x, y) = φ(x − y) for some function φ : R^d → R, one has ∇φ(0) = 0 and hence

    k_P(x, x) = −(∇ · ∇φ)(0) + φ(0) ‖∇ log p(x)‖².

The function x ↦ k_P(x, x) appearing in the preconditions of Theorem 2 can therefore be understood in terms of ‖∇ log p(x)‖. Further discussion of the preconditions of Theorem 2 is provided in Appendix S2.4.
Since convergence in mean-square does not in general imply almost sure convergence, we next strengthen the conclusions of Theorem 2. Our final result, Theorem 3, therefore establishes an almost sure convergence guarantee for Stein Thinning. Furthermore, the result that follows applies also in the "biased sampler" case, where (X_i)_{i∈N} is a Q-invariant Markov chain and Q need not equal P:

Theorem 3. Let Q be a probability distribution on X with P absolutely continuous with respect to Q. Consider a Q-invariant, time-homogeneous Markov chain (X_i)_{i∈N} ⊂ X generated using a V-uniformly ergodic transition kernel, such that (9) is satisfied (with Q in place of P) with V(x) ≥ √(k_P(x, x)) for all x ∈ X. Let π be an index sequence of length m produced by Algorithm 1 applied to the Markov chain output (X_i)_{i=1}^n. If m ≤ n and the growth of n is limited to at most log(n) = O(m^{β/2}) for some β < 1, then D_{k_P}( (1/m) ∑_{j=1}^m δ(X_{π(j)}) ) → 0 almost surely as m, n → ∞. Furthermore, if the preconditions of Proposition 3 are satisfied, then (1/m) ∑_{j=1}^m δ(X_{π(j)}) ⇒ P almost surely as m, n → ∞.
The proof of Theorem 3 is provided in Appendix S2.3. The interpretation of Theorem 3 is that one may sample states from a Markov chain that is not P -invariant and yet, under the stated assumptions (which ensure that regions of high probability under P are explored), one can use Stein Thinning to still obtain a consistent approximation of P . This can be contrasted, for example, with the Support Points method of Mak and Joseph (2018), which relies on P being well-approximated by the MCMC output. This completes our theoretical analysis of Stein Thinning.

Empirical Assessment
In this section we compare the performance of Stein Thinning with existing methods for post-processing MCMC output. Our motivation derives from a problem in which we must infer a 38-dimensional parameter in a calcium signalling model defined by a stiff system of 6 coupled ordinary differential equations (ODEs). Posterior uncertainty is required to be propagated through a high-fidelity simulation in a multi-scale and multi-physics model f of the human heart. Here, compression of the MCMC output can be used to construct an approximately optimal experimental design on which f can be evaluated. The calcium model is, however, unsuitable for conducting a thorough in silico assessment due to its associated computational cost. Therefore in Section 4.1 we first consider a simpler ODE model, where P can be accurately approximated. Then, as an intermediate example, in Section 4.2 we consider an ODE model that induces stronger correlations among the parameters in P , before addressing the calcium model in Section 4.3.
In Appendix S3 we describe the generic structure of a parameter inference problem for ODEs. In all instances the aim is to post-process the output from MCMC, in order to produce an accurate empirical approximation of the posterior supported on a small number m ≪ n of the states that were visited. The following methods were compared:
• The standard approach, which estimates a burn-in period using either the GR diagnostic b̂_GR,L, L > 1, of Gelman and Rubin (1992); Brooks and Gelman (1998); Gelman et al. (2014), or the more sophisticated VK diagnostic b̂_VK,L, L ≥ 1, of Vats and Knudson (2018), in each case based on L independent chains as described in Appendix S4, followed by thinning as per (3).
• The Support Points algorithm proposed in Mak and Joseph (2018), implemented in the R package support.
• The Stein Thinning algorithm that we have proposed, with each of the kernel choices described in Section 2.3.
To ensure that our empirical findings are not sensitive to the choice of MCMC method, we implemented four Metropolis-Hastings samplers that differ qualitatively according to the sophistication of their proposal. These were: (i) the Gaussian random walk (RW); (ii) the adaptive Gaussian random walk (ADA-RW), which uses an estimate of the covariance of the target (Haario et al., 1999); (iii) the Metropolis-adjusted Langevin algorithm (MALA), which takes a step in the direction of increasing Euclidean gradient, perturbed by Gaussian noise (Roberts and Tweedie, 1996); (iv) the preconditioned version of MALA (P-MALA), which employs a preconditioner based on the Fisher information matrix (Girolami and Calderhead, 2011). Full details are in Appendix S3. Metropolis-Hastings algorithms were selected on the basis that we were able to successfully implement them on the challenging calcium signalling model in Section 4.3, which required manually interfacing with the numerical integrator to produce reliable output.

Goodwin Oscillator
The first example that we consider is a negative feedback oscillator due to Goodwin (1965). The ODE model and the associated $d = 4$ dimensional inference problem are described in Appendix S5.2; one trace plot for each MCMC method, each of length $n = 2 \times 10^6$, is presented in Figure S3.
First we consider the standard approach to post-processing MCMC output, as per (3). From the trace plots in Figure S3, it is clear that a burn-in period $b > 0$ is required. For each method we therefore computed the GR and VK diagnostics, to arrive at candidate values of $b$ for the burn-in period. Default settings were used for all diagnostics, which were computed both for the multivariate $d$-dimensional state vector and for the univariate marginals, as reported in Appendix S5.2. The GR diagnostics were computed using $L = 6$ independent chains and the VK diagnostics were computed using both $L = 1$ and $L = 6$ independent chains; note that when $L > 1$, these diagnostics have access to more information than Stein Thinning, in terms of the number of samples available to the method. The estimated values for the burn-in period are reported in Appendix S5.2, Table S4. For all MCMC methods, neither the univariate nor the multivariate GR diagnostics were satisfied, so that $\hat{b}_{\mathrm{GR},6} > n$ and estimation using (3) cannot proceed. The VK diagnostic produced values $\hat{b}_{\mathrm{VK},L} < n$, which typically led to about half of the MCMC output being discarded. Although these diagnostics are well-suited to their intended task of minimising bias in MCMC output, the smaller number of states left after burn-in removal may lead to inefficient approximation of $P$ and derived quantities of interest, strikingly so in the case of the GR diagnostic. The use of an optimality criterion enables Stein Thinning to directly address this bias-variance trade-off. Of course, one can in principle run more iterations of MCMC to increase the diversity of the sample path that remains after burn-in is removed, but in applications such as the calcium model of Section 4.3 the computational cost associated with each iteration makes this impractical.
Effective methods to post-process limited output (or, equivalently, a long output from a poorly mixing Markov chain) are therefore important.
Having identified a burn-in period, the standard approach thins the remainder of the sample path according to (3). In the experiments that follow we focus on the VK diagnostic and consider both the smallest and largest estimates obtained for the burn-in period. The resulting index sets $\pi$ are displayed, for $m = 20$ and RW (the simplest MCMC method), in Figure 3, together with the output of the Support Points method of Mak and Joseph (2018); the remaining panels display the output from Stein Thinning. Compared to the standard approach, Support Points and Stein Thinning produce sets that are more structured.
To assess the performance of these competing methods, we first considered the toy problem of approximating the posterior mean of each parameter in the Goodwin oscillator as an average of m points selected from the MCMC output. Figure 4 displays absolute errors for each method, based on RW; our ground truth was provided by an extended run of MCMC. Results for the other MCMC methods are provided in Appendix S5.2, Figures S9 (ADA-RW), S10 (MALA), S11 (P-MALA). Broadly speaking, Stein Thinning tends to provide more accurate estimators compared to the alternatives considered. From Figure 4 it is difficult to see any difference in performance between med, sclmed and smpcov. To gain more insight, in Appendix S5.2 we plot marginal density estimates in Figures S12 (RW), S13 (ADA-RW), S14 (MALA), S15 (P-MALA). It is apparent that Stein Thinning improves on the standard approach, whilst med and sclmed performed slightly better than smpcov. This may be because in smpcov there are more degrees of freedom in Γ that must be estimated. Support Points performed on a par with Stein Thinning based on smpcov.
To facilitate a more principled assessment, we computed two quantitative measures of how well the resulting empirical distributions approximate the posterior. These were (a) the energy distance (ED; Székely and Rizzo, 2004; Baringhaus and Franz, 2004), given up to an additive constant by
$$\mathrm{ED} := \frac{2}{m} \sum_{j=1}^{m} \mathbb{E}\|x_{\pi(j)} - Y\|_\Sigma - \frac{1}{m^2} \sum_{j,j'=1}^{m} \|x_{\pi(j)} - x_{\pi(j')}\|_\Sigma, \qquad Y \sim P, \qquad (11)$$
where in this paper we used the norm $\|x\|_\Sigma := \|\Sigma^{-1/2} x\|$ induced by the covariance matrix $\Sigma$ of $P$, with both $\Sigma$ and the expectation in (11) estimated from MCMC output; and (b) the KSD based on med, the simplest setting for $\Gamma$. ED serves as an objective performance measure, being closely related to the quantity that Support Points attempts to minimise (Mak and Joseph (2018) used the norm $\|\cdot\|$ in place of $\|\cdot\|_\Sigma$), while KSD is the performance measure that is being directly optimised in Stein Thinning.

[Figure 3: Projections onto the first two coordinates of the RW MCMC output from the Goodwin oscillator (gray dots), together with the first m = 20 points selected using: the standard approach of discarding burn-in and thinning the remainder (the estimated burn-in period is indicated in the legend); the Support Points method; and Stein Thinning, for each of the settings med, sclmed, smpcov.]

Our decision to include KSD in the assessment is motivated by three factors: (i) ED is somewhat insensitive to detail, making it difficult to rank competing methods; (ii) the empirical approximation of ED in (11) relies on access to high-quality MCMC output, which will not be available in Section 4.3; (iii) Stein discrepancies are, to the best of our knowledge, the only computable performance measures in the Bayesian context that have been proven to provide convergence control. The results for ED are shown in Figure 5. Here Stein Thinning based on sclmed performed at least as well as the other methods considered and, surprisingly, out-performed Support Points when applied to MALA and P-MALA output. This may be because MALA and P-MALA provided worse approximations to $P$ compared with RW and ADA-RW (recall that Support Points relies on the MCMC output providing an accurate approximation of $P$). Note that neither ED nor KSD values will tend to 0 as $m \to \infty$ in this experiment, since the number $n$ of MCMC iterations was fixed. The corresponding results for KSD are presented in Figure 6 and show a clearer performance ordering of the competing methods, with Stein Thinning based on med and sclmed out-performing all other methods for all but the largest values of $m$ considered. The smpcov setting performed well for small $m$ but its performance degraded for large $m$. The performance ordering under KSD was identical across output from the different MCMC methods.
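For intuition, the ED computation between a candidate subset and a reference MCMC sample that stands in for P can be sketched as follows (our own code, not the paper's; the norm $\|x\|_\Sigma = \|\Sigma^{-1/2}x\|$ is implemented by whitening, and the reference-reference term, constant in the subset, is dropped):

```python
import numpy as np

def energy_distance(subset, reference, Sigma):
    """ED (up to an additive constant) between the empirical distribution
    of `subset` and that of `reference`, under the norm ||x||_Sigma."""
    # Whiten so that Euclidean distances equal || . ||_Sigma distances
    L = np.linalg.cholesky(np.linalg.inv(Sigma))
    A, B = subset @ L, reference @ L
    d_ab = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    d_aa = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    return 2.0 * d_ab.mean() - d_aa.mean()
```

Lower values indicate a subset whose empirical distribution is closer to that of the reference sample.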

Lotka-Volterra
The second example that we consider is the predator-prey model of Lotka (1926) and Volterra (1926). A description of the $d = 4$ dimensional inference task, the output from MCMC methods and the implementation of thinning procedures is reserved for Appendix S5.3. Compared to the Goodwin oscillator, the Lotka-Volterra posterior $P$ exhibits stronger correlation among parameters. This has consequences for our assessment, since now all MCMC methods, and in particular MALA, mixed less well compared to the corresponding results for the Goodwin oscillator, as can be seen from the trace plots in Appendix S5.3, Figure S17.

[Figure: Energy distance (11) for empirical distributions obtained through traditional burn-in and thinning (grey lines), Support Points (black line) and Stein Thinning (colored lines), based on output from four different MCMC methods.]

Results are reported for ED in Figure 7. It can be seen that Stein Thinning based on med and sclmed performed comparably with Support Points, being better for small $m$ in the case of RW and ADA-RW and marginally worse for large $m$ in RW, ADA-RW and P-MALA. Interestingly, the setting smpcov was associated with poor performance on output from RW, ADA-RW and especially P-MALA. This may be because, when $\Gamma$ is poorly conditioned, any error in an estimate for $\Gamma$ will be amplified when computing $\Gamma^{-1}$. However, in the case of MALA, which mixed poorly, the standard approach of burn-in removal and thinning performed poorly and all settings of Stein Thinning provided an improvement.
Results for KSD are reported in Figure 8. The performance ordering of competing methods under KSD is similar to that reported in Section 4.1, except for the smpcov setting which appears to improve the performance of Stein Thinning for larger values of m in the context of MALA. This may be because smpcov serves to "whiten" the correlation structure in P , such that the resulting geometry is more favourable for the construction of an empirical approximation. However, this improved performance was not seen on P-MALA. In all cases Stein Thinning out-performed Support Points.

Calcium Signalling Model
Our final example is a model for calcium signalling in cardiac cells, illustrated in Appendix S5.4, Figure S24. The model describes an electrically-activated intracellular calcium signal that in turn activates the sub-cellular sarcomere, causing the muscle cell to contract and the heart to beat. The intracellular calcium signal is crucial for healthy cardiac function. However, under pathological conditions, dysregulation of this intracellular signal can play a central role in the initiation and sustenance of life-threatening arrhythmias. Computational models are increasingly being applied to study this highly-orchestrated multi-scale signalling cascade, to determine how changes in cell-scale calcium regulation, encoded in calcium model parameters, impact whole-organ cardiac function (Campos et al., 2015; Niederer et al., 2019; Colman, 2019). The computational cost of simulating from tissue-scale and organ-scale models is high, with single simulations taking thousands of CPU hours (Niederer et al., 2011; Augustin et al., 2016; Strocchi et al., 2020). This limits the capacity to propagate uncertainty in calcium signalling model parameters up to organ-scale simulations, so that at present it remains unclear how uncertainty in calcium signalling parameters impacts the predictions made by a whole-organ model. Our motivation for developing Stein Thinning was to obtain a compressed representation of the posterior distribution for the $d = 38$ dimensional parameter of a calcium signalling model, based on a cell-scale experimental dataset, which can subsequently be used as an experimental design to propagate uncertainty through a whole-organ simulation. This motivating problem entails a second complication in that, compared to the example in Section 4.1 and even the example in Section 4.2, the development of an efficient MCMC method appears to be difficult.
Thus, in the experiment that follows, we cannot rely on any of the MCMC methods described at the start of Section 4 to provide anything more than a crude approximation of the posterior, at best. This is evidenced by the non-overlapping approximations to the posterior marginals produced when different random seeds are used; see Figures S26 to S29. (Of course, a more sophisticated sampling method may well exist for this task, but our aim here is not to develop a new sampling method.) Tempering of the likelihood provides a straightforward route to improve the mixing of MCMC, but the invariant distribution $Q$ will then no longer equal $P$. Here we explore the potential for Stein Thinning to perform bias-correction for such $Q$-invariant MCMC output, in the spirit of Theorem 3.
Our focus in the remainder is on output from the RW MCMC method. This method was selected since (a) gradient-free methods can be easier to tune when the posterior is concentrated (Livingstone and Zanella, 2020), and (b) once the sample path has been computed, the associated gradients can be computed in parallel. Both standard and tempered MCMC were performed; in the latter case the likelihood was tempered so that the (biased) target $Q$ was just about tractable for MCMC (see Appendix S5.4). In each case a total of $n = 4 \times 10^6$ iterations of MCMC were performed, representing two weeks' CPU time. Figure 9 reports the KSD based on med, for index sets of cardinality up to $m = 500$; see Appendix S5.4 for results for KSD based on sclmed (Figure S30) and smpcov (Figure S31). Considering first the tempered MCMC output, the lower values of KSD achieved by Stein Thinning are consistent with the fact that Stein Thinning corrects for bias due to tempering, while Support Points does not. Furthermore, Stein Thinning of tempered MCMC results in lower values of KSD compared to Support Points applied to standard MCMC output, with the latter being negatively affected by the non-convergence of the MCMC. Inspection of the univariate marginals demonstrates that the combination of tempering and Stein Thinning produces approximations that are robust to changes in the random seed, while the approximations produced by standard MCMC with an equivalent computational budget are not; see Figures S26 to S29.
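The interaction between tempering and Stein Thinning can be sketched in a few lines (our own notation, not the authors' code): MCMC samples from $q(x) \propto \pi(x)L(x)^{1/T}$ with $T > 1$, while Stein Thinning is supplied with gradients of the untempered log-posterior $\log p = \log \pi + \log L$, which is what enables the bias correction:

```python
def tempered_targets(log_prior, grad_log_prior, log_lik, grad_log_lik, T):
    """Return the tempered MCMC target log q and the untempered gradient
    grad log p used by Stein Thinning (a sketch, not the authors' code)."""
    log_q = lambda x: log_prior(x) + log_lik(x) / T              # sampled by MCMC
    grad_log_p = lambda x: grad_log_prior(x) + grad_log_lik(x)   # for Stein Thinning
    return log_q, grad_log_p
```

Only the likelihood is tempered, so the prior contribution to the target is unchanged.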

Conclusion
In this paper, standard approaches used to post-process and compress output from MCMC were identified as being sub-optimal when one considers the approximation quality of the empirical distribution that is produced. A novel method, Stein Thinning, was proposed that seeks a subset of the MCMC output, of fixed cardinality, such that the associated empirical approximation is close to optimal. The theoretical analysis that we have provided for Stein Thinning handles the effect of the post-processing procedure jointly with the randomness involved in simulating from the Markov chain, such that consistency of the overall estimator is established.
Although we focused on MCMC, the proposed method can be applied to any computational method that provides a collection of states as output. These include approximate (biased) MCMC methods, where Stein Thinning may be able to provide bias correction in the spirit of Theorem 3. On the other hand, the main limitation of Stein Thinning is that it requires gradients of the log-target to be computed, which is not always practical.
Our research was motivated by challenging parameter inference problems that arise for ODEs, in particular in cardiac modelling, where one is interested in propagating calcium signalling parameter uncertainty through a whole-organ simulation; a task that would naïvely be impractical or impossible using the full MCMC output. Our ongoing research is exploiting Stein Thinning in this context and is enabling us to perform scientific investigations that were not feasible beforehand. Furthermore, in a sequel we demonstrate that approximate implementations of Stein Thinning can massively reduce its computational cost (Teymur et al., 2021).

Supplementary Material
This electronic supplement contains details for our Python, R and MATLAB implementations of Stein Thinning, proofs for all novel theoretical results reported in Section 3 of the main text, as well as additional material relating to the experimental assessment in Section 4 of the main text. It is structured as follows:

S1 Software
To assist with applications of Stein Thinning we have provided code in Python, R and MATLAB. The code is available at: http://stein-thinning.org/ In this section we demonstrate Stein Thinning in Python, but the syntax in R and in MATLAB is almost identical. As an illustration of how Stein Thinning can be used to post-process output from Stan, consider a simple Stan script that produces 1000 correlated samples from a bivariate Gaussian model. The bivariate Gaussian model is used for illustration but, regardless of the complexity of the model being sampled, the output of Stan will be a fit object. The sampled points $x_i$ and the gradients $\nabla \log p(x_i)$ can be extracted from the returned fit object:

```python
import numpy as np
smpl = fit['x']
grad = np.apply_along_axis(fit.grad_log_prob, 1, smpl)
```

One can then perform Stein Thinning to obtain a subset of m = 40 states by running the following code:

```python
from stein_thinning.thinning import thin
idx = thin(smpl, grad, 40)
```

The thin function returns a NumPy array containing the row indices in smpl (and grad) of the selected points. The default usage requires no additional user input and is based on the sclmed setting from Section 2.3, informed by the empirical analysis of Section 4. Alternatively, the user can specify which setting to use for computing the preconditioning matrix Γ by setting the option string pre to either 'med', 'sclmed' or 'smpcov'. For example, the default setting corresponds to

```python
idx = thin(smpl, grad, 40, pre='sclmed')
```

The ease with which Stein Thinning can be used makes it possible to consider a wide variety of applications, including the ODE models that we considered in Section 4 of the main text.
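For readers who wish to see what thin does internally, the greedy selection rule is short. The following is a minimal NumPy sketch, not the package implementation: it uses the Stein kernel induced by the inverse multiquadric kernel $k(x, y) = (1 + \|x - y\|^2)^{-1/2}$ with identity preconditioner (so none of the med/sclmed/smpcov logic), and at step j selects the state minimising $k_P(x, x)/2 + \sum_{j' < j} k_P(x_{\pi(j')}, x)$:

```python
import numpy as np

def stein_kernel(x, y, gx, gy):
    """Stein kernel k_P induced by the IMQ kernel (1 + ||x-y||^2)^(-1/2),
    where gx, gy are the gradients of log p at x and y."""
    r = x - y
    rr = r @ r
    s = 1.0 + rr
    div = len(x) * s ** -1.5 - 3.0 * rr * s ** -2.5  # trace of mixed derivative
    return (div
            - s ** -1.5 * (r @ gy)     # grad_x k . grad log p(y)
            + s ** -1.5 * (r @ gx)     # grad_y k . grad log p(x)
            + s ** -0.5 * (gx @ gy))   # k(x,y) grad log p(x) . grad log p(y)

def thin_sketch(smpl, grad, m):
    """Greedy KSD minimisation: return indices of m selected states."""
    n = len(smpl)
    diag = np.array([stein_kernel(smpl[i], smpl[i], grad[i], grad[i])
                     for i in range(n)])
    obj = diag / 2.0                   # running objective for each candidate
    idx = np.empty(m, dtype=int)
    for j in range(m):
        k = int(np.argmin(obj))
        idx[j] = k
        # add k_P(x_k, x_i) to every candidate's objective
        obj += np.array([stein_kernel(smpl[k], smpl[i], grad[k], grad[i])
                         for i in range(n)])
    return idx
```

On a Gaussian example the first selected point is the state closest to the mode, since $k_P(x, x) = d + \|\nabla \log p(x)\|^2$ for this kernel (cf. Remark 9).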

S2 Proofs
This appendix contains detailed proofs for all novel theoretical results in the main text.

S2.1 Proof of Theorem 1
First we state and prove two elementary results that will be useful:

Lemma 1. For all $a, b \ge 0$ it holds that $2a\sqrt{a^2 + b} \le 2a^2 + b$.

Proof. Since all quantities are non-negative, we may square both sides to get an equivalent inequality $4a^2(a^2 + b) \le (2a^2 + b)^2$. Expanding the brackets and cancelling terms leads to $0 \le b^2$, which is guaranteed to hold.
Lemma 2. For all $m \in \mathbb{N}$ it holds that $\sum_{j=1}^{m} \frac{1}{j} \le 1 + \log(m)$.
Proof. Since $x \mapsto \frac{1}{x}$ is convex on $x \in (0, \infty)$, the Riemann sum $\sum_{j=2}^{m} \frac{1}{j}$ is a lower bound for the Riemann integral $\int_1^m \frac{1}{x}\,dx = \log(m)$; adding the $j = 1$ term gives the result.

Proof of Theorem 1. Let $a_m := m^2 D_{k_P}(\frac{1}{m}\sum_{j=1}^{m} \delta(x_{\pi(j)}))^2$ and $f_m := \sum_{j=1}^{m} k_P(x_{\pi(j)}, \cdot)$, and also let $S^2 := \max_{i=1,\dots,n} k_P(x_i, x_i)$. Recall that $H(k_P)$ denotes the reproducing kernel Hilbert space of the kernel $k_P$, and pick an element $h^* \in H(k_P)$ of the form $h^* := \sum_{i=1}^{n} w_i^* k_P(x_i, \cdot)$, whose norm is the minimal KSD attainable under the constraint (8). Now, let $M$ denote the convex hull of $\{k_P(x_i, \cdot)\}_{i=1}^{n}$, so that $h^* \in M \subset H(k_P)$ and therefore (12) holds. Noting that $a_m = \|f_m\|^2_{H(k_P)}$, we have from (12) and Cauchy-Schwarz the recursive bound (13). Letting $C_m$ be defined as in (14), we will establish by induction that (15) holds. This will in turn prove the result, where the upper bound on $C_m$ follows from the fact that $\|h^*\|_{H(k_P)} \le S$, combined with Lemma 2.
The remainder of the proof is dedicated to establishing the induction in (15). The base case $m = 1$ is satisfied since $a_1 = D_{k_P}(\delta(x_{\pi(1)}))^2 = k_P(x_{\pi(1)}, x_{\pi(1)}) \le S^2$ and $C_1 = S^2 - \|h^*\|^2_{H(k_P)}$, so that $a_1 \le \|h^*\|^2_{H(k_P)} + C_1$. For the inductive step, we assume that (15) holds when $m$ is replaced by $m - 1$ and aim to derive (15). From (13) and the inductive assumption, we obtain a bound on $a_m$; the induction (15) will therefore follow from (16). From Lemma 1 it must hold that (17) is satisfied, and algebraic simplification reveals that (17) is equivalent to (18); using (14), we verify that (18) is satisfied as an equality. This completes the inductive argument.
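As a quick numerical sanity check (ours, not part of the paper) of the two elementary inequalities used in the proof, namely $4a^2(a^2 + b) \le (2a^2 + b)^2$, equivalently $2a\sqrt{a^2 + b} \le 2a^2 + b$, and the harmonic-sum bound of Lemma 2:

```python
import math

# Check 2a*sqrt(a^2 + b) <= 2a^2 + b on a grid of non-negative values
for a in [0.0, 0.5, 1.0, 3.0]:
    for b in [0.0, 0.1, 2.0, 10.0]:
        assert 2 * a * math.sqrt(a * a + b) <= 2 * a * a + b + 1e-12

# Check sum_{j=1}^m 1/j <= 1 + log(m)
for m in [1, 2, 10, 1000]:
    assert sum(1 / j for j in range(1, m + 1)) <= 1 + math.log(m) + 1e-12
```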
Remark 10. Our results can also be applied to maximum mean discrepancies, where $D_F$ in (4) is given by $F = \{f \in H(k) : \|f\|_{H(k)} \le 1\}$ and $k : X \times X \to \mathbb{R}$ denotes a reproducing kernel (Song, 2008). Indeed, we can set $k_P(x, y) := k(x, y) - \int k(x, \cdot)\,dP - \int k(\cdot, y)\,dP + \iint k\,dP\,dP$ in order for $\int_X k_P(x, \cdot)\,dP = 0$ to be satisfied for all $x \in X$, and observe that this construction ensures $D_F$ and $D_{k_P}$ are identical. Teymur et al. (2021) explore these consequences of our results in detail.

S2.2 Proof of Theorem 2
First we state and prove a technical lemma that will be useful:

Lemma 3. Let $X$ be a measurable space and let $Q$ be a probability distribution on $X$. Let $k_Q : X \times X \to \mathbb{R}$ be a reproducing kernel with $\int_X k_Q(x, \cdot)\,dQ = 0$ for all $x \in X$. Consider a $Q$-invariant, time-homogeneous Markov chain $(X_i)_{i \in \mathbb{N}} \subset X$ generated using a $V$-uniformly ergodic transition kernel, such that $V(x) \ge k_Q(x, x)$ for all $x \in X$, with parameters $R \in [0, \infty)$ and $\rho \in (0, 1)$ as in (9). Then, with $C = \frac{2R\rho}{1-\rho}$, the double sum $\sum_{i=1}^{n} \sum_{r \in \{1,\dots,n\}\setminus\{i\}} \mathbb{E}[k_Q(X_i, X_r)]$ admits the stated bound.

Proof. First recall that, given random variables $X, Y$ taking values in $X$, the conditional mean embedding of the distribution $\mathbb{P}[X \mid Y = y]$ is the function $\mathbb{E}[k_Q(X, \cdot) \mid Y = y] \in H(k_Q)$ (Song et al., 2009); by the reproducing property, $\mathbb{E}[k_Q(X, y) \mid Y = y]$ is the evaluation of this embedding at $y$. In what follows it is convenient to introduce a new random variable $Z$, independent from the Markov chain, such that $Z \sim Q$. Then, since $\mathbb{E}[k_Q(Z, \cdot)] = 0$, we have $\mathbb{E}[h(Z)] = 0$ for any $h \in H(k_Q)$. Let $Q^n$ denote the $n$-th step transition kernel of the Markov chain. From $V$-uniform ergodicity we obtain a geometric bound; applying this to $Y = X_i$, $X = X_{i+r}$, and taking the expectation on both sides, we may finally use (19) to obtain the claimed bound on $\sum_{i=1}^{n} \sum_{r \in \{1,\dots,n\}\setminus\{i\}} \mathbb{E}[k_Q(X_i, X_r)]$.
We can now prove the main result: Proof of Theorem 2. Taking expectations of the bound in Theorem 1, we have that .
Bounding ( * ): To bound the term ( * ), note that, due to the optimality property of the weights $w^*$ presented in (8), it is sufficient to study the KSD of the un-weighted empirical distribution $\frac{1}{n}\sum_{i=1}^{n} \delta(X_i)$. To this end, this KSD decomposes into the two terms in (20). To bound the first term in (20) we use Jensen's inequality; the second term in (20) can be bounded via Lemma 3 with $Q = P$, where $C$ is defined in Lemma 3.
Bounding ( * * ): We proceed as follows: Overall Bound: Combining our bounds on ( * ) and ( * * ) leads to the overall bound

S2.3 Proof of Theorem 3
To facilitate a neat proof of Theorem 3 we first present two useful lemmas, the first of which establishes almost sure convergence in KSD of the empirical distribution based on the full MCMC output: Lemma 4. Let Q be a probability distribution on X . Let k Q : X × X → R be a reproducing kernel with X k Q (x, ·)dQ = 0 for all x ∈ X . Consider a Q-invariant, time-homogeneous Markov chain (X i ) i∈N ⊂ X , generated using a V -uniformly ergodic transition kernel such that V (x) ≥ k Q (x, x) for all x ∈ X . Suppose that, for some γ > 0, almost surely as n → ∞.
Proof. Similarly to the proof of Theorem 2, we start from the decomposition in (22). To bound the first term in (22) we use Jensen's inequality; the second term in (22) can be bounded via Lemma 3, where $C$ is defined in Lemma 3. This establishes the bound (23). To simplify notation we adopt a shorthand $D_n$ in this proof only. Fix $\epsilon > 0$. If $D_n > \epsilon$ occurs infinitely often (i.o.) then there are infinitely many $r$ such that $\max_{r^2 \le n < (r+1)^2} D_n^2 > \epsilon^2$, so that (24) holds. Now, consider the bound (25), where the inequality follows from the fact that, for any $a, b \in \mathbb{R}$, if $a > \epsilon$ then either $b > \frac{\epsilon}{2}$ or $|a - b| > \frac{\epsilon}{2}$. In the remainder we will show that the sums ( * ) and ( * * ) are finite, so that (25) follows from the Borel-Cantelli lemma. Since (25) holds for all $\epsilon > 0$, it will follow from (24) that $\mathbb{P}[D_n \to 0] = 1$, as claimed.
Our second lemma is a technical result on almost sure convergence:

Lemma 5. Let $f$ be a non-negative function on $X$. Consider a sequence of random variables $(X_i)_{i \in \mathbb{N}} \subset X$ such that, for some $\gamma > 0$, the stated moment bound holds. If $m \le n$ and the growth of $n$ is limited to at most $\log(n) = O(m^{\beta/2})$ for some $\beta < 1$, then the stated convergence holds almost surely as $m, n \to \infty$.
Proof. To simplify notation we adopt a shorthand $E_m$ in this proof only, where $n = n(m)$. The argument is similar to the proof of Lemma 4. Fix $\epsilon > 0$. If $E_m > \epsilon$ i.o. then there are infinitely many $r$ such that $\max_{r^2 \le m < (r+1)^2} E_m > \epsilon$, so that (27) holds. Now, consider the bound in which the sums ( * ) and ( * * ) appear. In the remainder we will show that ( * ) and ( * * ) are finite, so that (28) follows from the Borel-Cantelli lemma. Since (28) holds for all $\epsilon > 0$, it will follow from (27) that $\mathbb{P}[E_m \to 0] = 1$, as claimed.

Now we present the proof of Theorem 3:
Proof of Theorem 3. Our starting point is again the bound in Theorem 1: .
For term ( * ), note that, due to the optimality property of the weights $w^*$ presented in (8), it suffices to study the KSD of the un-weighted empirical distribution. Further note that this quantity can be expressed in terms of $k_Q(x, y) := \frac{dP}{dQ}(x)\, k_P(x, y)\, \frac{dP}{dQ}(y)$, a reproducing kernel such that $\int_X k_Q(x, \cdot)\,dQ = 0$ for all $x \in X$. The preconditions of Theorem 3 ensure that $V(x) \ge k_Q(x, x)$, and therefore we may apply Lemma 4 to obtain almost sure convergence as $n \to \infty$. Moreover, since $\int_X \frac{dP}{dQ}\,dQ < \infty$, it follows from Meyn and Tweedie (2012, Theorem 17.0.1, part (i)) that the associated ergodic averages also converge almost surely as $n \to \infty$. Standard properties of almost sure convergence thus imply that ( * ) $\to 0$ almost surely as $n \to \infty$.
For term ( * * ), we notice that and we can therefore use Lemma 5 with f (x) = k P (x, x) to deduce that ( * * ) → 0 almost surely as m, n → ∞.
Thus we have established the claimed convergence almost surely as $m, n \to \infty$. The final part of the statement of Theorem 3 is immediate from Proposition 3.

S2.4 Satisfying the Conditions of Theorem 2
The conditions for Theorem 2 are agnostic to the specific Markov chain used (e.g. Metropolis-Hastings, Gibbs sampling, etc.), making the result quite general. In this appendix we discuss how explicit sufficient conditions can be obtained if one restricts attention to a specific MCMC method. Here we focus on the Metropolis-adjusted Langevin algorithm (MALA; whose definition is recalled in Appendix S3). Let $q(x, y)$ denote the probability density for the proposal $x \to y$ in MALA, with step size $\epsilon > 0$ fixed. Let $A(x) \subset \mathbb{R}^d$ be the set of values $y$ such that a proposed move $x \to y$ is always accepted, let $I(x) := \{y \in \mathbb{R}^d : \|y\| \le \|x\|\}$, and let $A \Delta B$ denote the symmetric difference of sets $A$ and $B$; MALA is called inwardly convergent when $A(x)$ and $I(x)$ agree in a suitable limiting sense, see Section 4 of Roberts and Tweedie (1996). The following result will then be established:

Lemma 6. Let $P \in \mathcal{P}$ be distantly dissipative on $X = \mathbb{R}^d$, let $\mathbb{E}_{X \sim P}[\exp(\beta\|X\|^2)] < \infty$ for some $\beta \in (0, \infty)$, and assume that MALA is inwardly convergent. Then, with kernel $k(x, y) = (1 + \|x - y\|^2)^{-1/2}$, the conditions of Theorem 2 are satisfied.
Proof. This proof exploits Theorem 9 of Chen et al. (2019), which establishes $V$-uniform ergodicity of MALA for each of $V(x) = \exp(\gamma\|x\|)$ (any $\gamma > 0$), $V(x) = \exp(\gamma\|x\|^2)$ (for $\gamma > 0$ sufficiently small) and $V(x) = 1 + \|x\|^\gamma$ ($\gamma \in \{1, 2\}$). Each choice of $V$ leads to a different set of preconditions for Theorem 2, and the claimed result follows from taking $V(x) = 1 + \|x\|^2$. Note that w.l.o.g. we can consider $V(x) = C(1 + \|x\|^2)$ for any fixed $C \in [1, \infty)$, since the constant $C$ cancels in the definition of $V$-uniform ergodicity. The conditions for Theorem 2 now simplify as follows. First Condition: The form of $k$ implies that $k_P(x, x) = d + \|\nabla \log p(x)\|^2$ (see Remark 9). Since $\nabla \log p$ is assumed to be Lipschitz we have, for $C$ sufficiently large, $k_P(x, x) \le C(1 + \|x\|^2) = V(x)$, so that the first condition of Theorem 2 is automatically satisfied.
Third Condition: The argument used for the second condition can again be applied, this time with $\tilde{V}(x) = 1 + \|x\|^s$ for $s \in \{1, 2\}$. Specifically, we obtain the bound in (31), with the latter being implied by the stronger moment condition in (32).
The sufficient conditions presented in Lemma 6 may be explicitly verified, with the possible exception of the inwards convergence condition of Roberts and Rosenthal (2004).

S3 Experimental Protocol
In this appendix we describe the generic structure of a parameter inference problem for a system of ODEs, that forms our empirical test-bed.
Consider the solution $u$ of a system of $q$ coupled ODEs of the form
$$\frac{du_i}{dt} = F_i(u_1, \dots, u_q; x), \qquad i = 1, \dots, q, \qquad (33)$$
together with the initial condition $u(0) = u_0 \in \mathbb{R}^q$. The functions $F_i$ that define the gradient field are assumed to depend on a number $d$ of parameters, collectively denoted $x \in \mathbb{R}^d$, and the $F_i$ are assumed to be differentiable with respect to $u_1, \dots, u_q$ and $x$. It is assumed that $u(t)$ exists and is unique on an interval $t \in [0, T]$ for all values $x \in \mathbb{R}^d$. For simplicity, in this work we assumed that the initial condition $u_0$ does not depend on $x$ and is known. The goal is to make inferences about the parameters $x$ based on noisy observations of the state vector $u(t_i)$ at discrete times $t_i$; this information is assumed to be contained in a likelihood of the form $L(x) = \prod_i \phi_i(u(t_i))$, where the functions $\phi_i : \mathbb{R}^q \to [0, \infty)$, describing the nature of the measurement at time $t_i$, are problem-specific and to be specified. The parameter $x$ is endowed with a prior density $\pi(x)$ and the posterior of interest $P$ admits a density $p(x) \propto \pi(x) L(x)$. Computation of the gradient $\nabla \log p$ therefore requires computation of $\nabla \log \pi$ and $\nabla \log L$; the latter can be performed by augmenting the system in (33) with the sensitivity equations, as described next.
Straightforward application of the chain rule leads to the following expression for the gradient of the log-likelihood:
$$\nabla_x \log L(x) = \sum_i \left(\frac{\partial u}{\partial x}(t_i)\right)^{\!\top} \nabla_u \log \phi_i(u(t_i)),$$
where $(\partial u / \partial x)_{r,s} := \partial u_r / \partial x_s$ is the (time-dependent) matrix of sensitivities of the solution $u$ to the parameter $x$. Sensitivities can be computed by augmenting the system in (33) and simultaneously solving the forward sensitivity equations
$$\frac{d}{dt} \frac{\partial u_r}{\partial x_s} = \sum_{l=1}^{q} \frac{\partial F_r}{\partial u_l} \frac{\partial u_l}{\partial x_s} + \frac{\partial F_r}{\partial x_s},$$
together with the initial condition $(\partial u_r / \partial x_s)(0) = 0$, which follows from the independence of $u_0$ and $x$.

The experiments reported in Section 4 were based on four distinct Metropolis-Hastings MCMC methods, whose details have not yet been described. The generic structure of the proposal mechanism is $x^* = x_{n-1} + H \nabla \log p(x_{n-1}) + G \xi_n$, where the $\xi_n \sim N(0, I)$ are independent. The matrices $H$ and $G$ are specified in Table S1. Our implementation of these samplers interfaces with the CVODES library (Hindmarsh et al., 2005), which presents a practical barrier to reproducibility. Moreover, the CPU time required to obtain MCMC samples was approximately two weeks for the calcium model. Since our research focused on post-processing of MCMC output, rather than MCMC itself, we directly make available the full output from each sampler on each model considered at https://doi.org/10.7910/DVN/MDKNWM. This Harvard Dataverse download consists of a single ZIP archive (1.5GB) that contains, for each ODE model and each MCMC method, the states $(x_i)_{i=1}^{n}$ visited by the Markov chain, their corresponding gradients $\nabla \log p(x_i)$ and the values $p(x_i)$ up to an unknown normalisation constant. The Stein Thinning software described in S1 can be used to post-process these datasets with minimal effort, enabling our findings to be reproduced.
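As a toy illustration of the forward sensitivity approach (our own example, using a one-state model and forward Euler rather than CVODES): for $du/dt = -xu$ with $u(0) = u_0$, the sensitivity $s = \partial u/\partial x$ obeys $ds/dt = -xs - u$ with $s(0) = 0$, and integrating the augmented system recovers the analytic sensitivity $s(t) = -t u_0 e^{-xt}$:

```python
import math

def solve_with_sensitivity(x, u0, T, steps):
    """Euler integration of du/dt = -x*u augmented with its sensitivity
    ds/dt = -x*s - u, s(0) = 0, where s = du/dx."""
    dt = T / steps
    u, s = u0, 0.0
    for _ in range(steps):
        du = -x * u
        ds = -x * s - u   # (dF/du) s + dF/dx, with F(u; x) = -x u
        u, s = u + dt * du, s + dt * ds
    return u, s
```

The sensitivity s feeds directly into the chain rule expression for the gradient of the log-likelihood.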

S4 Convergence Diagnostics for MCMC
Rigorous approaches for selecting a burn-in period b have been proposed by authors including Meyn and Tweedie (1994); Rosenthal (1995); Roberts and Tweedie (1999); see also Jones and Hobert (2001). Unfortunately, these often involve conditions that are difficult to establish (Biswas et al., 2019, discuss how some of the terms appearing in these conditions can be estimated), or, when they hold, they provide loose bounds, implying an unreasonably long burn-in period.

RW: $H = 0$, $G = \epsilon I$; step size $\epsilon$ selected following Roberts and Rosenthal (2001).
ADA-RW (Haario et al., 1999): $H = 0$, $G \propto \hat{\Sigma}^{1/2}$, where $\hat{\Sigma}$ is the sample covariance matrix of preliminary MCMC output.
MALA (Roberts and Tweedie, 1996): $H = \frac{\epsilon^2}{2} I$, $G = \epsilon I$; step size $\epsilon$ selected following Roberts and Rosenthal (2001).
P-MALA (Girolami and Calderhead, 2011): as MALA, but preconditioned by $M(x)^{-1}$, where $F(x)$ is the Fisher information matrix at $x$ and $\Sigma_0$ is the prior covariance matrix.
Table S1: Parameters $H$ and $G$ used in the Metropolis-Hastings proposal.
Convergence diagnostics have emerged as a practical solution to the need to test for non-convergence of MCMC. Their use is limited to reducing bias in MCMC output; they are not optimised for the fixed-n setting, which requires a bias-variance trade-off. Nevertheless, convergence diagnostics constitute the principal means by which MCMC output is post-processed. In this section we recall standard practice for selection of a burn-in period b in constructing an estimator of the form (3), focussing on the widely-used diagnostics of Gelman and Rubin (1992); Brooks and Gelman (1998); Gelman et al. (2014) (the GR diagnostic), as well as the more recent work of Vats and Knudson (2018) (the VK diagnostic).

The GR diagnostic is based on running L independent chains, each of length n, with starting points that are over-dispersed with respect to the target. Obtaining initial points with this property is not trivial, because the target is not known beforehand; we refer to the original literature for advice on how to select these initial points but note that, in practice, it is not uncommon to guess them. When the support of the target distribution is uni-dimensional (or when d > 1, but a specific uni-dimensional summary f(x) is used), the GR diagnostic R̂_GR,L is obtained as the square root of the ratio of two estimators of the variance σ² of the target. In particular, R̂_GR,L := √(σ̂²/s²), where s² is the (arithmetic) mean of the sample variances s²_l, l = 1, …, L, of the chains, which typically provides an underestimate of σ², and σ̂² is constructed as an overestimate of the target variance,

σ̂² := ((n − 1)/n) s² + B/n,    (36)

where the term B/n is an estimate of the asymptotic variance of the sample mean of the Markov chain. In the original GR diagnostic, this asymptotic variance was estimated as the sample variance of the means X̄_l, l = 1, …, L, from the L chains.

The improved VK diagnostic, R̂_VK,L, is formally obtained in the same way as (36), but with a more efficient estimator τ̂²/n of the asymptotic variance used in place of B/n. A number of options are available here, but the (lugsail) batch means estimator of Vats and Flegal (2018) is recommended because it is guaranteed to be biased from above, while still being consistent (in our simulations we use batches of size ⌊n^(1/3)⌋). This gain in efficiency leads to improved performance of the VK diagnostic over the GR diagnostic, in the sense that it is less sensitive to the randomness in the Markov chains and to the number of chains used. In particular, R̂_VK,L can be computed using one chain only (L = 1), which has clear practical appeal.
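The GR construction just described can be sketched in a few lines. The function below is a minimal illustration for an (L, n) array of chains, not the coda implementation used in this work: s² is the mean within-chain variance, B/n the sample variance of the chain means, and σ̂² the usual overestimate of the target variance.

```python
import numpy as np

def gr_diagnostic(chains):
    """Univariate Gelman-Rubin diagnostic for an (L, n) array of chains."""
    L, n = chains.shape
    s2 = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B_over_n = chains.mean(axis=1).var(ddof=1)  # variance of the chain means
    sigma2 = (n - 1) / n * s2 + B_over_n        # overestimate of target variance
    return np.sqrt(sigma2 / s2)

rng = np.random.default_rng(0)
chains = rng.standard_normal((4, 10000))  # four independent, well-mixed "chains"
print(gr_diagnostic(chains))  # close to 1 for well-mixed chains
```

For chains that have not converged (e.g. with over-dispersed means), B/n inflates σ̂² and the diagnostic exceeds 1.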
For an ergodic Markov chain, R̂_GR,L and R̂_VK,L converge to 1 as n → ∞, so that selection of a suitable burn-in period b amounts to observing when these diagnostics fall below 1 + δ, where δ is a suitable threshold. In the literature on R̂_GR,L, the somewhat arbitrary choice δ = 0.1 is commonly used; see Gelman et al. (2014, Ch. 11.5) and the survey in Vats and Knudson (2018). In the literature on R̂_VK,L, Vats and Knudson (2018) showed how δ can be selected by exploiting the relationship between R̂_VK,L and the effective sample size (ESS) when estimating the mean of the target. In particular, it is possible to re-write

R̂_VK,L = √((n − 1)/n + L/ESŜ),    (37)

where ESŜ is a strongly consistent estimator of the ESS. One can therefore (approximately) select a threshold δ that corresponds to a pre-specified value of the ESS. The literature on error assessment for MCMC provides guidance on how large the ESS ought to be in order that the width of a (1 − α)% confidence interval for the mean is less than a specified threshold ε; see Jones and Hobert (2001). In the univariate case,

ESS ≥ M_α,ε := (2² π / Γ(1/2)²) χ²_(1−α,1) / ε²,    (38)

where Γ(·) is the gamma function and χ²_(1−α,1) is the (1 − α)th quantile of the χ² distribution with one degree of freedom. Plugging (38) into (37) leads to the conclusion that, after the first iteration for which R̂_VK,L falls below 1 + δ, with δ = δ(L, α, ε) defined in (39), the chain will provide an estimate of the mean with small Monte Carlo error, when compared to the variability of the target. The default choices α = 0.05 and ε = 0.05 were suggested in Vats and Knudson (2018), and were used in our work. For the experiments reported in this paper we used (39) to select an appropriate threshold for both R̂_GR,L and R̂_VK,L, which leads to estimated burn-in periods that we denote b̂_GR,L and b̂_VK,L, respectively, in the main text.
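The lower bound (38) on the ESS is straightforward to evaluate. The sketch below (ours, not part of any package) obtains the χ² quantile with one degree of freedom as the square of a standard normal quantile, so that only the Python standard library is needed.

```python
from math import gamma, pi
from statistics import NormalDist

def min_ess(alpha=0.05, eps=0.05):
    """Minimum ESS so that a (1 - alpha) confidence interval for the mean
    has width below eps, relative to the target's variability (univariate case)."""
    # chi-square(1) quantile as the square of a standard normal quantile
    chi2_q = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    # M = 2^2 * pi / Gamma(1/2)^2 * chi2_{1-alpha,1} / eps^2
    return (2 ** 2 * pi / gamma(0.5) ** 2) * chi2_q / eps ** 2

print(round(min_ess()))  # the defaults alpha = eps = 0.05 require roughly 6146 effective samples
```

Note that 2²π/Γ(1/2)² = 4, since Γ(1/2) = √π, so the bound is simply 4 χ²_(0.95,1)/ε² at the defaults; relaxing ε shrinks the required ESS quadratically.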
The above discussion focussed on the univariate case, but generalisations of these convergence diagnostics are available and can be found in Brooks and Gelman (1998) and Vats and Knudson (2018). All convergence diagnostics in this work were computed using the R packages coda (Plummer et al., 2006) and stableGR (Knudson and Vats, 2020).

S5 Empirical Assessment: Additional Results
This section first explores the effect of the choice of kernel, then collects together additional empirical results that accompany the assessment in Section 4.

S5.1 Choice of Kernel
This section concerns the selection of the kernel parameter β ∈ (−1, 0), which enters into the kernel k(x, y) = (c² + ‖Γ^(−1/2)(x − y)‖²)^β used within Stein Thinning. The experiment of Figure 1 was repeated for different values of β ∈ {−0.1, −0.5, −0.9}, and for each setting of the Γ preconditioner matrix (med, sclmed, smpcov), with the results displayed in Figure S1. It can be seen that the states selected when β = −0.5 are qualitatively reasonable. However, poor results were observed for β ∈ {−0.1, −0.9}. At β = −0.1, states were selected that were closer together than would intuitively be expected. At β = −0.9, stratification of selected states across the two components of the target was not achieved. These results reflect the fact that β ∈ {−0.1, −0.9} are edge cases for KSD, which has been shown to enjoy convergence control only for β ∈ (−1, 0) (see Theorem 4 in Chen et al., 2019). These results support the use of β = −0.5 as a default in Stein Thinning.
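For concreteness, the kernel above can be evaluated via the identity ‖Γ^(−1/2)(x − y)‖² = (x − y)ᵀ Γ⁻¹ (x − y). The sketch below uses illustrative parameter values and variable names (it is not the Stein Thinning API), with the default β = −0.5 supported by the experiments in this section.

```python
import numpy as np

def imq_kernel(x, y, gamma_inv, c=1.0, beta=-0.5):
    """Preconditioned inverse multiquadric kernel
    k(x, y) = (c^2 + ||Gamma^{-1/2}(x - y)||^2)^beta, with beta in (-1, 0)."""
    d = np.asarray(x) - np.asarray(y)
    return float((c ** 2 + d @ gamma_inv @ d) ** beta)

# med-style preconditioner: Gamma = ell^2 I, with ell a median pairwise
# distance of the MCMC output (ell = 2.0 is an illustrative value)
ell = 2.0
gamma_inv = np.eye(2) / ell ** 2
k = imq_kernel([0.0, 0.0], [1.0, 1.0], gamma_inv)  # lies in (0, 1] since beta < 0
```

Because β < 0, the kernel is maximised at x = y (where k = c^(2β)) and decays as the preconditioned distance grows, which is why the choice of Γ interacts with the choice of β.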

S5.2 Goodwin Oscillator
The Goodwin oscillator is a phenomenological model for genetic regulatory processes in a cell and is described by g coupled ODEs of the form

du₁/dt = a₁ / (1 + a₂ u_g^ρ) − α u₁,
du_j/dt = k_{j−1} u_{j−1} − α u_j,   j = 2, …, g,

with fixed Hill exponent ρ, where the first component u₁ represents the concentration of mRNA, u₂ that of its corresponding protein product, while u₃, …, u_g represent concentrations of proteins in a signalling cascade, which can either be present (g > 2) or absent (g = 2), and with the gth protein having a negative feedback on the production of mRNA, by means of the Hill curve in the first equation. The nontrivial oscillations that result have led to the Goodwin oscillator being used in previous studies to assess the performance of Bayesian computational methods (Calderhead and Girolami, 2009; Oates et al., 2016; Chen et al., 2019). The parameters a₁, k₁, …, k_{g−1} > 0 represent synthesis rates, while a₂, α > 0 represent degradation rates. To cast this model in the setting of Section 2 we set x ∈ R^(g+2) to be the vector whose entries are log(a₁), log(k₁), and so forth, so that we have a d = g + 2 dimensional parameter for which inference is performed. The experiment that we report considers synthetic data y_i ∈ R^g generated in the simple case g = 2, which are then corrupted by Gaussian noise such that the terms φ_i in (34) are equal to

φ_i(u(t_i)) ∝ exp(−½ (y_i − u(t_i))ᵀ C⁻¹ (y_i − u(t_i))),   (40)

with C = diag(0.1², 0.05²). The initial condition was u(0) = (0, 0) and the data-generating parameters were (a₁, a₂, α, k₁) = (1, 3, 0.5, 1). The times t_i, i = 1, …, 2400, at which data were obtained were taken to be uniformly spaced on [1, 25], in order to capture both the oscillatory behaviour of the system and its steady state. This relatively high frequency of observation, and the corresponding informativeness of the dataset, was intended to pre-empt a similarly high frequency observation process in the calcium signalling model of Section 4.3. Figure S2 displays the dataset.
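To make the setup concrete, the following sketch integrates the g = 2 system at the data-generating parameters reported above, using a fixed-step RK4 integrator as a stand-in for the adaptive solver used in the paper. The Hill exponent ρ = 10 is an illustrative assumption, as its value is not stated in this section.

```python
import numpy as np

def goodwin_rhs(u, a1=1.0, a2=3.0, alpha=0.5, k1=1.0, rho=10.0):
    """Goodwin oscillator with g = 2 species: mRNA u1 and its protein u2.
    rho is an assumed Hill exponent, not a value stated in the text."""
    u1, u2 = u
    return np.array([a1 / (1.0 + a2 * u2 ** rho) - alpha * u1,
                     k1 * u1 - alpha * u2])

def rk4(f, u0, t0, t1, steps):
    """Classical fixed-step Runge-Kutta integration (minimal illustration)."""
    h = (t1 - t0) / steps
    u = np.asarray(u0, dtype=float)
    traj = [u.copy()]
    for _ in range(steps):
        k1_ = f(u)
        k2_ = f(u + h / 2 * k1_)
        k3_ = f(u + h / 2 * k2_)
        k4_ = f(u + h * k3_)
        u = u + h / 6 * (k1_ + 2 * k2_ + 2 * k3_ + k4_)
        traj.append(u.copy())
    return np.array(traj)

# integrate from u(0) = (0, 0) over the observation window [0, 25]
traj = rk4(goodwin_rhs, u0=(0.0, 0.0), t0=0.0, t1=25.0, steps=2500)
```

Synthetic data in the style of the experiment would then be obtained by evaluating the trajectory at the observation times and adding Gaussian noise with covariance C.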
A standard Gaussian prior π(x) was placed on the parameter x and each MCMC method was applied to approximately sample from the posterior P . Exemplar trace plots for the MCMC methods are presented in Figure S3. The overdispersed initial states used for the L chains are reported in Table S2, while the univariate and multivariate convergence diagnostics, computed every 1000 iterations, are shown respectively in Figure S4 and Figure S5. The values of the thresholds δ(L, α, ) are reported in Table S3. For each MCMC method, the estimated burn-in period is presented in Table S4. The GR diagnostic did not fall below the 1 + δ threshold in the allowed number of iterations, which is consistent with the empirical observations of Vats and Knudson (2018).
The additional results for the Goodwin oscillator that we present in this appendix are as follows: • Figures 3 (RW), S7 (MALA) and S8 (P-MALA) display point sets of size m = 20 selected using traditional burn-in and thinning methods, Support Points and Stein Thinning, based on MCMC output. Note that the gray regions are not necessarily regions of high posterior probability; they are the regions explored by the sample path and, moreover, these panels are two-dimensional projections from R⁴. Therefore we are hesitant to draw strong conclusions from these figures.
• Figures 4 (RW), S10 (MALA) and S11 (P-MALA) display the absolute error in estimating the first moment of each parameter, for each of the competing methods, where an extended run from MCMC provided the ground truth.
• Figures S12 (RW), S13 (ADA-RW), S14 (MALA) and S15 (P-MALA) show marginal density estimates, for each parameter and each of the competing methods, where an extended run from MCMC provided the ground truth.

Table S2: Initial states, over-dispersed with respect to the posterior, for the L = 6 independent Markov chains used in the Goodwin oscillator. The parameters used to generate the data were (a₁, a₂, α, k₁) = (1, 3, 0.5, 1).

Figure S4: Univariate convergence diagnostics, for the Goodwin oscillator, plotted against the MCMC iteration number. The black line represents the GR diagnostic (based on L = 6 chains), while the blue and red lines represent the VK diagnostic (based on L = 6 and L = 1 chains, respectively). The dash-dotted (L = 6) and dashed (L = 1) horizontal lines correspond to the critical values δ(L, α, ε), used to determine the burn-in period; see Table S3.

Table S4: Estimated burn-in period for the Goodwin oscillator, using the GR diagnostic based on L chains, b̂_GR,L (L = 6), and the VK diagnostic based on L chains, b̂_VK,L (L = 1, 6). In each case both univariate and multivariate convergence diagnostics are presented; in the univariate case we report the largest value obtained when looking at each of the d parameters individually to estimate the burn-in period. The symbol "> n" indicates the case in which a diagnostic did not go below the 1 + δ threshold.

S5.3 Lotka-Volterra
The Lotka-Volterra model describes the oscillatory evolution of prey (u₁) and predator (u₂) species in a closed environment. The prey has an intrinsic mechanism for growth proportional to its abundance, described by a parameter θ₁ > 0, whilst interaction with the predator leads to a decrease in the prey population at a rate described by a parameter θ₂ > 0. Conversely, the predator has an intrinsic mechanism for decline proportional to its abundance, described by a parameter θ₃ > 0, whilst interaction with the prey leads to an increase in the predator population at a rate described by a parameter θ₄ > 0. The resulting system of ODEs is:

du₁/dt = θ₁ u₁ − θ₂ u₁ u₂,
du₂/dt = θ₄ u₁ u₂ − θ₃ u₂.

To cast this model in the setting of Section 2 we set x ∈ R⁴ to be the vector whose entries are log(θ₁), …, log(θ₄), so that we have a d = 4 dimensional parameter for which inference is performed. The experiment that we report considers synthetic data which are corrupted by Gaussian noise such that the terms φ_i in (34) have expression (40), with C = diag(0.2², 0.2²). The initial condition was u(0) = (1, 1) and the data-generating parameters were x = log(θ), with θ = (0.67, 1.33, 1, 1). The times t_i, i = 1, …, 2400, at which data were obtained were taken to be uniformly spaced on [0, 25]. Figure S16 displays the dataset. A standard Gaussian prior π(x) was used.
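The reparameterisation x = (log θ₁, …, log θ₄) can be illustrated directly. The sketch below (function names are ours) evaluates the Lotka-Volterra vector field at the data-generating parameters; working with log-parameters enforces θ > 0 automatically.

```python
import numpy as np

def lotka_volterra_rhs(u, x):
    """Lotka-Volterra vector field, parameterised by log-parameters x,
    so that theta = exp(x) > 0 is enforced automatically."""
    th1, th2, th3, th4 = np.exp(x)
    u1, u2 = u
    return np.array([th1 * u1 - th2 * u1 * u2,   # prey: growth minus predation
                     th4 * u1 * u2 - th3 * u2])  # predator: predation minus decline

x_true = np.log([0.67, 1.33, 1.0, 1.0])
print(lotka_volterra_rhs((1.0, 1.0), x_true))  # prey decreasing, predator in balance at (1, 1)
```

Gradient-based samplers such as MALA, and the score function required by Stein Thinning, would additionally differentiate the resulting log-posterior with respect to x.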
Exemplar trace plots for the MCMC methods are presented in Figure S17. The overdispersed initial states used for the L chains are the same as those reported in Table S2, while the univariate and multivariate convergence diagnostics, computed every 1000 iterations, are shown respectively in Figure S18 and Figure S19. The values of the thresholds δ(L, α, ) are the same as those reported in Table S3. For each MCMC method, the estimated burn-in period is presented in Table S6.
The additional results for the Lotka-Volterra model that we present in this appendix are as follows:

Figure S16: Data (gray) and ODE solution corresponding to the true data-generating parameters (black) for the Lotka-Volterra model.

Table S6: Estimated burn-in period for the Lotka-Volterra model, using the GR diagnostic based on L chains, b̂_GR,L (L = 5), and the VK diagnostic based on L chains, b̂_VK,L (L = 1, 5). In each case both univariate and multivariate convergence diagnostics are presented; in the univariate case we report the largest value obtained when looking at each of the d parameters individually to estimate the burn-in period. The symbol "> n" indicates the case in which a diagnostic did not go below the 1 + δ threshold.

Figure S18: Univariate convergence diagnostics, for the Lotka-Volterra model, plotted against the MCMC iteration number. The black line represents the GR diagnostic (based on L = 5 chains), while the blue and red lines represent the VK diagnostic (based on L = 5 and L = 1 chains, respectively). The dash-dotted (L = 5) and dashed (L = 1) horizontal lines correspond to the critical values δ(L, α, ε), used to determine the burn-in period; see Table S3.

S5.4 Calcium Signalling Model
This appendix contains a detailed biochemical description of the calcium signalling model studied in Section 4.3 of the main text, together with the experimental dataset that we collected. The Hinch et al. (2004) single-cell model simulates the calcium transient evoked by membrane depolarisation in a cardiac cell. The model has a mathematical representation of the extracellular space and the intracellular compartment consisting of the sarcoplasmic reticulum (SR), dyadic space and cytosol. The major sarcolemmal calcium pathways are included: the L-type Ca channel (LCC), the plasmalemmal membrane calcium ATPase (PMCA) and the sodium-calcium exchanger (NCX). Inside the cell, the model has mathematical representations for calcium release from the SR to the dyadic space through ryanodine receptors (RyR) and re-sequestration of calcium from the dyadic space into the SR by the SR ATPase (SERCA). Calcium buffering is also featured for the cytosol. A schematic representation of the cell model is given in Figure S24.
Membrane depolarisation is triggered by an electrical event. This causes calcium to enter through LCCs into the dyadic space, producing a local rise in Ca concentration sufficient to activate RyRs. This process engages a feedback, whereby Ca release from the SR causes more RyR opening events. As the released Ca diffuses into the cytosol, most of it becomes buffered, but some ions remain free and underpin the Ca transient. Recovery following Ca release is driven by SERCA, which re-sequesters Ca into the SR, and by NCX and PMCA, which extrude calcium across the sarcolemma. This returns the cell to its initial conditions, ready for the next electrical stimulation.
The Hinch model describes the nonlinear, time-dependent interaction of the four Ca handling transporters (LCC, PMCA, RyR and SERCA) and lumped buffering by a system of 7 ODEs that depends on a d = 38 dimensional parameter. The first three differential equations provide a simplified four-state model describing the interaction between LCC and RyR within the dyadic space; only three of these states are simulated, due to a conservation of mass constraint. The remaining four differential equations describe: the calcium concentration in the sarcoplasmic reticulum and the cytosol, the calcium bound to cytosolic buffers, and the calcium current across the cellular membrane. Of these state variables, only the concentration of free calcium in the cytosol and the transmembrane current can be experimentally observed.
To provide a rich dataset for characterising calcium dynamics in a single cardiac myocyte, we applied three experimental protocols in sequence on a single myocyte. During these protocols, we controlled membrane potential and measured membrane currents electrophysiologically and, after appropriate calibration, followed Ca fluorimetrically. The calcium handling proteins were interrogated by relating currents and Ca concentration in response to defined membrane potential manoeuvres, and in the presence of drugs to eliminate various confounding components. The first voltage protocol interrogated LCC currents at different voltages and measured their response in terms of SR release. In the second protocol, a train of depolarisations triggered Ca transients, which provided information about SR release, and their recovery provided a readout of SERCA, NCX and PMCA activities. The third protocol consisted of rapid exposure to caffeine, which emptied the SR and short-circuited SERCA; this provided information about SR load, and the subsequent recovery is a readout of NCX and PMCA. Buffering was calculated from the quotient of the measured Ca rise upon caffeine exposure and the amount of Ca released, back-calculated from the sarcolemmal current generated by NCX. The dataset contains 12998 observations of cytosolic free calcium concentration observed at a 60 Hz sampling frequency, and 22260 transmembrane current observations, both for a duration of 3 minutes. The data are displayed in Figure S25, where the different colours show the three parts of the biological protocol explained above. The calcium signalling model in Figure S24 is represented by a coupled system of 7 ODEs and depends on a d = 38 dimensional parameter, which is to be estimated based on the experimental dataset.
As just described, the data consist of measurements of calcium concentration in the cytoplasm and transmembrane current whilst the cell was externally stimulated, so that only two of the state variables (in our case, u₅ and u₇) were observed (we denote the observations of these states y₅ and y₇, respectively). Our likelihood took the simple Gaussian form

φ_i(u(t_i)) ∝ exp(−(1/(2σ₅²)) (y₅,ᵢ − u₅(t_i))²) × exp(−(1/(2σ₇²)) (y₇,ᵢ − u₇(t_i))²),

with σ₅ = 2.07 × 10⁻⁸ and σ₇ = 1.62 × 10⁻¹⁰. The ODE was numerically solved using CVODES (Hindmarsh et al., 2005) and sensitivities were computed by solving the forward sensitivity equations; see Appendix S3. Further details of the expert-elicited prior, the data pre-processing procedure and numerical details associated with the ODE solver will be reported in a separate manuscript, in preparation as of 12th July 2021, and are available on request.
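Since the two observed states are modelled with independent Gaussian noise, their likelihood factors multiply and the corresponding log-density contributions add. A minimal sketch of one observation time's contribution (the function name is ours, not from the paper's code):

```python
def log_phi(u5, u7, y5, y7, sigma5=2.07e-8, sigma7=1.62e-10):
    """Log-likelihood contribution of one observation time, up to an additive
    constant: the product of the two Gaussian factors becomes a sum of
    quadratic terms on the log scale."""
    return (-0.5 * ((y5 - u5) / sigma5) ** 2
            - 0.5 * ((y7 - u7) / sigma7) ** 2)
```

The full log-likelihood is the sum of these terms over the observation times, and its gradient with respect to the parameters (needed by Stein Thinning) is obtained via the ODE sensitivities.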
In the experiments that follow, RW MCMC was used both to target the posterior P and to target a tempered distribution Q. The latter is equivalent to multiplying the measurement error standard deviations σ₅ and σ₇ by 8, and has the effect of rendering Q more diffuse than P, in order that Q is more favourable for MCMC. The specific value of 8 corresponded to the smallest amount of tempering required to achieve convergence within the available computational budget. A total of n = 4 × 10⁶ iterations were performed, and in each case the first 10⁶ iterations were used to adapt the scale of the Gaussian proposal distribution in the RW sampler, so that an acceptance rate close to 0.234 (Gelman et al., 1997) was achieved. The first 10⁶ iterations were then discarded.
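The tempering admits a one-line sanity check: multiplying σ₅ and σ₇ by a factor τ divides each quadratic term in the Gaussian log-likelihood, and hence the whole contribution (up to an additive constant), by τ²; with τ = 8 this is a factor of 64. A minimal illustration:

```python
def tempered_log_lik(log_lik, temper=8.0):
    """Scaling both noise standard deviations by `temper` is equivalent to
    dividing the Gaussian log-likelihood (up to a constant) by temper^2."""
    return log_lik / temper ** 2

print(tempered_log_lik(-64.0))  # -1.0
```

Flattening the log-likelihood in this way makes Q more diffuse than P, which is what renders Q easier for the RW sampler to explore.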
The additional results for the calcium signalling model that we present in this appendix are as follows: • Figures S26 to S29 contain kernel density estimates for posterior marginals obtained by Stein Thinning applied to tempered RW MCMC output, versus standard RW MCMC output.
• Figures S30 and S31 present results for KSD based on sclmed and smpcov settings, to complement Figure 9 in the main text.

Figure S24: Calcium signalling model; a schematic representation due to Hinch et al. (2004). The model consists of 6 coupled ordinary differential equations and depends upon 38 real-valued parameters that must be estimated from an experimental dataset.

Figure S26: Kernel density estimates for posterior marginals in the calcium signalling model. Stein thinning with med, sclmed and smpcov preconditioners (first three columns) was applied to tempered RW MCMC output (to obtain m = 500 points).

Figure S29: Kernel density estimates for posterior marginals in the calcium signalling model. Stein thinning with med, sclmed and smpcov preconditioners (first three columns) was applied to tempered RW MCMC output (to obtain m = 500 points). These can be contrasted with the last column, where kernel density estimates based on standard RW MCMC are displayed. In each case, two distinct random seeds were used. For reference, the black curve represents the prior marginal.

Figure S31: Calcium signalling model. Kernel Stein discrepancy (KSD) based on smpcov, for empirical distributions obtained using Support Points and Stein Thinning, based on output from RW MCMC applied to either P or a tempered version of P.