Semi-Exact Control Functionals From Sard's Method

The numerical approximation of posterior expected quantities of interest is considered. A novel control variate technique is proposed for post-processing of Markov chain Monte Carlo output, based both on Stein's method and an approach to numerical integration due to Sard. The resulting estimators are proven to be polynomially exact in the Gaussian context, while empirical results suggest the estimators approximate a Gaussian cubature method near the Bernstein-von-Mises limit. The main theoretical result establishes a bias-correction property in settings where the Markov chain does not leave the posterior invariant. Empirical results are presented across a selection of Bayesian inference tasks. All methods used in this paper are available in the R package ZVCV.


Introduction
This paper focuses on the numerical approximation of integrals of the form

I(f) = ∫_{R^d} f(x) p(x) dx,

where f is a function of interest and p is a positive and continuously differentiable probability density on R^d, under the restriction that p and its gradient can only be evaluated pointwise up to an intractable normalisation constant. The standard approach to computing I(f) in this context is to simulate the first n steps of a p-invariant Markov chain (x^{(i)})_{i≥1}, possibly after an initial burn-in period, and to take the average along the sample path as an approximation to the integral:

I_MC(f) = (1/n) Σ_{i=1}^{n} f(x^{(i)}).   (1)

See Chapters 6-10 of Robert & Casella (2013) for background. In this paper E, V and C respectively denote expectation, variance and covariance with respect to the law P of the Markov chain. Under regularity conditions on p that ensure the Markov chain (x^{(i)})_{i≥1} is aperiodic, irreducible and reversible, the convergence of I_MC(f) to I(f) as n → ∞ is described by a central limit theorem,

n^{1/2} {I_MC(f) − I(f)} → N{0, σ(f)^2},   (2)

where convergence occurs in distribution and, if the chain starts in stationarity,

σ(f)^2 = V{f(x^{(1)})} + 2 Σ_{i=2}^{∞} C{f(x^{(1)}), f(x^{(i)})}

is the asymptotic variance of f along the sample path. See Theorem 4.7.7 of Robert & Casella (2013) and more generally Meyn & Tweedie (2012) for theoretical background. Note that for all but the most trivial functions f we have σ(f)^2 > 0 and hence, to achieve an approximation error of O_P(ε), a potentially large number O(ε^{-2}) of calls to f and p are required.

One approach to reduce the computational cost is to employ control variates (Hammersley & Handscomb, 1964; Ripley, 1987), which involves finding an approximation f_n to f that can be exactly integrated under p, such that σ(f − f_n)^2 ≪ σ(f)^2. Given a choice of f_n, the standard estimator (1) is replaced with

I_CV(f) = (1/n) Σ_{i=1}^{n} {f(x^{(i)}) − f_n(x^{(i)})} + ∫ f_n dp,   (3)

where the final term, denoted (∗), is exactly computed. This last requirement makes it challenging to develop control variates for general use, particularly in Bayesian statistics, where often the density p can only be accessed in a form that is un-normalised.
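As a toy illustration of this variance-reduction mechanism (not the method developed in this paper), the following sketch integrates f(x) = x^2 + sin(x) against a standard normal density, using the hand-picked control variate f_n(x) = x^2, whose integral under N(0, 1) is available in closed form; the integrand, the sample size and f_n are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)          # i.i.d. stand-in for Markov chain output
f = lambda t: t**2 + np.sin(t)           # integrand; I(f) = 1 under N(0, 1)

# Control variate: f_n(t) = t**2 integrates exactly to 1 under N(0, 1).
f_n = lambda t: t**2
I_cv = np.mean(f(x) - f_n(x)) + 1.0      # averaged residual plus exactly-integrated part
I_mc = np.mean(f(x))

# The residual f - f_n (here, sin) has much smaller variance than f itself.
assert np.var(f(x) - f_n(x)) < np.var(f(x))
```

The residual sin(x) has variance roughly 0.43, compared with roughly 2.43 for f itself, so the control variate estimator is markedly more precise at the same sample size.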
In the Bayesian context, Assaraf & Caffarel (1999), Mira et al. (2013) and Oates et al. (2017) addressed this challenge by using f_n = c_n + Lg_n, where c_n ∈ R, g_n is a user-chosen parametric or nonparametric function and L is an operator, for example the Langevin Stein operator (Stein, 1972; Gorham & Mackey, 2015), that depends on p through its gradient and satisfies ∫(Lg_n) dp = 0 under conditions such as those in Lemma 1. The convergence of I_CV(f) to I(f) has been studied under (strong) regularity conditions and, in particular: (i) if g_n is chosen parametrically, then in general lim inf σ(f − f_n)^2 > 0, so that, even if asymptotic variance is reduced, convergence rates are unaffected; (ii) if g_n is chosen in an appropriate non-parametric manner then lim sup σ(f − f_n)^2 = 0 and a smaller number O(ε^{-2+δ}), 0 < δ < 2, of calls to f, p and its gradient are required to achieve an approximation error of O_P(ε) for the integral (see Oates et al., 2019; Mijatović & Vogrinc, 2018; Barp et al., 2021; Belomestny et al., 2017, 2019, 2020). In the parametric case Lg_n is called a control variate, while in the non-parametric case it is called a control functional. Practical parametric approaches to the choice of g_n have been well-studied in the Bayesian context, typically based on polynomial regression models (Assaraf & Caffarel, 1999; Mira et al., 2013; Papamarkou et al., 2014; Oates et al., 2016; Brosse et al., 2019), but neural networks have also been proposed recently (Wan et al., 2019; Si et al., 2020). In particular, existing control variates based on polynomial regression have the attractive property of being semi-exact, meaning that there is a well-characterized set of functions f ∈ F for which f_n can be shown to exactly equal f after a finite number of samples n have been obtained. For the control variates of Assaraf & Caffarel (1999) and Mira et al. (2013) the set F contains certain low order polynomials when p is a Gaussian distribution on R^d.
Those authors term their control variates zero variance, but we prefer the term semi-exact since a general integrand f will not be an element of F. Regardless of terminology, semi-exactness of the control variate is an appealing property because it implies that the approximation I_CV(f) to I(f) is exact on F. Intuitively, the performance of the control variate method is related to the richness of the set F on which it is exact. For example, polynomial exactness of cubature rules is used to establish their high order convergence rates using a Taylor expansion argument (e.g. Hildebrand, 1987, Chapter 8). The development of non-parametric approaches to the choice of g_n has to date focused on kernel methods (Oates et al., 2017; Barp et al., 2021), piecewise constant approximations (Mijatović & Vogrinc, 2018) and non-linear approximations based on selecting basis functions from a dictionary (Belomestny et al., 2017; South et al., 2019). Theoretical analysis of non-parametric control variates was provided in the papers cited above but, compared to parametric methods, practical implementations of non-parametric methods are less well-developed.
In this paper we propose a semi-exact control functional method. This constitutes the best of both worlds, where at small n the semi-exactness property promotes stability and robustness of the estimator I_CV(f), while at large n the non-parametric regression component can be used to accelerate the convergence of I_CV(f) to I(f). In particular we argue that, in the Bernstein-von-Mises limit, the set F on which our method is exact is precisely the set of low order polynomials, so that our method can be considered as an approximately polynomially-exact cubature rule developed for the Bayesian context. Furthermore, we establish a bias-correcting property, which guarantees the approximations produced using our method are consistent in certain settings where the Markov chain is not p-invariant.
Our motivation comes from the approach to numerical integration due to Sard (1949). Many numerical integration methods are based on constructing an approximation f_n to the integrand f that can be exactly integrated. In this case the integral I(f) is approximated using (∗) in (3). In Gaussian and related cubatures, the function f_n is chosen in such a way that polynomial exactness is guaranteed (Gautschi, 2004, Section 1.4). On the other hand, in kernel cubature and related approaches, f_n is an element of a reproducing kernel Hilbert space chosen such that an error criterion is minimised (Larkin, 1970). The contribution of Sard was to combine these two concepts in numerical integration by choosing f_n to enforce exactness on a low-dimensional space F of functions and use the remaining degrees of freedom to find a minimum-norm interpolant to the integrand.

Sard's Method
Many popular methods for numerical integration are based on either (i) enforcing exactness of the integral estimator on a finite-dimensional set of functions F, typically a linear space of polynomials, or (ii) integration of a minimum-norm interpolant selected from an infinite-dimensional set of functions H. In each case, the result is a cubature method of the form

I_NI(f) = Σ_{i=1}^{n} w_i f(x^{(i)}).   (4)

Classical examples of methods in the former category include univariate Gaussian quadrature rules (Gautschi, 2004, Section 1.4), which are determined by the unique points and weights for which (4) is exact whenever f is a polynomial of order at most 2n − 1, and Clenshaw-Curtis rules (Clenshaw & Curtis, 1960). Methods in the minimum-norm interpolant category specify a suitable normed space (H, ‖·‖_H) of functions, construct an interpolant f_n ∈ H such that

f_n(x^{(i)}) = f(x^{(i)}), i = 1, . . . , n,   (5)

and use the integral of f_n to approximate the true integral. Specific examples include splines (Wahba, 1990) and kernel or Gaussian process based methods (Larkin, 1970; O'Hagan, 1991; Briol et al., 2019). If the set of points {x^{(i)}}_{i=1}^{n} is fixed, the cubature method in (4) has n degrees of freedom corresponding to the choice of the weights {w_i}_{i=1}^{n}. The approach proposed by Sard (1949) is a hybrid of the two classical approaches just described, calling for m ≤ n of these degrees of freedom to be used to ensure that I_NI(f) is exact for f in a given m-dimensional linear function space F and, if m < n, allocating the remaining n − m degrees of freedom to select a minimal norm interpolant from a large class of functions H.

The approach of Sard is therefore exact for functions in the finite-dimensional set F and, at the same time, suitable for the integration of functions in the infinite-dimensional set H. Further background on Sard's method can be found in Larkin (1974) and Karvonen et al. (2018).
However, it is difficult to implement Sard's method, or indeed any of the classical approaches just discussed, in the Bayesian context, since
1. the density p can be evaluated pointwise only up to an intractable normalization constant;
2. to construct weights one needs to evaluate the integrals of basis functions of F and of the interpolant f_n, which can be as difficult as evaluating the original integral.
To circumvent these issues, in this paper we propose to combine Sard's approach to integration with Stein operators (Stein, 1972;Gorham & Mackey, 2015), thus eliminating the need to access normalization constants and to exactly evaluate integrals.

Stein Operators
Let ∇_x denote the gradient and ∆_x denote the Laplacian ∆_x = ∇_x · ∇_x. Let ‖x‖ = (x · x)^{1/2} denote the Euclidean norm on R^d. The construction that enables us to realize Sard's method in the Bayesian context is the Langevin Stein operator L (Gorham & Mackey, 2015) on R^d, defined for sufficiently regular g and p as

Lg = ∆_x g + ∇_x log p · ∇_x g.   (6)

We refer to L as a Stein operator due to the use of equations of the form (6) (up to a simple substitution) in the method of Stein (1972) for assessing convergence in distribution, and due to its property of producing functions whose integrals with respect to p are zero under suitable conditions, such as those described in Lemma 1.
Lemma 1. Let g : R^d → R be twice continuously differentiable and suppose that

∮_{∂B_r} p(x) ∇_x g(x) · n(x) dS(x) → 0 as r → ∞,

where ∂B_r is the sphere of radius r centred at the origin and n is its outward unit normal. Then ∫(Lg) dp = 0, where L is the Stein operator in (6). The proof is provided in Appendix 1.

Although our attention is limited to (6), the choice of Stein operator is not unique and other Stein operators can be derived using the generator method of Barbour (1988) or using Schrödinger Hamiltonians (Assaraf & Caffarel, 1999). Contrary to the standard requirements for a Stein operator, the operator L in control functionals does not need to fully characterize convergence and, as a consequence, a broader class of functions g can be considered than in more traditional applications of Stein's method (Stein, 1972).
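The zero-integral property can be checked numerically on a simple example; the sketch below assumes d = 1, p = N(0, 1) (so that (d/dx) log p(x) = −x) and the arbitrary choice g(x) = sin(x), for which Lg(x) = −sin(x) − x cos(x).

```python
import numpy as np

# Numerical sanity check of Lemma 1 for p = N(0, 1) and g(x) = sin(x):
# Lg = g'' + (log p)' g' = -sin(x) - x cos(x) should integrate to zero under p.
def Lg(x):
    return -np.sin(x) - x * np.cos(x)

xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
p = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
integral = np.sum(Lg(xs) * p) * dx     # Riemann sum; tails beyond ±10 are negligible
assert abs(integral) < 1e-6
```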

It follows that, if the conditions of Lemma 1 are satisfied by g_n : R^d → R, the integral of a function of the form f_n = c_n + Lg_n is simply c_n, the constant. The main challenge in developing control variates, or functionals, based on Stein operators is therefore to find a function g_n such that the asymptotic variance σ(f − f_n)^2 is small. To explicitly minimize asymptotic variance, Mijatović & Vogrinc (2018), Belomestny et al. (2020) and Brosse et al. (2019) restricted attention to particular Metropolis-Hastings or Langevin samplers for which asymptotic variance can be explicitly characterized. The minimization of empirical variance has also been proposed and studied in the case where samples are independent (Belomestny et al., 2017) and dependent (Belomestny et al., 2019, 2020). For an approach that is not tied to a particular Markov kernel, authors such as Assaraf & Caffarel (1999) and Mira et al. (2013) proposed to minimize mean squared error along the sample path, which corresponds to the case of an independent sampling method. In a similar spirit, the constructions in Oates et al. (2017, 2019) and Barp et al. (2021) were based on a minimum-norm interpolant, where the choice of norm is decoupled from the mechanism by which the points are sampled.

The Proposed Method
In this section we first construct an infinite-dimensional space H and a finite-dimensional space F of functions; these will underpin the proposed semi-exact control functional method.
For the infinite-dimensional component, let k : R^d × R^d → R be a symmetric and positive-definite kernel. Recall that such a k induces a unique reproducing kernel Hilbert space H(k). This is a Hilbert space that consists of functions g : R^d → R and is equipped with an inner product ⟨·, ·⟩_{H(k)}. The kernel k is such that k(·, x) ∈ H(k) for all x ∈ R^d and it is reproducing in the sense that ⟨g, k(·, x)⟩_{H(k)} = g(x) for any g ∈ H(k) and x ∈ R^d. For α ∈ N_0^d the multi-index notation x^α := x_1^{α_1} · · · x_d^{α_d} and |α| = α_1 + · · · + α_d will be used. If k is twice continuously differentiable in the sense of Steinwart & Christmann (2008, Definition 4.35), meaning that the partial derivatives of k up to order two in each argument exist and are continuous, then

k_0(x, y) := L_x L_y k(x, y),   (7)

where L_x stands for application of the Stein operator defined in (6) with respect to variable x, is a well-defined and positive-definite kernel (Steinwart & Christmann, 2008, Lemma 4.34). The kernel in (7) can be written in closed form, displayed as (8), in terms of the derivatives of k up to second order in each argument and the gradient ∇_x log p. If k is radial then (8) can be simplified; see Appendix 2.
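To make (7) concrete, the sketch below builds k_0 = L_x L_y k by nested central finite differences, assuming d = 1, p = N(0, 1) and a Gaussian base kernel; this is an illustration only (in practice the closed form (8) would be used), and the reference value k_0(0, 0) = 3 was obtained by hand for this specific choice of k and p.

```python
import numpy as np

# Sketch: Stein kernel k_0(x, y) = L_x L_y k(x, y) of (7) via nested finite
# differences, assuming d = 1, p = N(0, 1) (so (d/dx) log p(x) = -x) and the
# Gaussian base kernel k(x, y) = exp{-(x - y)^2 / 2}.
def k(x, y):
    return np.exp(-0.5 * (x - y) ** 2)

def stein_op(h, t, eps=1e-3):
    # L h = h'' + (log p)' h', with (log p)'(t) = -t
    d1 = (h(t + eps) - h(t - eps)) / (2 * eps)
    d2 = (h(t + eps) - 2 * h(t) + h(t - eps)) / eps ** 2
    return d2 - t * d1

def k0(x, y):
    return stein_op(lambda s: stein_op(lambda t: k(s, t), y), x)

pts = [-1.0, 0.0, 0.5, 2.0]
K0 = np.array([[k0(a, b) for b in pts] for a in pts])
assert np.allclose(K0, K0.T, atol=1e-2)     # k_0 is a symmetric kernel
assert abs(k0(0.0, 0.0) - 3.0) < 0.05       # hand-derived value for this k and p
```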

Lemma 2 establishes conditions under which the functions x ↦ k_0(x, y), y ∈ R^d, and hence elements of the Hilbert space H(k_0) reproduced by k_0, have zero integral; its hypothesis on k and p is stated as condition (9). The proof is provided in Appendix 4. The infinite-dimensional space H used in this work is exactly the reproducing kernel Hilbert space H(k_0). The basic mathematical properties of k_0 and the Hilbert space it reproduces are contained in Appendix 3 and these can be used to inform the selection of an appropriate kernel.
For the finite-dimensional component, let Φ be a linear space of twice continuously differentiable functions with dimension m − 1, m ∈ N, and a basis {φ_i}_{i=1}^{m−1}. Then define the space obtained by applying the differential operator (6) to Φ as LΦ = span{Lφ_1, . . . , Lφ_{m−1}}. If the pre-conditions of Lemma 1 are satisfied for each basis function g = φ_i, then linearity of the Stein operator implies that ∫(Lφ) dp = 0 for every φ ∈ Φ. Typically we will select Φ = P^r, the polynomial space P^r = span{x^α : α ∈ N_0^d, 0 < |α| ≤ r} for some positive integer r. Note that constant functions are excluded from P^r since they are in the null space of L; when required we let P_0^r = span{1} ⊕ P^r denote the larger space with the constant functions included. The finite-dimensional space F is then taken to be F = span{1} ⊕ LΦ = span{1, Lφ_1, . . . , Lφ_{m−1}}.
It is now possible to state the proposed method. Following Sard, we approximate the integrand f with a function f_n that interpolates f at the locations x^{(i)}, is exact on the m-dimensional linear space F, and minimises a particular (semi-)norm subject to the first two constraints. It will occasionally be useful to emphasise the dependence of f_n on f using the notation f_n(·) = f_n(·; f). The proposed interpolant takes the form

f_n(x) = Σ_{i=1}^{n} a_i k_0(x, x^{(i)}) + b_1 + Σ_{i=2}^{m} b_i (Lφ_{i−1})(x),   (10)

where the coefficients a = (a_1, . . . , a_n) ∈ R^n and b = (b_1, . . . , b_m) ∈ R^m are selected such that the following two conditions hold:

f_n(x^{(i)}; f) = f(x^{(i)}), i = 1, . . . , n, and f_n(·; f) = f whenever f ∈ F.   (11)

Since F is m-dimensional, these requirements correspond to a total of n + m constraints. Under weak conditions, discussed in Section 2.5, the total number of degrees of freedom due to selection of a and b is equal to n + m and the above constraints can be satisfied. Furthermore, the corresponding function f_n can be shown to minimise a particular (semi-)norm on a larger space of functions, subject to the interpolation and exactness constraints (to limit scope, we do not discuss this characterisation further; the semi-norm is defined in (17) and the reader can find full details in Wendland, 2004, Theorem 13.1). Figure 1 illustrates one such interpolant. The proposed estimator of the integral,

I_SECF(f) := ∫ f_n(x) p(x) dx,

is then a special case of (3) (the interpolation condition causes the first term in (3) to vanish) that we call a semi-exact control functional. The following is immediate from (10) and (11):

Corollary 1. Under the hypotheses of Lemma 1 for each g = φ_i, i = 1, . . . , m − 1, and Lemma 2, it holds that, whenever the estimator I_SECF(f) is well-defined, I_SECF(f) = b_1, where b_1 is the constant term in (10).

The earlier work of Assaraf & Caffarel (1999) and Mira et al.
(2013) corresponds to a = 0, while setting b = 0 in (10) and ignoring the semi-exactness requirement recovers the unique minimum-norm interpolant, in the sense of (5), in the Hilbert space H(k_0) that k_0 reproduces. The work of Oates et al. (2017) corresponds to b_i = 0 for i = 2, . . . , m. It is therefore clear that the proposed approach is a strict generalization of existing work and can be seen as a compromise between semi-exactness and minimum-norm interpolation.

Polynomial Exactness in the Bernstein-von-Mises Limit
A central motivation for our approach is the prototypical case where p is the density of a posterior distribution P_{x|y_1,...,y_N} for a latent variable x given independent and identically distributed data y_1, . . . , y_N ∼ P_{y_1,...,y_N|x}. Under regularity conditions discussed in Section 10.2 of van der Vaart (1998), the Bernstein-von-Mises theorem states that

‖P_{x|y_1,...,y_N} − N{x̂_N, N^{-1} I(x̂_N)^{-1}}‖_TV → 0,

where x̂_N is a maximum likelihood estimate for x, I(x) is the Fisher information matrix evaluated at x, ‖·‖_TV is the total variation norm and convergence is in probability as N → ∞ with respect to the law P_{y_1,...,y_N|x} of the dataset. In this limit, polynomial exactness of the proposed method can be established. Indeed, for a Gaussian density p with mean x̂_N ∈ R^d and precision N I(x̂_N) we have ∇_x log p(x) = −N I(x̂_N)(x − x̂_N), so that, for example,

L x_i = −(N/2) P_i(x), where P_i(x) = 2 e_i^T I(x̂_N)(x − x̂_N)

and e_i is the ith coordinate vector in R^d. This allows us to obtain the following result, whose proof is provided in Appendix 5:

Lemma 3. Consider the Bernstein-von-Mises limit and suppose that the Fisher information matrix I(x̂_N) is non-singular. Then, for the choice Φ = P^r, r ∈ N, the estimator I_SECF is exact on F = P_0^r.

Thus the proposed estimator is polynomially exact up to order r in the Bernstein-von-Mises limit. We believe this property can confer robustness of the estimator in a broad range of applied contexts. At finite N, when the limit has not been reached, the above argument can only be expected to hold approximately.

Computation for the Proposed Method
Define the n × m matrix

[P]_{i,1} = 1, [P]_{i,j} = (Lφ_{j−1})(x^{(i)}), j = 2, . . . , m,   (12)

which is sometimes called a Vandermonde (or alternant) matrix corresponding to the linear space F. Let K_0 be the n × n matrix with entries [K_0]_{i,j} = k_0(x^{(i)}, x^{(j)}) and let f be the n-dimensional column vector with entries [f]_i = f(x^{(i)}).

Lemma 4. Let the n ≥ m points x^{(i)} be distinct and F-unisolvent, meaning that the matrix P in (12) has full rank. Let k_0 be a positive-definite kernel for which (9) is satisfied. Then I_SECF(f) is well-defined and the coefficients a and b are given by the solution of the linear system

[ K_0  P ] [ a ]   [ f ]
[ P^T  0 ] [ b ] = [ 0 ].   (13)

In particular,

I_SECF(f) = b_1 = e_1^T (P^T K_0^{-1} P)^{-1} P^T K_0^{-1} f.   (14)

The proof is provided in Appendix 6. Notice that (14) is a linear combination of the values in f and therefore the proposed estimator is recognized as a cubature method of the form (4) with weights

w^T = e_1^T (P^T K_0^{-1} P)^{-1} P^T K_0^{-1}.   (15)

The requirement in Lemma 4 for the x^{(i)} to be distinct precludes, for example, the direct use of Metropolis-Hastings output. However, as emphasized in Oates et al. (2017) for control functionals and studied further in Liu & Lee (2017) and Hodgkinson et al. (2020), the consistency of I_SECF does not require that the Markov chain is p-invariant. It is therefore trivial to, for example, filter out duplicate states from Metropolis-Hastings output.
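The construction in Lemma 4 can be sketched end-to-end in one dimension. The sketch below assumes p = N(0, 1), Φ = P^1 (so φ_1(t) = t, Lφ_1(t) = −t and F = span{1, x}) and a Gaussian base kernel; the closed-form Stein kernel k_0 = L_x L_y k used here was derived by hand for this specific choice and is an assumption of the sketch, as is the small diagonal jitter added for numerical stability.

```python
import numpy as np

# Hand-derived Stein kernel for k(x, y) = exp{-(x - y)^2 / 2} and p = N(0, 1).
def k0(s, t):
    r = s - t
    k = np.exp(-0.5 * r ** 2)
    m1 = (3 * r - t - r ** 3 + t * r ** 2) * k              # d/ds of L_t k
    m2 = (3 - 6 * r ** 2 + 3 * t * r + r ** 4 - t * r ** 3) * k
    return m2 - s * m1                                      # L_s L_t k

x = np.linspace(-3.0, 3.0, 12)                              # distinct, unisolvent points
n = x.size
K0 = k0(x[:, None], x[None, :]) + 1e-8 * np.eye(n)          # jitter for stability
P = np.column_stack([np.ones(n), -x])                       # basis {1, L phi_1} of F
m = P.shape[1]
A = np.block([[K0, P], [P.T, np.zeros((m, m))]])            # saddle-point system of Lemma 4

def secf(fvals):
    sol = np.linalg.solve(A, np.concatenate([fvals, np.zeros(m)]))
    return sol[n]                                           # b_1 = I_SECF(f)

# Semi-exactness: the estimator is exact for integrands in F = span{1, x}.
assert abs(secf(2.0 + 0.5 * x) - 2.0) < 1e-5
# A non-trivial integrand: I(x^2) = 1 under N(0, 1).
est = secf(x ** 2)
```

Note that the jitter does not affect semi-exactness: when the integrand lies in F, the solution has a = 0 exactly, so the kernel block plays no role.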
The solution of linear systems of equations defined by an n × n matrix K_0 and an m × m matrix P^T K_0^{-1} P entails a computational cost of O(n^3 + m^3). In some situations this cost may still be smaller than the cost associated with evaluation of f and p, but in general this computational requirement limits the applicability of the method just described. In Appendix 7 we therefore propose a computationally efficient approximation, I_ASECF, to the full method, based on a combination of the Nyström approximation (Williams & Seeger, 2001) and the well-known conjugate gradient method, inspired by the recent work of Rudi et al. (2017). All proposed methods are implemented in the R package ZVCV (South, 2020).


Empirical Assessment

In this section the proposed method is assessed empirically: Sections 3.2 and 3.3 consider settings where non-parametric control functional methods perform well; Section 3.4 considers a setting where parametric control variate methods are known to be successful. In each case we determine whether or not the proposed semi-exact control functional method is competitive with the state-of-the-art.
Specifically, we compared the following estimators, which are all instances of I_CV in (3) for a particular choice of f_n, which may or may not be an interpolant:
- standard Monte Carlo integration, (1), based on Markov chain output;
- the control functional estimator I_CF recommended in Oates et al. (2017);
- the zero variance polynomial control variate method I_ZV of Assaraf & Caffarel (1999) and Mira et al. (2013);
- the proposed semi-exact control functional estimator I_SECF in (14);
- an approximation, I_ASECF, of (14) based on the Nyström approximation and the conjugate gradient method, described in Appendix 7.
Open-source software for implementing all of the above methods is available in the R package ZVCV (South, 2020). The same sets of n samples were used for all estimators, in both the construction of f_n and the evaluation of I_CV. For methods where there is a fixed polynomial basis we considered only orders r = 1 and r = 2, following the recommendation of Mira et al. (2013). For kernel-based methods, duplicate values of x^{(i)} were removed (as discussed in Section 2.5) and Frobenius regularization was employed whenever the condition number of the kernel matrix K_0 was close to machine precision (Higham, 1988). Several choices of kernel were considered, but for brevity in the main text we focus on the rational quadratic kernel k(x, y; λ) = {1 + λ^{-2}‖x − y‖^2}^{-1}. This kernel was found to provide the best performance across a range of experiments; a comparison to the Matérn and Gaussian kernels is provided in Appendix 8. The parameter λ was selected using 5-fold cross-validation, based again on performance across a spectrum of experiments; a comparison to the median heuristic (Garreau et al., 2017) is presented in Appendix 8.
To ensure that our assessment is practically relevant, the estimators were compared on the basis of both statistical and computational efficiency relative to the standard Monte Carlo estimator. Statistical efficiency E(I_CV) and computational efficiency C(I_CV) of an estimator I_CV of the integral I are defined as

E(I_CV) = E{(I_MC − I)^2} / E{(I_CV − I)^2}, C(I_CV) = (T_MC / T_CV) E(I_CV),

where T_CV denotes the combined wall time for sampling the x^{(i)} and computing the estimator I_CV, and T_MC is the corresponding wall time for the standard estimator. For the results reported below, E and C were approximated using averages Ê and Ĉ over 100 realizations of the Markov chain output.
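Assuming statistical efficiency is a ratio of mean squared errors and computational efficiency discounts it by relative wall time (our reading of the definitions above), the computation from replicate estimates might be sketched as follows; the replicate values and timings below are synthetic stand-ins, not results from the paper.

```python
import numpy as np

# Hypothetical replicate estimates, standing in for 100 Markov chain realizations.
rng = np.random.default_rng(2)
I_true = 1.0
mc = I_true + 0.10 * rng.standard_normal(100)    # standard Monte Carlo replicates
cv = I_true + 0.01 * rng.standard_normal(100)    # control-variate replicates
T_mc, T_cv = 1.0, 2.5                            # assumed wall times (seconds)

E_hat = np.mean((mc - I_true) ** 2) / np.mean((cv - I_true) ** 2)
C_hat = (T_mc / T_cv) * E_hat
assert E_hat > 1.0       # variance reduction achieved
assert C_hat < E_hat     # efficiency discounted by the extra wall time
```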

Gaussian Example

Here we consider a Gaussian integral that serves as an analytically tractable caricature of a posterior near to the Bernstein-von-Mises limit. This enables us to assess the effect of the sample size n and dimension d on each estimator, in a setting that is not confounded by the idiosyncrasies of any particular MCMC method. Specifically, we take p to be a Gaussian density on R^d. For the parametric component we set Φ = P^r, so that (from Lemma 3) I_SECF is exact on polynomials of order at most r; this holds also for I_ZV.
The integrand f was chosen in order that the integral is analytically tractable (I(f) = 1) and that no method will be exact. Figure 2 displays the statistical efficiency of each estimator for 10 ≤ n ≤ 1000 and 3 ≤ d ≤ 100. Computational efficiency is not shown since exact sampling from p in this example is trivial. The proposed semi-exact control functional method performs consistently well compared to its competitors for this non-polynomial integrand. Unsurprisingly, the best improvements are for high n and small d, where the proposed method results in a statistical efficiency over 100 times better than the baseline estimator and up to 5 times better than the next best method.

Capture-Recapture Example
The two remaining examples, here and in Section 3.4, are applications of Bayesian statistics described in South et al. (2019). In each case the aim is to estimate expectations with respect to a posterior distribution P_{x|y} of the parameters x of a statistical model based on y, an observed dataset. Samples x^{(i)} were obtained using the Metropolis-adjusted Langevin algorithm (Roberts & Tweedie, 1996), which is a Metropolis-Hastings algorithm with proposal N(x^{(i−1)} + (h^2/2) Σ ∇_x log P_{x|y}(x^{(i−1)} | y), h^2 Σ).
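The proposal above can be sketched as a generic Metropolis-adjusted Langevin step; the standard normal target, step size and identity pre-conditioner below are placeholder assumptions standing in for the posterior, h and Σ of the experiments.

```python
import numpy as np

# One Metropolis-adjusted Langevin step for a generic (unnormalised) log-density.
def mala_step(x, log_p, grad_log_p, h, Sigma, rng):
    L = np.linalg.cholesky(Sigma)
    Sinv = np.linalg.inv(Sigma)
    prop = x + 0.5 * h**2 * Sigma @ grad_log_p(x) + h * L @ rng.standard_normal(x.size)

    def log_q(b, a):  # log proposal density q(b | a), up to shared constants
        d = b - (a + 0.5 * h**2 * Sigma @ grad_log_p(a))
        return -0.5 * d @ Sinv @ d / h**2

    log_alpha = log_p(prop) - log_p(x) + log_q(x, prop) - log_q(prop, x)
    return prop if np.log(rng.uniform()) < log_alpha else x

rng = np.random.default_rng(3)
log_p = lambda x: -0.5 * x @ x     # placeholder target: standard normal in 2-d
grad = lambda x: -x
x = np.zeros(2)
chain = np.empty((2000, 2))
for i in range(2000):
    x = mala_step(x, log_p, grad, 0.7, np.eye(2), rng)
    chain[i] = x
assert np.all(np.isfinite(chain))
```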
Step sizes of h = 0.72 for the capture-recapture example and h = 0.3 for the sonar example (see Section 3.4) were selected, and an empirical approximation of the posterior covariance matrix was used as the pre-conditioner Σ ∈ R^{d×d}. Since the proposed method does not rely on the Markov chain being P_{x|y}-invariant, we also repeated these experiments using the unadjusted Langevin algorithm (Parisi, 1981; Ermak, 1975), with similar results reported in Appendix 9.

In this first example, a Cormack-Jolly-Seber capture-recapture model (Lebreton et al., 1992) is used to model data on the capture and recapture of the bird species Cinclus cinclus (Marzolin, 1988). The integrands of interest are the marginal posterior means f_i(x) = x_i for i = 1, . . . , 11, where x = (φ_1, . . . , φ_5, p_2, . . . , p_6, φ_6 p_7), φ_j is the probability of survival from year j to j + 1 and p_j is the probability of being captured in year j. The likelihood is of standard Cormack-Jolly-Seber form (see South et al., 2019) and the data y consist of D_i, the number of birds released in year i, and y_{ik}, the number of animals caught in year k out of the number released in year i, for i = 1, . . . , 6 and k = 2, . . . , 7. Following South et al. (2019), parameters are transformed to the real line using x̃_j = log{x_j/(1 − x_j)} and the adjusted prior density for x̃_j is exp(x̃_j)/{1 + exp(x̃_j)}^2, for j = 1, . . . , 11.

South et al. (2019) found that non-parametric methods outperform standard parametric methods for this 11-dimensional example. The estimator I_SECF combines elements of both approaches, so there is interest in determining how the method performs. It is clear from Figure 3 that all variance reduction approaches are helpful in improving upon the vanilla Monte Carlo estimator in this example. The best improvement in terms of statistical and computational efficiency is offered by I_SECF, which also has similar performance to I_CF.
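As a quick check of the transformed prior used above (which, on our reading, is a uniform prior on (0, 1) pushed through the logit transform), the adjusted density exp(t)/{1 + exp(t)}^2 should integrate to one:

```python
import numpy as np

# The adjusted prior density on the transformed scale t = log{x / (1 - x)} is
# the standard logistic density; verify numerically that it integrates to one.
ts = np.linspace(-40.0, 40.0, 400_001)
dt = ts[1] - ts[0]
dens = np.exp(ts) / (1.0 + np.exp(ts)) ** 2
assert abs(np.sum(dens) * dt - 1.0) < 1e-6
```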

Sonar Example

Our final application is a 61-dimensional logistic regression example using data from Gorman & Sejnowski (1988) and Dheeru & Karra Taniskidou (2017). To use standard regression notation, the parameters are denoted β ∈ R^{61}, the matrix of covariates in the logistic regression model is denoted X ∈ R^{208×61}, where the first column is all 1's to fit an intercept, and the response is denoted y ∈ R^{208}. In this application, X contains information related to energy frequencies reflected from either a metal cylinder (y = 1) or a rock (y = 0). The log likelihood for this model is

ℓ(β) = y^T Xβ − Σ_{i=1}^{208} log{1 + exp(x_i^T β)},

where x_i^T denotes the ith row of X. We use a N(0, 5^2) prior for the predictors (after standardising the covariates to have standard deviation 0.5) and a N(0, 20^2) prior for the intercept, following South et al. (2019) and Chopin & Ridgway (2017), but we focus on estimating the more challenging integrand f(β) = {1 + exp(−Xβ)}^{-1}, which can be interpreted as the probability that observed covariates X emanate from a metal cylinder. The gold standard of I ≈ 0.4971 was obtained from a 10 million iteration Metropolis-Hastings (Hastings, 1970) run with a multivariate normal random walk proposal.

Figure 4 illustrates the statistical and computational efficiency of estimators for various n in this example. It is interesting to note that I_SECF and I_ASECF offer similar statistical efficiency to I_ZV, especially given the poor relative performance of I_CF. Since it is inexpensive to obtain the n samples using the Metropolis-adjusted Langevin algorithm in this example, I_ZV and I_ASECF are the only approaches which offer improvements in computational efficiency over the baseline estimator for the majority of n values considered, and even in these instances the improvements are marginal.
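The logistic log likelihood above and its gradient (the ingredient required by gradient-based samplers and Stein operators) can be sketched as follows; X and y here are small synthetic stand-ins, not the sonar data, and the gradient is checked against finite differences.

```python
import numpy as np

# Standard logistic-regression log likelihood and its gradient.
def log_lik(beta, X, y):
    eta = X @ beta
    return y @ eta - np.sum(np.logaddexp(0.0, eta))   # stable log(1 + exp(eta))

def grad_log_lik(beta, X, y):
    eta = X @ beta
    return X.T @ (y - 1.0 / (1.0 + np.exp(-eta)))

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 3))])  # synthetic design
y = (rng.uniform(size=50) < 0.5).astype(float)                    # synthetic labels
beta = 0.1 * rng.standard_normal(4)

# Finite-difference check of the gradient.
eps = 1e-6
g = grad_log_lik(beta, X, y)
for j in range(4):
    e = np.zeros(4); e[j] = eps
    fd = (log_lik(beta + e, X, y) - log_lik(beta - e, X, y)) / (2 * eps)
    assert abs(fd - g[j]) < 1e-4
```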

Theoretical Properties and Convergence Assessment
4.1. Finite Sample Error and a Practical Diagnostic

The performance of the proposed method can be monitored using the finite sample error bound provided in Proposition 1. Proposition 1 makes use of the semi-norm

|f|_{k_0,F} := inf{ ‖g‖_{H(k_0)} : f = ψ + g, ψ ∈ F, g ∈ H(k_0) },   (17)

which is well-defined when the infimum is taken over a non-empty set; otherwise |f|_{k_0,F} := ∞.

Proposition 1. Let the hypotheses of Corollary 1 hold. Then the integration error satisfies the bound

|I_SECF(f) − I(f)| ≤ (w^T K_0 w)^{1/2} |f|_{k_0,F},   (18)

where the weights w, defined in (15), satisfy

w ∈ argmin_{v ∈ R^n} v^T K_0 v subject to Σ_{i=1}^{n} v_i = 1 and Σ_{i=1}^{n} v_i (Lφ_j)(x^{(i)}) = 0, j = 1, . . . , m − 1.

The proof is provided in Appendix 11. The first quantity |f|_{k_0,F} in (18) can be approximated by |f_n|_{k_0,F} when f_n is a reasonable approximation for f, and this in turn can be bounded as |f_n|_{k_0,F} ≤ (a^T K_0 a)^{1/2}. The finiteness of |f|_{k_0,F} ensures the existence of a solution to the Stein equation, sufficient conditions for which are discussed in Mackey & Gorham (2016) and Si et al. (2020). The second quantity (w^T K_0 w)^{1/2} in (18) is computable and is recognized as a kernel Stein discrepancy between the empirical measure Σ_{i=1}^{n} w_i δ(x^{(i)}) and the distribution whose density is p, based on the Stein operator L (Chwialkowski et al., 2016; Liu et al., 2016). Note that our choice of Stein operator differs from that in Chwialkowski et al. (2016) and Liu et al. (2016). There has been substantial recent research into the use of kernel Stein discrepancies for assessing algorithm performance in the Bayesian computational context (Gorham & Mackey, 2017; Chen et al., 2018, 2019; Singhal et al., 2019; Hodgkinson et al., 2020) and one can also exploit this discrepancy as a diagnostic for the performance of the semi-exact control functional. The diagnostic that we propose to monitor is the product (w^T K_0 w)^{1/2} (a^T K_0 a)^{1/2}. This approach to error estimation was also suggested (outside the Bayesian context) in Section 5.1 of Fasshauer (2011).
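The diagnostic is cheap to compute once the linear system has been solved; the sketch below uses a synthetic positive-definite matrix (a Gaussian kernel matrix plus jitter) as a stand-in for the Stein kernel matrix K_0, and checks numerically the identity b_1 = w^T f implied by (14) and (15).

```python
import numpy as np

# Sketch of the diagnostic (w' K0 w)^{1/2} (a' K0 a)^{1/2}; K0 is a synthetic
# positive-definite stand-in, not a Stein kernel matrix from a real target.
n, m = 20, 2
x = np.linspace(-3.0, 3.0, n)
K0 = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(n)
P = np.column_stack([np.ones(n), -x])
f = np.sin(x) + x ** 2

# Coefficients a, b of the interpolant from the saddle-point system.
A = np.block([[K0, P], [P.T, np.zeros((m, m))]])
sol = np.linalg.solve(A, np.concatenate([f, np.zeros(m)]))
a, b = sol[:n], sol[n:]

# Cubature weights w' = e_1' (P' K0^{-1} P)^{-1} P' K0^{-1}, as in (15).
M = P.T @ np.linalg.solve(K0, np.eye(n))
w = np.linalg.solve(M @ P, M)[0]

diagnostic = np.sqrt(w @ K0 @ w) * np.sqrt(a @ K0 @ a)
assert abs(b[0] - w @ f) < 1e-5     # b_1 coincides with the weighted sum w'f
assert diagnostic >= 0.0
```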
Empirical results in Figure 5 suggest that this diagnostic provides a conservative approximation of the actual error. Further work is required to establish whether this diagnostic detects convergence and non-convergence in general.

Consistency of the Estimator
In what follows we consider an increasing number n of samples x^{(i)} whilst the finite-dimensional space Φ, with basis {φ_1, . . . , φ_{m−1}}, is held fixed. The samples x^{(i)} will be assumed to arise from a V-uniformly ergodic Markov chain; the reader is referred to Chapter 16 of Meyn & Tweedie (2012) for the relevant background. Recall that the points (x^{(i)})_{i=1}^{n} are called F-unisolvent if the matrix in (12) has full rank. It will be convenient to introduce an inner product ⟨u, v⟩_n = u^T K_0^{-1} v and associated norm ‖u‖_n = ⟨u, u⟩_n^{1/2}. Let Π be the matrix that projects orthogonally onto the columns of [Ψ]_{i,j} := Lφ_j(x^{(i)}) with respect to the ⟨·, ·⟩_n inner product.
Theorem 1. Let the hypotheses of Corollary 1 hold and let f be any function for which |f|_{k_0,F} < ∞. Let q be a probability density with p/q > 0 on R^d and consider a q-invariant Markov chain (x^{(i)})_{i=1}^{n}, assumed to be V-uniformly ergodic for some V : R^d → [1, ∞), such that

A1. sup_{x ∈ R^d} V(x)^{−r} {p(x)/q(x)}^4 k_0(x, x)^2 < ∞ for some 0 < r < 1;

A2. the points (x^{(i)})_{i=1}^{n} are almost surely distinct and F-unisolvent;

A3. almost surely, lim inf_{n→∞} ‖1 − Π1‖_n > 0, where 1 denotes the n-dimensional vector of ones.
Then |I_SECF(f) − I(f)| = O_P(n^{−1/2}). This demonstrates that, even in the biased sampling setting, the proposed estimator is consistent. The proof is provided in Appendix 12 and exploits a recent theoretical contribution in Hodgkinson et al. (2020). Assumption A1 serves to ensure that q is similar enough to p that a q-invariant Markov chain will also explore the high probability regions of p, as discussed in Hodgkinson et al. (2020). Sufficient conditions for V-uniform ergodicity are necessarily Markov chain dependent. The case of the Metropolis-adjusted Langevin algorithm is discussed in Roberts & Tweedie (1996) and Chen et al. (2019); in particular, Theorem 9 of Chen et al. (2019) provides sufficient conditions for V-uniform ergodicity with V(x) = exp(s‖x‖) for all s > 0. Under these conditions, and with the rational quadratic kernel k considered in Section 3, we have k_0(x, x) = O(‖∇_x log p(x)‖^2) and therefore A1 is satisfied whenever {p(x)/q(x)} ‖∇_x log p(x)‖ = O(exp(t‖x‖)) for some t > 0; a weak requirement. Assumption A2 ensures that the finite sample size bound (18) is almost surely well-defined. Assumption A3 ensures the points in the sequence (x^{(i)})_{i=1}^{n} distinguish (asymptotically) the constant function from the functions {φ_i}_{i=1}^{m−1}, which is a weak technical requirement.

Discussion
Several possible extensions of the proposed method can be considered. For example, the parametric component Φ could be adapted to the particular f and p using a dimensionality reduction method. Likewise, extending cross-validation to encompass the choice of kernel and even the choice of control variate or control functional estimator may be useful. The potential for alternatives to the Nyström approximation to further improve scalability of the method can also be explored. In terms of the points x^{(i)} on which the estimator is defined, these could be optimally selected to minimize the error bound in (18), for example following the approaches of Chen et al. (2018, 2019). Finally, we highlight a possible extension to the case where only stochastic gradient information is available, following Friel et al. (2016) in the parametric context.

Acknowledgement
CJO is grateful to Yvik Swan for discussion of Stein's method. TK was supported by the Aalto ELEC Doctoral School and the Vilho, Yrjö and Kalle Väisälä Foundation. MG was supported by a Royal Academy of Engineering Research Chair and by EPSRC grants EP/T000414/1, EP/R018413/2, EP/P020720/2, EP/R034710/1 and EP/R004889/1. TK, MG and CJO were supported by the Lloyd's Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. CN and LFS were supported by EPSRC grants EP/S00159X/1 and EP/V022636/1. The authors are grateful for feedback from three anonymous Reviewers, an Associate Editor and an Editor.

Supplementary material
Supplementary material available at Biometrika online includes further technical details, additional simulation results and the proofs of results stated in the main text.