The Metropolis-Hastings Algorithm¶

Markov chain. Considering a $\sigma$ -algebra $\cA$ on $\Omega$ , a Markov chain is a process ${(X_k)}_{k\in\Nset}$ such that

$\begin{aligned} \forall{}(A,x_0,\ldots,x_{k-1})\in\cA\times\Omega^k \quad \Prob{X_k\in A \,|\, X_0=x_0, \ldots, X_{k-1}=x_{k-1}} = \Prob{X_k\in A \,|\, X_{k-1}=x_{k-1}}. \end{aligned}$

An example is the random walk for which $X_k = X_{k-1} + \varepsilon_k$ where the steps $\varepsilon_k$ are independent and identically distributed.

Transition kernel. A transition kernel on $(\Omega, \cA)$ is a mapping $K: (\Omega, \cA) \rightarrow [0, 1]$ such that

For all $A\in\cA, \quad K(., A)$ is measurable;
For all $x\in\Omega, \quad K(x, .)$ is a probability distribution on $(\Omega, \cA)$ .

The kernel $K$ has density $k$ if $\forall(x,A)\in\Omega\times\cA \quad K(x, A) = \displaystyle\int_A \: k(x, y) \mbox{d}y$ .

${(X_k)}_{k\in\Nset}$ is a homogeneous Markov Chain of transition $K$ if $\forall(A,x)\in\cA\times\Omega \quad \Prob{X_k\in{}A | X_{k-1}=x} = K(x, A)$ .

Some Notations. Let ${(X_k)}_{k\in\Nset}$ be a homogeneous Markov Chain of transition $K$ on $(\Omega, \cA)$ with initial distribution $\nu$ (that is $X_0 \sim \nu$ ):

$K_\nu$ denotes the probability distribution of the Markov Chain ${(X_k)}_{k\in\Nset}$ ;
$\nu{}K^k$ denotes the probability distribution of $X_k$ ( $X_k \sim \nu{}K^k$ );
$K^k$ denotes the mapping defined by $K^k(x,A) = \Prob{X_k\in{}A|X_0=x}$ for all $(x,A)\in\Omega\times\cA$ .

Total variation convergence. A Markov Chain of distribution $K_\nu$ is said to converge in total variation distance towards the distribution $t$ if

$\begin{aligned} \lim_{k\to+\infty} \sup_{A\in\cA} \left| \nu{}K^k(A) - t(A) \right| = 0. \end{aligned}$

Then the notation used here is $\nu{}K^k \rightarrow_{TV} t$ .

Some interesting properties. Let $t$ be a (target) distribution on $(\Omega, \cA)$ , then a transition kernel $K$ is said to be:

$t$ -invariant if $t{}K = t$ ;
$t$ -irreducible if, $\forall(A,x)\in\Omega\times\cA$ such that $t(A)>0$ , $\exists{}k\in\cN^* \quad {}K^k(x, A) > 0$ holds.

Markov Chain Monte-Carlo techniques allows one to sample and integrate according to a distribution $t$ which is only known up to a multiplicative constant. This situation is common in Bayesian statistics where the “target” distribution, the posterior one $t(\vect{\theta})=\pi(\vect{\theta} | \mat{y})$ , is proportional to the product of prior and likelihood: see equation (1).

In particular, given a “target” distribution $t$ and a $t$ -irreducible kernel transition $Q$ , the Metropolis-Hastings algorithm produces a Markov chain ${(X_k)}_{k\in\Nset}$ of distribution $K_\nu$ with the following properties:

the transition kernel of the Markov chain is $t$ -invariant;
$\nu{}K^k \rightarrow_{TV} t$ ;
the Markov chain satisfies the ergodic theorem: let $\phi$ be a real-valued function such that $\mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] <+\infty$ , then, whatever the initial distribution $\nu$ is:

$\begin{aligned} \displaystyle\frac{1}{n} \sum_{k=1}^n \: \phi(X_k) \tendto{k}{+\infty} \mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] \mbox{ almost surely}. \end{aligned}$

In that sense, simulating ${(X_k)}_{k\in\Nset}$ amounts to sampling according to $t$ and can be used to integrate relatively to the probability measure $t$ . Let us remark that the ergodic theorem implies that $\displaystyle\frac{1}{n} \sum_{k=1}^n \: \fcar{A}{X_k} \tendto{k}{+\infty} \ProbCond{X\sim{}t}{X\in{}A}$ almost surely.

By abusing the notation, $t(x)$ represents, in the remainder of this section, a function of $x$ which is proportional to the PDF of the target distribution $t$ . Given a transition kernel $Q$ of density $q$ , the scheme of the Metropolis-Hastings algorithm is the following (lower case letters are used hereafter for both random variables and realizations as usual in the Bayesian literature):

Draw $x_0 \sim \nu$ and set $k = 1$ .
Draw a candidate for $x_k$ according to the given transition kernel $Q$ : $\tilde{x} \sim Q(x_{k-1}, .)$ .
Compute the ratio $\rho = \displaystyle\frac{t(\tilde{x})/q(x_{k-1},\tilde{x})} {t(x_{k-1})/q(\tilde{x},x_{k-1})}$ .
Draw $u \sim \cU([0, 1])$ ; if $u \leq \rho$ then set $x_k = \tilde{x}$ , otherwise set $x_k = x_{k-1}$ .
Set $k=k+1$ and go back to 1).

Of course, if $t$ is replaced by a different function of $x$ which is proportional to it, the algorithm keeps unchanged, since $t$ only takes part in the latter in the ratio $\frac{t(\tilde{x})}{t(x_{k-1})}$ . Moreover, if $Q$ proposes some candidates in a uniform manner (constant density $q$ ), the candidate $\tilde{x}$ is accepted according to a ratio $\rho$ which reduces to the previous “natural” ratio $\frac{t(\tilde{x})}{t(x_{k-1})}$ of PDF. The introduction of $q$ in the ratio $\rho$ prevents from the bias of a non-uniform proposition of candidates which would favor some areas of $\Omega$ .

The $t$ -invariance is ensured by the symmetry of the expression of $\rho$ ( $t$ -reversibility).

In practice, $Q$ is specified as a random walk ( $\exists{}q_{RW}$ such that $q(x,y)=q_{RW}(x-y)$ ) or as a independent sampling ( $\exists{}q_{IS}$ such that $q(x,y)=q_{IS}(y)$ ), or as a mixture of random walk and independent sampling.

The important property the practitioner have to keep in mind when choosing the transition kernel $Q$ is the $t$ -irreducibility. Moreover, for efficient convergence, $Q$ has to be chosen so as to explore quickly the whole support of $t$ without conducting to a too small acceptance ratio (the ratio of accepted candidates $\tilde{x}$ ). It is usually recommended that this latter ratio is about $0.2$ but such a ratio is neither a warranty of efficiency, nor a substitute to a convergence diagnosis.

API:

Examples:

References:

Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods (Second Edition), Springer.
Meyn, S. and Tweedie R.L. (2009). Markov Chains and Stochastic Stability (Second Edition), Cambridge University Press.

OpenTURNS

An Open source initiative for the Treatment of Uncertainties, Risks'N Statistics

Previous topic

Next topic

This Page

The Metropolis-Hastings Algorithm¶