The Metropolis-Hastings Algorithm¶

Markov chain. Considering a $\sigma$ -algebra $\cA$ on $\Omega$ , a Markov chain is a process ${(X_k)}_{k\in\Nset}$ such that

$\begin{aligned} \forall{}(A,x_0,\ldots,x_{k-1})\in\cA\times\Omega^k \quad \Prob{X_k\in A \,|\, X_0=x_0, \ldots, X_{k-1}=x_{k-1}} = \Prob{X_k\in A \,|\, X_{k-1}=x_{k-1}}. \end{aligned}$

An example is the random walk for which $X_k = X_{k-1} + \varepsilon_k$ where the steps $\varepsilon_k$ are independent and identically distributed.

Transition kernel. A transition kernel on $(\Omega, \cA)$ is a mapping $K: (\Omega, \cA) \rightarrow [0, 1]$ such that

For all $A\in\cA, \quad K(., A)$ is measurable;
For all $x\in\Omega, \quad K(x, .)$ is a probability distribution on $(\Omega, \cA)$ .

The kernel $K$ has density $k$ if $\forall(x,A)\in\Omega\times\cA \quad K(x, A) = \displaystyle\int_A \: k(x, y) \mbox{d}y$ .

${(X_k)}_{k\in\Nset}$ is a homogeneous Markov Chain of transition $K$ if $\forall(A,x)\in\cA\times\Omega \quad \Prob{X_k\in{}A | X_{k-1}=x} = K(x, A)$ .

Some Notations. Let ${(X_k)}_{k\in\Nset}$ be a homogeneous Markov Chain of transition $K$ on $(\Omega, \cA)$ with initial distribution $\nu$ (that is $X_0 \sim \nu$ ):

$K_\nu$ denotes the probability distribution of the Markov Chain ${(X_k)}_{k\in\Nset}$ ;
$\nu{}K^k$ denotes the probability distribution of $X_k$ ( $X_k \sim \nu{}K^k$ );
$K^k$ denotes the mapping defined by $K^k(x,A) = \Prob{X_k\in{}A|X_0=x}$ for all $(x,A)\in\Omega\times\cA$ .

Total variation convergence. A Markov Chain of distribution $K_\nu$ is said to converge in total variation distance towards the distribution $t$ if

$\begin{aligned} \lim_{k\to+\infty} \sup_{A\in\cA} \left| \nu{}K^k(A) - t(A) \right| = 0. \end{aligned}$

Then the notation used here is $\nu{}K^k \rightarrow_{TV} t$ .

Some interesting properties. Let $t$ be a (target) distribution on $(\Omega, \cA)$ , then a transition kernel $K$ is said to be:

$t$ -invariant if $t{}K = t$ ;
$t$ -irreducible if, $\forall(A,x)\in\Omega\times\cA$ such that $t(A)>0$ , $\exists{}k\in\cN^* \quad {}K^k(x, A) > 0$ holds.

Markov Chain Monte-Carlo techniques allows one to sample and integrate according to a distribution $t$ which is only known up to a multiplicative constant. This situation is common in Bayesian statistics where the “target” distribution, the posterior one $t(\vect{\theta})=\pi(\vect{\theta} | \mat{y})$ , is proportional to the product of prior and likelihood: see equation (1).

In particular, given a “target” distribution $t$ and a $t$ -irreducible kernel transition $Q$ , the Metropolis-Hastings algorithm produces a Markov chain ${(X_k)}_{k\in\Nset}$ of distribution $K_\nu$ with the following properties:

the transition kernel of the Markov chain is $t$ -invariant;
$\nu{}K^k \rightarrow_{TV} t$ ;
the Markov chain satisfies the ergodic theorem: let $\phi$ be a real-valued function such that $\mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] <+\infty$ , then, whatever the initial distribution $\nu$ is:

$\begin{aligned} \displaystyle\frac{1}{n} \sum_{k=1}^n \: \phi(X_k) \tendto{k}{+\infty} \mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] \mbox{ almost surely}. \end{aligned}$

In that sense, simulating ${(X_k)}_{k\in\Nset}$ amounts to sampling according to $t$ and can be used to integrate relatively to the probability measure $t$ . Let us remark that the ergodic theorem implies that $\displaystyle\frac{1}{n} \sum_{k=1}^n \: \fcar{A}{X_k} \tendto{k}{+\infty} \ProbCond{X\sim{}t}{X\in{}A}$ almost surely.

By abusing the notation, $t(x)$ represents, in the remainder of this section, a function of $x$ which is proportional to the PDF of the target distribution $t$ . Given a transition kernel $Q$ of density $q$ , the scheme of the Metropolis-Hastings algorithm is the following (lower case letters are used hereafter for both random variables and realizations as usual in the Bayesian literature):

Draw $x_0 \sim \nu$ and set $k = 1$ .
Draw a candidate for $x_k$ according to the given transition kernel $Q$ : $\tilde{x} \sim Q(x_{k-1}, .)$ .
Compute the ratio $\rho = \displaystyle\frac{t(\tilde{x})/q(x_{k-1},\tilde{x})} {t(x_{k-1})/q(\tilde{x},x_{k-1})}$ .
Draw $u \sim \cU([0, 1])$ ; if $u \leq \rho$ then set $x_k = \tilde{x}$ , otherwise set $x_k = x_{k-1}$ .
Set $k=k+1$ and go back to 1).

Of course, if $t$ is replaced by a different function of $x$ which is proportional to it, the algorithm keeps unchanged, since $t$ only takes part in the latter in the ratio $\frac{t(\tilde{x})}{t(x_{k-1})}$ . Moreover, if $Q$ proposes some candidates in a uniform manner (constant density $q$ ), the candidate $\tilde{x}$ is accepted according to a ratio $\rho$ which reduces to the previous “natural” ratio $\frac{t(\tilde{x})}{t(x_{k-1})}$ of PDF. The introduction of $q$ in the ratio $\rho$ prevents from the bias of a non-uniform proposition of candidates which would favor some areas of $\Omega$ .

The $t$ -invariance is ensured by the symmetry of the expression of $\rho$ ( $t$ -reversibility).

In practice, $Q$ is specified as a random walk ( $\exists{}q_{RW}$ such that $q(x,y)=q_{RW}(x-y)$ ) or as a independent sampling ( $\exists{}q_{IS}$ such that $q(x,y)=q_{IS}(y)$ ), or as a mixture of random walk and independent sampling.

The important property the practitioner have to keep in mind when choosing the transition kernel $Q$ is the $t$ -irreducibility. Moreover, for efficient convergence, $Q$ has to be chosen so as to explore quickly the whole support of $t$ without conducting to a too small acceptance ratio (the ratio of accepted candidates $\tilde{x}$ ). It is usually recommended that this latter ratio is about $0.2$ but such a ratio is neither a warranty of efficiency, nor a substitute to a convergence diagnosis.

OpenTURNS

An Open source initiative for the Treatment of Uncertainties, Risks'N Statistics

Previous topic

Next topic

This Page

The Metropolis-Hastings Algorithm¶