The Metropolis-Hastings Algorithm

Markov chain. Considering a \sigma-algebra \cA on \Omega, a Markov chain is a process {(X_k)}_{k\in\Nset} such that

    \quad \Prob{X_k\in A \,|\, X_0=x_0, \ldots, X_{k-1}=x_{k-1}}
    = \Prob{X_k\in A \,|\, X_{k-1}=x_{k-1}}.

An example is the random walk for which X_k = X_{k-1} + \varepsilon_k where the steps \varepsilon_k are independent and identically distributed.

Transition kernel. A transition kernel on (\Omega, \cA) is a mapping K: (\Omega, \cA) \rightarrow [0, 1] such that
  • For all A\in\cA, \quad K(., A) is measurable;

  • For all x\in\Omega, \quad K(x, .) is a probability distribution on (\Omega, \cA).

The kernel K has density k if \forall(x,A)\in\Omega\times\cA \quad K(x, A) = \displaystyle\int_A \: k(x, y) \mbox{d}y.

{(X_k)}_{k\in\Nset} is a homogeneous Markov Chain of transition K if \forall(A,x)\in\cA\times\Omega \quad \Prob{X_k\in{}A | X_{k-1}=x} = K(x, A).
Some Notations. Let {(X_k)}_{k\in\Nset} be a homogeneous Markov Chain of transition K on (\Omega, \cA) with initial distribution \nu (that is X_0 \sim \nu):
  • K_\nu denotes the probability distribution of the Markov Chain {(X_k)}_{k\in\Nset};

  • \nu{}K^k denotes the probability distribution of X_k (X_k \sim \nu{}K^k);

  • K^k denotes the mapping defined by K^k(x,A) = \Prob{X_k\in{}A|X_0=x} for all (x,A)\in\Omega\times\cA.
Total variation convergence. A Markov Chain of distribution K_\nu is said to converge in total variation distance towards the distribution t if

    \lim_{k\to+\infty} \sup_{A\in\cA} \left|
    \nu{}K^k(A) - t(A)
    \right| = 0.

Then the notation used here is \nu{}K^k \rightarrow_{TV} t.

Some interesting properties. Let t be a (target) distribution on (\Omega, \cA), then a transition kernel K is said to be:
  • t-invariant if t{}K = t;

  • t-irreducible if, \forall(A,x)\in\Omega\times\cA such that t(A)>0, \exists{}k\in\cN^* \quad {}K^k(x, A) > 0 holds.

Markov Chain Monte-Carlo techniques allows one to sample and integrate according to a distribution t which is only known up to a multiplicative constant. This situation is common in Bayesian statistics where the “target” distribution, the posterior one t(\vect{\theta})=\pi(\vect{\theta} | \mat{y}), is proportional to the product of prior and likelihood: see equation (1).

In particular, given a “target” distribution t and a t-irreducible kernel transition Q, the Metropolis-Hastings algorithm produces a Markov chain {(X_k)}_{k\in\Nset} of distribution K_\nu with the following properties:

  • the transition kernel of the Markov chain is t-invariant;

  • \nu{}K^k \rightarrow_{TV} t;

  • the Markov chain satisfies the ergodic theorem: let \phi be a real-valued function such that \mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] <+\infty, then, whatever the initial distribution \nu is:

      \displaystyle\frac{1}{n} \sum_{k=1}^n \: \phi(X_k) \tendto{k}{+\infty} \mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right]
      \mbox{ almost surely}.

In that sense, simulating {(X_k)}_{k\in\Nset} amounts to sampling according to t and can be used to integrate relatively to the probability measure t. Let us remark that the ergodic theorem implies that \displaystyle\frac{1}{n} \sum_{k=1}^n \: \fcar{A}{X_k} \tendto{k}{+\infty} \ProbCond{X\sim{}t}{X\in{}A} almost surely.

By abusing the notation, t(x) represents, in the remainder of this section, a function of x which is proportional to the PDF of the target distribution t. Given a transition kernel Q of density q, the scheme of the Metropolis-Hastings algorithm is the following (lower case letters are used hereafter for both random variables and realizations as usual in the Bayesian literature):

  1. Draw x_0 \sim \nu and set k = 1.

  2. Draw a candidate for x_k according to the given transition kernel Q: \tilde{x} \sim Q(x_{k-1}, .).

  3. Compute the ratio \rho = \displaystyle\frac{t(\tilde{x})/q(x_{k-1},\tilde{x})} {t(x_{k-1})/q(\tilde{x},x_{k-1})}.

  4. Draw u \sim \cU([0, 1]); if u \leq \rho then set x_k = \tilde{x}, otherwise set x_k = x_{k-1}.

  5. Set k=k+1 and go back to 1).

Of course, if t is replaced by a different function of x which is proportional to it, the algorithm keeps unchanged, since t only takes part in the latter in the ratio \frac{t(\tilde{x})}{t(x_{k-1})}. Moreover, if Q proposes some candidates in a uniform manner (constant density q), the candidate \tilde{x} is accepted according to a ratio \rho which reduces to the previous “natural” ratio \frac{t(\tilde{x})}{t(x_{k-1})} of PDF. The introduction of q in the ratio \rho prevents from the bias of a non-uniform proposition of candidates which would favor some areas of \Omega.

The t-invariance is ensured by the symmetry of the expression of \rho (t-reversibility).

In practice, Q is specified as a random walk (\exists{}q_{RW} such that q(x,y)=q_{RW}(x-y)) or as a independent sampling (\exists{}q_{IS} such that q(x,y)=q_{IS}(y)), or as a mixture of random walk and independent sampling.

The important property the practitioner have to keep in mind when choosing the transition kernel Q is the t-irreducibility. Moreover, for efficient convergence, Q has to be chosen so as to explore quickly the whole support of t without conducting to a too small acceptance ratio (the ratio of accepted candidates \tilde{x} ). It is usually recommended that this latter ratio is about 0.2 but such a ratio is neither a warranty of efficiency, nor a substitute to a convergence diagnosis.