Bayesian Information Criterion (BIC)

This method deals with the modelling of the probability distribution of a random vector \vect{X} = \left( X^1,\ldots,X^{n_X} \right). It seeks to rank candidate parametric distributions by means of a sample of data \left\{ \vect{x}_1,\vect{x}_2,\ldots,\vect{x}_n \right\}. The Bayesian Information Criterion (BIC) provides such a ranking in the one-dimensional case n_X = 1.

Let us consider the particular case where n_X = 1, and thus denote \vect{X} = X^1 = X. Moreover, let us denote by \cM_1, \dots, \cM_K the candidate parametric models. We suppose here that the parameters of these models have been estimated previously by maximum likelihood on the basis of the sample \left\{ \vect{x}_1,\vect{x}_2,\ldots,\vect{x}_n \right\}. We denote by L_i the maximized likelihood of the model \cM_i.
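
As an illustration, the following sketch shows how the maximized log-likelihoods \log(L_i) could be obtained for a few candidate models. It assumes SciPy (whose fit method performs maximum-likelihood estimation by default); the sample and the list of candidates are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical one-dimensional sample {x_1, ..., x_n}
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=200)

# Hypothetical candidate parametric models M_1, ..., M_K
candidates = {"normal": stats.norm, "lognormal": stats.lognorm, "gamma": stats.gamma}

log_likelihood = {}
n_params = {}
for name, dist in candidates.items():
    params = dist.fit(x)                                     # maximum-likelihood estimates
    log_likelihood[name] = np.sum(dist.logpdf(x, *params))   # log(L_i)
    n_params[name] = len(params)                             # p_i: number of adjusted parameters
```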

By definition of the likelihood, the higher L_i, the better the model describes the sample. However, using the likelihood alone to rank the candidate probability distributions would involve a risk: one would almost always favor complex models with many parameters. While such models indeed provide a large number of degrees of freedom that can be used to fit the sample, one has to keep in mind that complex models may be less robust than simpler models with fewer parameters. Indeed, the limited available information (n data points) does not allow one to robustly estimate too many parameters.

The BIC criterion can be used to avoid this problem. The principle is to rank \cM_1,\dots,\cM_K according to the following quantity:

\begin{aligned}
    \textrm{BIC}_i = -2 \frac{\log(L_i)}{n} + \frac{p_i \log(n)}{n}
  \end{aligned}

where p_i denotes the number of parameters adjusted for the model \cM_i. The smaller \textrm{BIC}_i, the better the model. The idea is to introduce a penalization term that increases with the number of parameters to be estimated: a complex model will then obtain a good score only if the gain in terms of likelihood is high enough to justify the number of parameters used.
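
Continuing the hypothetical sketch above, the \textrm{BIC}_i values can be computed directly from the maximized log-likelihoods using this formula, and the models ranked from smallest to largest BIC:

```python
n = len(x)

# BIC_i = -2 log(L_i) / n + p_i log(n) / n  (the smaller, the better)
bic = {
    name: -2.0 * log_likelihood[name] / n + n_params[name] * np.log(n) / n
    for name in candidates
}

# Rank the candidate models: the first one has the smallest BIC
for name in sorted(bic, key=bic.get):
    print(f"{name}: BIC = {bic[name]:.4f}")
```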

The term “Bayesian Information Criterion” comes from the interpretation of the quantity \textrm{BIC}_i. In a Bayesian context, the unknown “true” model may be seen as a random variable. Suppose now that the user does not have any informative prior knowledge of which model is more relevant among \cM_1, \dots, \cM_K; all the models are thus equally likely from the point of view of the user. Then, one can show that \textrm{BIC}_i is, up to sign and scaling, an approximation of the logarithm of the posterior probability of the model \cM_i.

This criterion is valuable for rejecting models which are not relevant, but can be tricky to interpret in some cases. For example, if two models have very close BIC values, both should be considered instead of keeping only the model with the lowest BIC.