Bayesian Information Criterion (BIC)
This method deals with the modelling of the probability distribution of a random vector $\boldsymbol{X} = (X^1, \ldots, X^{n_X})$. It seeks to rank various candidate distributions by using a sample of data $\{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}$. The Bayesian Information Criterion (BIC) allows one to perform this ranking in the one-dimensional case $n_X = 1$.
Let us limit the case to $n_X = 1$. Thus we denote $\boldsymbol{X} = X^1 = X$. Moreover, let us denote by $\mathcal{M}_1, \dots, \mathcal{M}_K$ the parametric models envisaged by the user among the available parametric models. We suppose here that the parameters of these models have been estimated beforehand by maximum likelihood on the basis of the sample $\{x_1, x_2, \ldots, x_N\}$. We denote by $L_i$ the maximized likelihood for the model $\mathcal{M}_i$.
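As a minimal sketch of what "maximized likelihood" means in practice, the snippet below evaluates the maximized log-likelihood $\log L_i$ for two candidate models whose maximum likelihood estimates have a closed form (a Normal and an Exponential distribution). The function names and the sample values are hypothetical and chosen only for illustration; they do not come from any particular library.

```python
import math


def normal_max_loglik(sample):
    """Maximized log-likelihood of a Normal(mu, sigma) model (2 parameters)."""
    n = len(sample)
    mu_hat = sum(sample) / n
    # The maximum likelihood variance estimate divides by n, not n - 1.
    var_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    # Plugging the estimates into the log-density and summing gives:
    return -0.5 * n * (math.log(2.0 * math.pi * var_hat) + 1.0)


def exponential_max_loglik(sample):
    """Maximized log-likelihood of an Exponential(rate) model (1 parameter).

    Assumes all data points are strictly positive.
    """
    n = len(sample)
    rate_hat = n / sum(sample)
    # sum(log(rate) - rate * x) simplifies to n * log(rate) - n at the estimate.
    return n * math.log(rate_hat) - n


# Hypothetical positive-valued sample, for illustration only.
sample = [0.8, 1.9, 0.3, 2.7, 1.1, 0.5, 3.4, 0.9]

log_likelihoods = {
    "Normal": normal_max_loglik(sample),
    "Exponential": exponential_max_loglik(sample),
}
for name, value in log_likelihoods.items():
    print(f"{name}: log L = {value:.3f}")
```

Working with $\log L_i$ rather than $L_i$ avoids numerical underflow, since a product of many small densities quickly becomes too small to represent.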
By definition of the likelihood, the higher the maximized likelihood $L_i$, the better the model describes the sample. However, using the likelihood alone as a criterion to rank the candidate probability distributions would involve a risk: one would almost always favor complex models involving many parameters. Such models do provide a large number of degrees of freedom that can be used to fit the sample, but one has to keep in mind that complex models may be less robust than simpler models with fewer parameters. Indeed, the limited available information ($N$ data points) does not allow one to estimate too many parameters robustly.
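The bias toward complex models can be illustrated with two nested models: a zero-mean Gaussian (one parameter, the standard deviation) and a Gaussian with a free mean (two parameters). The sketch below, which uses a hypothetical helper and made-up data, shows that the richer model never reaches a lower maximized log-likelihood, even when the data are centred close to zero.

```python
import math


def gaussian_max_loglik(sample, free_mean):
    """Maximized Gaussian log-likelihood, with the mean either free or fixed at 0."""
    n = len(sample)
    mu = sum(sample) / n if free_mean else 0.0
    # Maximum likelihood variance around the chosen mean.
    var_hat = sum((x - mu) ** 2 for x in sample) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var_hat) + 1.0)


# Hypothetical sample roughly centred on zero, for illustration only.
sample = [0.21, -0.35, 0.10, 0.48, -0.12, 0.05]

ll_simple = gaussian_max_loglik(sample, free_mean=False)  # 1 parameter
ll_rich = gaussian_max_loglik(sample, free_mean=True)     # 2 parameters

# The variance around the sample mean is never larger than the mean of
# squares, so the two-parameter model always fits at least as well.
print(ll_rich >= ll_simple)
```

The gain of the richer model here comes only from absorbing sampling noise in the mean, which is exactly the kind of improvement a likelihood-only ranking cannot tell apart from a genuinely better model.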
The BIC can be used to avoid this problem. The principle is to rank the models $\mathcal{M}_1, \dots, \mathcal{M}_K$ according to the following quantity:

$$\mathrm{BIC}_i = -2 \log(L_i) + p_i \log(N)$$

where $L_i$ is the maximized likelihood of the model $\mathcal{M}_i$, $p_i$ denotes the number of parameters being adjusted for $\mathcal{M}_i$, and $N$ is the sample size. The smaller $\mathrm{BIC}_i$, the better the model. The idea is to introduce a penalization term that increases with the number of parameters to be estimated. A complex model will then have a good score only if the gain in terms of likelihood is high enough to justify the number of parameters used.
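The ranking step can be sketched as follows, assuming the standard unnormalized form $\mathrm{BIC}_i = -2 \log(L_i) + p_i \log(N)$. Some implementations divide this quantity by $N$, which leaves the ranking unchanged because $N$ is the same for every candidate. The model names, log-likelihood values, and sample size below are hypothetical.

```python
import math


def bic(max_loglik, n_params, n_obs):
    """Standard BIC: -2 log(L) + p log(N). Smaller is better."""
    return -2.0 * max_loglik + n_params * math.log(n_obs)


# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "A": (-12.4, 1),
    "B": (-11.9, 2),
    "C": (-11.8, 3),
}
n_obs = 20  # hypothetical sample size

scores = {name: bic(ll, p, n_obs) for name, (ll, p) in candidates.items()}

# Rank from best (smallest BIC) to worst.
ranking = sorted(scores, key=scores.get)
for name in ranking:
    print(f"model {name}: BIC = {scores[name]:.2f}")
```

In this made-up example, model C has the highest likelihood but ranks last: its small gain in $\log L_i$ does not offset the penalty for two extra parameters.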
The term “Bayesian Information Criterion” comes from the interpretation of the quantity $\mathrm{BIC}_i$. In a Bayesian context, the unknown “true” model may be seen as a random variable. Suppose now that the user does not have any informative prior knowledge of which model is more relevant among $\mathcal{M}_1, \dots, \mathcal{M}_K$; all the models are thus equally likely from the point of view of the user. Then, one can show that $\mathrm{BIC}_i$ is, up to a negative scale factor and an additive constant common to all models, an approximation of the logarithm of the posterior probability of the model $\mathcal{M}_i$. This is why smaller values indicate more plausible models.
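Under the equal-prior assumption above, and assuming the standard unnormalized form $\mathrm{BIC}_i = -2 \log(L_i) + p_i \log(N)$, the approximate posterior probability of each model is proportional to $\exp(-\mathrm{BIC}_i / 2)$. The sketch below turns a list of BIC values into such approximate weights; the function name and the input values are hypothetical, and the result is only as good as the large-sample approximation behind the BIC.

```python
import math


def approximate_posterior_weights(bic_values):
    """Approximate posterior model probabilities from unnormalized BIC values.

    Assumes equal prior probabilities for all candidate models.
    """
    # Subtract the smallest BIC before exponentiating to avoid underflow;
    # the shift cancels out in the normalization.
    best = min(bic_values)
    unnormalized = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical BIC values for three candidate models.
weights = approximate_posterior_weights([27.8, 29.8, 32.6])
for index, weight in enumerate(weights, start=1):
    print(f"model {index}: approximate posterior weight = {weight:.3f}")
```

Only differences between BIC values matter here, so adding the same constant to every score leaves the weights unchanged.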