Bayesian Information Criterion (BIC)
This method deals with the modelling of the probability distribution of a random vector $X = (X^1, \ldots, X^{n_X})$. It seeks to rank candidate parametric distributions by using a sample of data $(x_1, \ldots, x_N)$. The Bayesian Information Criterion (BIC) allows one to perform this ranking in the one-dimensional case.
Let us limit the case to $n_X = 1$. Thus we denote $X = X^1$. Moreover, let us denote by $\mathcal{M}_1, \ldots, \mathcal{M}_K$ the parametric models envisaged by the user among the available parametric models. We suppose here that the parameters of these models have been estimated previously by Maximum Likelihood on the basis of the sample $(x_1, \ldots, x_N)$. We denote by $L_i$ the maximized likelihood for the model $\mathcal{M}_i$.
By definition of the likelihood, the higher $L_i$, the better the model describes the sample. However, using the likelihood as a criterion to rank the candidate probability distributions would involve a risk: one would almost always favor complex models involving many parameters. While such models indeed provide a large number of degrees of freedom that can be used to fit the sample, one has to keep in mind that complex models may be less robust than simpler models with fewer parameters. Indeed, the limited available information ($N$ data points) does not allow a robust estimation of too many parameters.
The BIC criterion can be used to avoid this problem. The principle is to rank the models $\mathcal{M}_i$ according to the following quantity:

$$\mathrm{BIC}_i = -2 \log L_i + p_i \log N$$

where $p_i$ denotes the number of parameters being adjusted for the model $\mathcal{M}_i$. The smaller $\mathrm{BIC}_i$, the better the model. Note that the idea is to introduce a penalization term that increases with the number of parameters to be estimated. A complex model will then have a good score only if the gain in terms of likelihood is high enough to justify the number of parameters used.
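As a minimal sketch of this ranking, the snippet below fits two candidate models to a synthetic sample by Maximum Likelihood and compares their BIC scores. The use of SciPy, the choice of candidate families (normal and shifted exponential), and the simulated data are illustrative assumptions, not part of the original method description.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical one-dimensional sample (here drawn from a normal distribution).
sample = rng.normal(loc=2.0, scale=0.5, size=200)
N = len(sample)

def bic(log_likelihood, n_params, n_points):
    """BIC = -2 log L + p log N; the smaller, the better the model."""
    return -2.0 * log_likelihood + n_params * np.log(n_points)

scores = {}

# Normal model: 2 adjusted parameters (mean, standard deviation).
mu, sigma = stats.norm.fit(sample)
scores["normal"] = bic(stats.norm.logpdf(sample, mu, sigma).sum(), 2, N)

# Exponential model: 2 adjusted parameters (location, scale) as fitted by SciPy.
loc, scale = stats.expon.fit(sample)
scores["exponential"] = bic(stats.expon.logpdf(sample, loc, scale).sum(), 2, N)

# Rank the candidates: smallest BIC first.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

Since both candidate models adjust the same number of parameters here, the ranking is driven entirely by the maximized likelihood; the penalty term matters when candidates of different complexity compete.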
The term “Bayesian Information Criterion” comes from the interpretation of the quantity $\mathrm{BIC}_i$. In a Bayesian context, the unknown “true” model may be seen as a random variable. Suppose now that the user does not have any informative prior information on which model is more relevant among $\mathcal{M}_1, \ldots, \mathcal{M}_K$; all the models are thus equally likely from the point of view of the user. Then, one can show that $-\frac{1}{2}\mathrm{BIC}_i$ is an approximation of the logarithm of the posterior probability of the model $\mathcal{M}_i$, up to an additive constant common to all models.
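The interpretation above can be sketched in one chain of equalities, assuming a uniform prior over the models and using the standard Laplace approximation of the marginal likelihood (the constant absorbs the normalization shared by all models):

```latex
\log P(\mathcal{M}_i \mid x_1, \ldots, x_N)
  = \log P(x_1, \ldots, x_N \mid \mathcal{M}_i) + \mathrm{const}
  \approx \log L_i - \frac{p_i}{2} \log N + \mathrm{const}
  = -\tfrac{1}{2}\,\mathrm{BIC}_i + \mathrm{const}.
```

Maximizing this approximate log-posterior is therefore equivalent to minimizing $\mathrm{BIC}_i$, which is why the model with the smallest BIC is retained.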