Chi-squared test for independence

This method deals with the parametric modelling of a probability distribution for a random vector \vect{X} = \left( X^1,\ldots,X^{n_X} \right). We seek here to detect possible dependencies between two components X^i and X^j. The \chi^2 test of independence for discrete probability distributions can be used for this purpose.

As we are considering discrete distributions, the possible values for X^i and X^j respectively belong to the discrete sets \cE_i and \cE_j. The \chi^2 test of independence can be applied when we have a sample consisting of N pairs \left\{ (x^i_1,x^j_1),(x^i_2,x^j_2),\ldots,(x^i_N,x^j_N) \right\}. We denote:

  • n_{u,v} the number of pairs in the sample such that x^i_k = u and x^j_k = v,

  • n^i_{u} the number of pairs in the sample such that x^i_k = u,

  • n^j_{v} the number of pairs in the sample such that x^j_k = v.

The test thus uses the quantity denoted \widehat{D}_N^2:

\begin{aligned}
    \widehat{D}_N^2 = N \sum_{u \in \cE_i}\sum_{v \in \cE_j}\frac{\left(p_{u,v} - p^i_{u}\,p^j_{v}\right)^2}{p^i_{u}\,p^j_{v}}
  \end{aligned}

where:

\begin{aligned}
    p_{u,v} = \frac{n_{u,v}}{N},\ p^i_{u} =  \frac{n^i_{u}}{N},\ p^j_{v} =  \frac{n^j_{v}}{N}
  \end{aligned}
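By way of illustration, here is a minimal Python sketch (assuming NumPy is available; the data and the variable names are hypothetical) that computes \widehat{D}_N^2 from a sample of pairs according to the definitions above:

import numpy as np

# Hypothetical sample of N pairs (x^i_k, x^j_k) of discrete values
xi = np.array([0, 1, 1, 0, 2, 1, 0, 2, 2, 1])
xj = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
N = len(xi)

levels_i = np.unique(xi)  # discrete set E_i
levels_j = np.unique(xj)  # discrete set E_j

# Empirical frequencies p_{u,v} and the marginals p^i_u, p^j_v
p_uv = np.array([[np.mean((xi == u) & (xj == v)) for v in levels_j]
                 for u in levels_i])
p_u = p_uv.sum(axis=1)
p_v = p_uv.sum(axis=0)

# Test statistic D_N^2
D2 = N * np.sum((p_uv - np.outer(p_u, p_v)) ** 2 / np.outer(p_u, p_v))
print(D2)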

Under the hypothesis of independence, the probability distribution of the distance \widehat{D}_N^2 is asymptotically known (i.e. as the size of the sample tends to infinity): it is the \chi^2 distribution with \left(\left|\cE_i\right|-1\right)\left(\left|\cE_j\right|-1\right) degrees of freedom. If N is sufficiently large, one can therefore, for a given probability \alpha, compute the threshold (critical value) d_\alpha such that:

  • if \widehat{D}_N^2 > d_{\alpha}, we conclude, with a risk of error \alpha, that a dependency exists between X^i and X^j,

  • if \widehat{D}_N^2 \leq d_{\alpha}, the independence hypothesis is considered acceptable.

An important notion is the so-called “p-value” of the test. This quantity is equal to the limit error probability \alpha_\textrm{lim} under which the independence hypothesis is rejected. Thus, independence is assumed if and only if \alpha_\textrm{lim} is greater than the value \alpha desired by the user. Note that the greater the difference \alpha_\textrm{lim} - \alpha, the more robust the decision.
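As a sketch of the decision rule and of the p-value computation (assuming SciPy is available; the values of r, c, D2 and alpha below are hypothetical placeholders), the threshold d_\alpha and \alpha_\textrm{lim} can be obtained from the asymptotic \chi^2 distribution with \left(\left|\cE_i\right|-1\right)\left(\left|\cE_j\right|-1\right) degrees of freedom:

from scipy.stats import chi2

alpha = 0.05                  # risk of error chosen by the user
r, c = 3, 2                   # numbers of levels in E_i and E_j (hypothetical)
D2 = 4.2                      # observed value of the statistic D_N^2

dof = (r - 1) * (c - 1)       # degrees of freedom of the asymptotic chi^2 law
d_alpha = chi2.ppf(1.0 - alpha, dof)  # threshold (critical value) d_alpha
p_value = chi2.sf(D2, dof)            # p-value alpha_lim

accept_independence = D2 <= d_alpha   # equivalently: p_value > alpha
print(d_alpha, p_value, accept_independence)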

This method is also referred to in the literature as the \chi^2 contingency test.

Examples:
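For instance, the whole procedure can be sketched with scipy.stats.chi2_contingency, a SciPy routine for this contingency-table test; the counts below are hypothetical:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: entry (u, v) contains the count n_{u,v}
table = np.array([[12, 7, 4],
                  [8, 15, 9]])

# Returns the statistic, the p-value, the degrees of freedom and
# the expected counts under the independence hypothesis
statistic, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05
if p_value > alpha:
    print("Independence accepted (p-value = %.3f)" % p_value)
else:
    print("Independence rejected (p-value = %.3f)" % p_value)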