Chi-squared test for independence

The \chi^2 test can be used to detect a dependency between two discrete random variables.

Let \vect{X} = (X^1, X^2) be a random variable of dimension 2 with values in \{b_1, \dots, b_{\ell} \} \times \{c_1, \dots, c_{r} \}.

We want to test whether \vect{X} has independent components.

Let \vect{X}_1, \ldots , \vect{X}_\sampleSize be i.i.d. random variables following the distribution of \vect{X}. Two test statistics can be defined by:

D_{\sampleSize}^{(1)}  = \sum_{i=1}^{\ell} \sum_{j=1}^{r} \dfrac{\left(N_{i,j} -
\frac{N_{i,.}N_{.,j}}{\sampleSize}\right)^2}{N_{i,j}} \\
D_{\sampleSize}^{(2)}  = \sampleSize \sum_{i=1}^{\ell} \sum_{j=1}^{r}
\dfrac{\left(N_{i,j} - \frac{N_{i,.}N_{.,j}}{\sampleSize}\right)^2}{N_{i,.}N_{.,j}}

where:

  • N_{i,j} = \sum_{k=1}^{\sampleSize}1_{X^1_k = b_i, X^2_k = c_j} is the number of pairs equal to (b_i, c_j),

  • N_{i,.}= \sum_{k=1}^{\sampleSize}1_{X^1_k = b_i} is the number of pairs whose first component is equal to b_i,

  • N_{., j}= \sum_{k=1}^{\sampleSize}1_{X^2_k = c_j} is the number of pairs whose second component is equal to c_j.

Let d_{\sampleSize}^{(i)} be the realization of the test statistic D_{\sampleSize}^{(i)} on the sample \left\{ \vect{x}_1,\dots,\vect{x}_{\sampleSize} \right\} with i=1,2.
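As a sketch, the counts N_{i,j}, the margins N_{i,.} and N_{.,j}, and the two statistics can be computed with NumPy. The sample below is hypothetical (categories encoded as integers, arbitrary sizes \ell = 3 and r = 4); it only illustrates the formulas above.

```python
import numpy as np

# Hypothetical i.i.d. sample of n pairs (X^1_k, X^2_k); categories are
# encoded as integers 0..l-1 and 0..r-1 purely for illustration.
rng = np.random.default_rng(0)
n, l, r = 1000, 3, 4
x1 = rng.integers(0, l, size=n)
x2 = rng.integers(0, r, size=n)

# Contingency counts N_{i,j}
counts = np.zeros((l, r))
for i, j in zip(x1, x2):
    counts[i, j] += 1

row = counts.sum(axis=1, keepdims=True)  # margins N_{i,.}
col = counts.sum(axis=0, keepdims=True)  # margins N_{.,j}
expected = row * col / n                 # N_{i,.} N_{.,j} / n

# D_n^(1): squared deviations divided by the observed counts
d1 = np.sum((counts - expected) ** 2 / counts)
# D_n^(2): squared deviations divided by the expected counts
# (equivalent to n * sum((N_ij - E_ij)^2 / (N_i. * N_.j)))
d2 = np.sum((counts - expected) ** 2 / expected)
```

With this uniform synthetic sample every expected count is large, so no cell of `counts` is zero; with sparse real data, cells with N_{i,j} = 0 would make D_n^{(1)} undefined, which is one reason D_n^{(2)} is the usual choice.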

Under the null hypothesis \mathcal{H}_0 = \{ \vect{X} \mbox{ has independent components}\}, the asymptotic distribution of both test statistics D_{\sampleSize}^{(i)} is known: when \sampleSize \rightarrow +\infty, they converge to the \chi^2((\ell-1)(r-1)) distribution. If \sampleSize is sufficiently large, we can therefore use this asymptotic distribution to apply the test as follows.

We fix a risk \alpha (type I error) and we evaluate the associated critical value d_\alpha^{(i)}, which is the quantile of order 1-\alpha of the distribution of D_{\sampleSize}^{(i)}.
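Since the asymptotic distribution is \chi^2((\ell-1)(r-1)), the critical value can be obtained from its quantile function. A minimal sketch using SciPy, with the same hypothetical sizes \ell = 3 and r = 4 as above:

```python
from scipy.stats import chi2

l, r = 3, 4          # hypothetical numbers of categories
alpha = 0.05         # type I error risk
dof = (l - 1) * (r - 1)

# Critical value: quantile of order 1 - alpha of chi^2((l-1)(r-1))
d_alpha = chi2.ppf(1 - alpha, dof)
```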

The decision is then made either by comparing the test statistic d_{\sampleSize}^{(i)} to the critical value d_\alpha^{(i)}, or equivalently by evaluating the p-value of the sample, defined as \Prob{D_{\sampleSize}^{(i)} > d_{\sampleSize}^{(i)}}, and comparing it to \alpha:

  • if d_{\sampleSize}^{(i)}>d_{\alpha}^{(i)} (or equivalently \Prob{D_{\sampleSize}^{(i)} > d_{\sampleSize}^{(i)}} < \alpha), then we reject the independence between the components,

  • if d_{\sampleSize}^{(i)} \leq d_{\alpha}^{(i)} (or equivalently \Prob{D_{\sampleSize}^{(i)} > d_{\sampleSize}^{(i)}} \geq \alpha), then the independence between the components is considered acceptable.
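The whole decision rule is available in one call in SciPy, whose `chi2_contingency` function computes D_{\sampleSize}^{(2)}, the degrees of freedom (\ell-1)(r-1), and the p-value from a table of observed counts. A sketch on a hypothetical 3 x 4 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x4 table of observed counts N_{i,j}
table = np.array([[30, 20, 25, 25],
                  [25, 25, 25, 25],
                  [20, 30, 25, 25]])

# Pearson's statistic D_n^(2), its p-value, the degrees of freedom
# (l-1)(r-1), and the expected counts N_{i,.} N_{.,j} / n.
# correction=False disables the Yates continuity correction.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)

alpha = 0.05
reject = p_value < alpha  # True => reject independence at risk alpha
```

Here every expected count is 25, so the statistic is \sum (N_{i,j} - 25)^2 / 25 = 4 with 6 degrees of freedom, and independence is not rejected at \alpha = 0.05.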