Chi-squared test

The $\chi^2$ test is a statistical test of whether a given sample of data is drawn from a given discrete distribution. The library only provides the $\chi^2$ test for distributions of dimension 1.

We denote by $\left\{ x_1,\dots,x_{\sampleSize} \right\}$ a sample of dimension 1. Let $F$ be the (unknown) cumulative distribution function of the distribution from which the sample is drawn. We want to test whether the sample is drawn from the discrete distribution characterized by the probabilities $\left\{ p(x;\vect{\theta}) \right\}_{x \in \cE}$, where $\vect{\theta}$ is the set of parameters of the distribution and $\cE$ its support. Let $G$ be the cumulative distribution function of this candidate distribution.

The test relies on a test statistic that measures the distance between the observed number of sample values equal to $x$ and the expected number of such values under the candidate distribution, accumulated over all $x$ in the support.

Let $X_1, \ldots , X_{\sampleSize}$ be i.i.d. random variables following the distribution with CDF $F$. Under the tested distribution $G$, the expected number of values equal to $x$ is $\sampleSize p(x;\vect{\theta})$, whereas the number observed among $X_1, \ldots , X_{\sampleSize}$ is $N(x) = \sum_{i=1}^{\sampleSize} 1_{X_i=x}$. The test statistic is then defined by:

$D_{\sampleSize} = \sum_{x \in \cE} \frac{\left[\sampleSize p(x)-N(x)\right]^2}{N(x)}.$
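As an illustration, the statistic above can be evaluated by hand from a sample and a table of candidate probabilities. The following is a minimal sketch in plain Python, not the library's implementation: the function name `chi2_statistic`, the dictionary representation of the probabilities and the die-rolling data are all assumptions made for the example. It follows the formula as written here, with the observed count $N(x)$ in the denominator, so it refuses any support value that was never observed.

```python
from collections import Counter


def chi2_statistic(sample, pmf):
    """Evaluate sum over x of [n p(x) - N(x)]^2 / N(x).

    `pmf` is a dict mapping each support value x to its candidate
    probability p(x) (hypothetical representation chosen for this sketch).
    """
    n = len(sample)
    counts = Counter(sample)  # N(x): number of sample values equal to x
    total = 0.0
    for x, p in pmf.items():
        observed = counts.get(x, 0)
        if observed == 0:
            # The formula divides by N(x): unobserved values must be grouped first.
            raise ValueError(f"value {x!r} not observed; gather values in classes")
        expected = n * p  # expected number of values equal to x
        total += (expected - observed) ** 2 / observed
    return total


# Hypothetical data: 60 rolls of a die, tested against the uniform distribution.
rolls = [1] * 8 + [2] * 12 + [3] * 10 + [4] * 9 + [5] * 11 + [6] * 10
fair_die = {face: 1.0 / 6.0 for face in range(1, 7)}
d_n = chi2_statistic(rolls, fair_die)
print(d_n)
```

Each face is expected 10 times here, so only the faces whose observed count differs from 10 contribute to the sum.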

If some values of $x$ have not been observed in the sample, we have to gather values in classes so that they contain at least 5 data points (empirical rule). Then the theoretical probabilities of all the values in the class are added to get the theoretical probability of the class.
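One possible way to gather values in classes is sketched below. The text does not specify how classes are built, so the strategy used here is an assumption: adjacent support values are merged in increasing order until a class holds at least 5 data points, and any leftover values are merged into the last class. The name `group_into_classes` and the finite dictionary of probabilities are hypothetical.

```python
from collections import Counter


def group_into_classes(sample, pmf, min_count=5):
    """Return a list of (values, observed_count, probability) classes.

    Assumed strategy (not taken from the library): walk through the support
    in increasing order and close a class once it holds `min_count` points.
    """
    counts = Counter(sample)
    classes = []
    values, observed, prob = [], 0, 0.0
    for x in sorted(pmf):
        values.append(x)
        observed += counts.get(x, 0)
        prob += pmf[x]  # class probability = sum of the member probabilities
        if observed >= min_count:
            classes.append((values, observed, prob))
            values, observed, prob = [], 0, 0.0
    if values:
        if classes:
            # Leftover values do not reach min_count: merge them into the last class.
            last_values, last_observed, last_prob = classes.pop()
            classes.append(
                (last_values + values, last_observed + observed, last_prob + prob)
            )
        else:
            classes.append((values, observed, prob))
    return classes


# Hypothetical data: 20 points on the support {0, 1, 2, 3, 4}; the value 4 is never observed.
pmf = {0: 0.40, 1: 0.30, 2: 0.15, 3: 0.10, 4: 0.05}
sample = [0] * 9 + [1] * 6 + [2] * 3 + [3] * 2
classes = group_into_classes(sample, pmf)
for values, observed, prob in classes:
    print(values, observed, round(prob, 3))
```

The test statistic is then computed over the classes instead of the individual values, using the class counts and class probabilities.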

Let $d_{\sampleSize}$ be the realization of the test statistic $D_{\sampleSize}$ on the sample $\left\{ x_1,\dots,x_{\sampleSize} \right\}$. Under the null hypothesis $\mathcal{H}_0 = \{ G = F\}$, the distribution of the test statistic $D_{\sampleSize}$ is known asymptotically: it is the $\chi^2(J-1)$ distribution, where $J$ is the number of distinct values in the support of $G$ (or the number of classes when values have been gathered). We apply the test as follows.

We fix a risk $\alpha$ (the probability of a type I error) and evaluate the associated critical value $d_\alpha$, which is the quantile of order $1-\alpha$ of $D_{\sampleSize}$. A decision is then made either by comparing the realization $d_{\sampleSize}$ to the threshold $d_\alpha$ or, equivalently, by evaluating the p-value of the sample, defined as $\Prob{D_{\sampleSize} > d_{\sampleSize}}$, and comparing it to $\alpha$:

if $d_{\sampleSize}>d_{\alpha}$ (or equivalently $\Prob{D_{\sampleSize} > d_{\sampleSize}} < \alpha$ ), then we reject $G$ ,
if $d_{\sampleSize} \leq d_{\alpha}$ (or equivalently $\Prob{D_{\sampleSize} > d_{\sampleSize}} \geq \alpha$ ), then $G$ is considered acceptable.
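The decision rule can be sketched with the p-value formulation. The code below is an illustrative, standard-library-only sketch and not the library's API: `chi2_sf` and `chi2_decision` are hypothetical helper names, and the survival function of the $\chi^2$ distribution is computed from the series expansion of the regularized lower incomplete gamma function. The value of the statistic and the number of classes are made-up inputs.

```python
import math


def chi2_sf(x, dof):
    """P(D > x) for D following the chi-squared distribution with `dof` degrees of freedom."""
    if x <= 0.0:
        return 1.0
    a = 0.5 * dof
    z = 0.5 * x
    # Series for the lower incomplete gamma function:
    # gamma(a, z) = z^a e^{-z} * sum_{k>=0} z^k / (a (a+1) ... (a+k))
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > 1e-15 * abs(total):
        k += 1
        term *= z / (a + k)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi2_decision(d_n, n_classes, alpha=0.05):
    """Return (p_value, accepted), using J - 1 degrees of freedom."""
    p_value = chi2_sf(d_n, n_classes - 1)
    # Reject G when the p-value is below alpha, otherwise consider G acceptable.
    return p_value, p_value >= alpha


# Hypothetical realizations of the statistic, with J = 6 classes.
p_small, accepted_small = chi2_decision(1.0354, 6)
p_large, accepted_large = chi2_decision(15.0, 6)
print(p_small, accepted_small)
print(p_large, accepted_large)
```

Comparing the p-value to $\alpha$ avoids computing the quantile $d_\alpha$ explicitly; both formulations lead to the same decision.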

