Kolmogorov-Smirnov two-sample test

Let X be a scalar uncertain variable modeled as a random variable. This method deals with the construction of a dataset prior to the choice of a probability distribution for X. This statistical test is used to compare two samples \left\{x_1,\ldots,x_N \right\} and \left\{x'_1,\ldots,x'_M \right\}; the goal is to determine whether or not these two samples come from the same probability distribution. If they do, the two samples can be aggregated in order to increase the robustness of further statistical analysis.

The test relies on the maximum distance between the empirical cumulative distribution functions \widehat{F}_N and \widehat{F}'_M of the samples \left\{x_1,\ldots,x_N \right\} and \left\{x'_1,\ldots,x'_M \right\}. This distance is expressed as follows:

    \widehat{D}_{M,N} = \sup_x \left|\widehat{F}_N\left(x\right) - \widehat{F}'_M\left(x\right)\right|
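As an illustration, the distance above can be computed directly from the two samples. The following is a minimal sketch in plain Python, not a reference implementation; the function name `ks_two_sample_distance` is hypothetical. It relies on the fact that both empirical cumulative distribution functions are right-continuous step functions that jump only at observed values, so the supremum is reached at one of the pooled sample points.

```python
from bisect import bisect_right


def ks_two_sample_distance(xs, ys):
    """Return sup_x |F_N(x) - F'_M(x)| for two non-empty samples.

    Hypothetical helper written for illustration only.
    """
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    n, m = len(xs_sorted), len(ys_sorted)
    distance = 0.0
    # Evaluate both empirical CDFs at every pooled observation.
    for t in xs_sorted + ys_sorted:
        # bisect_right counts the observations that are <= t.
        f_n = bisect_right(xs_sorted, t) / n
        f_m = bisect_right(ys_sorted, t) / m
        distance = max(distance, abs(f_n - f_m))
    return distance
```

For instance, two samples with no overlap such as `[1, 2, 3]` and `[4, 5, 6]` give a distance of 1, while two identical samples give 0.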

Under the hypothesis that both samples come from the same distribution, the probability distribution of the distance \widehat{D}_{M,N} is known asymptotically (i.e. as the sizes of the samples tend to infinity). Hence, if M and N are sufficiently large, one can calculate, for a given probability \alpha, the threshold (critical value) d_\alpha such that:

  • if \widehat{D}_{M,N} >d_{\alpha}, we conclude that the two samples are not identically distributed, with a risk of error \alpha,

  • if \widehat{D}_{M,N} \leq d_{\alpha}, it is reasonable to say that both samples arise from the same distribution.
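The decision rule above can be sketched as follows. This is only an illustration under a stated assumption: the critical value is taken from the widely used large-sample approximation d_\alpha \approx \sqrt{-\frac{1}{2}\ln(\alpha/2)} \sqrt{(N+M)/(NM)}, which keeps only the leading term of the asymptotic distribution. That formula and the function names are not part of the text above.

```python
import math


def ks_critical_value(alpha, n, m):
    """Approximate large-sample critical value d_alpha (assumed formula)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


def reject_same_distribution(distance, alpha, n, m):
    """Return True when the observed distance exceeds d_alpha."""
    return distance > ks_critical_value(alpha, n, m)
```

With \alpha = 0.05 and N = M = 200, this approximation gives a critical value close to 0.136, so an observed distance of 0.2 would lead to rejection while 0.1 would not.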

An important notion is the so-called “p-value” of the test. This quantity is equal to the limit error probability \alpha_\textrm{lim}, i.e. the smallest risk level at which the “identically-distributed” hypothesis would be rejected given the observed distance. Thus, the two samples will be supposed identically distributed if and only if \alpha_\textrm{lim} is greater than the value \alpha desired by the user. Note that the larger the difference \alpha_\textrm{lim} - \alpha, the more robust the decision.
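A possible way to approximate this p-value for large samples is to evaluate the tail of the limiting Kolmogorov distribution, 2 \sum_{k \geq 1} (-1)^{k-1} e^{-2 k^2 \lambda^2} with \lambda = \widehat{D}_{M,N} \sqrt{NM/(N+M)}. The sketch below assumes this asymptotic form; the function name, the truncation at 100 terms and the small-\lambda cutoff are illustrative choices, and an exact small-sample computation would differ.

```python
import math


def ks_asymptotic_p_value(distance, n, m, terms=100):
    """Approximate p-value from the limiting Kolmogorov distribution.

    Hypothetical helper; only meaningful when n and m are large.
    """
    lam = distance * math.sqrt(n * m / (n + m))
    # For very small lam the tail probability equals 1 to within
    # about 1e-12, and the truncated series would converge poorly.
    if lam < 0.2:
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    # Clamp to [0, 1] to guard against rounding noise.
    return min(1.0, max(0.0, 2.0 * total))
```

The samples would then be treated as identically distributed when the returned value exceeds the user's chosen \alpha.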

This test is also referred to as the Kolmogorov-Smirnov test for two samples.