Kolmogorov-Smirnov two samples test
Let $X$ be a scalar uncertain variable modeled as a random variable. This method deals with the construction of a dataset prior to the choice of a probability distribution for $X$. This statistical test is used to compare two samples $\{x_1, \ldots, x_N\}$ and $\{x'_1, \ldots, x'_M\}$; the goal is to determine whether these two samples come from the same probability distribution. If they do, the two samples can be aggregated in order to increase the robustness of further statistical analysis.
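The test relies on the maximum distance between the empirical cumulative distribution functions $\hat{F}_N$ and $\hat{F}'_M$ of the samples $\{x_1, \ldots, x_N\}$ and $\{x'_1, \ldots, x'_M\}$. This distance is expressed as follows:
$$\hat{D}_{M,N} = \sup_x \left| \hat{F}_N(x) - \hat{F}'_M(x) \right|$$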
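As an illustration, the distance between the two empirical cumulative distribution functions can be computed directly from the samples. The following is a minimal sketch using only the Python standard library; the function name `ks_distance` is hypothetical and is not part of any library API.

```python
from bisect import bisect_right


def ks_distance(sample_x, sample_y):
    """Largest gap between the empirical CDFs of two scalar samples."""
    xs = sorted(sample_x)
    ys = sorted(sample_y)
    n, m = len(xs), len(ys)
    # Both empirical CDFs are right-continuous step functions that only jump
    # at observed values, so the supremum is reached at a pooled data point.
    distance = 0.0
    for t in xs + ys:
        f_n = bisect_right(xs, t) / n  # fraction of sample_x that is <= t
        f_m = bisect_right(ys, t) / m  # fraction of sample_y that is <= t
        distance = max(distance, abs(f_n - f_m))
    return distance


d = ks_distance([0.1, 0.4, 0.7, 0.9], [0.2, 0.5, 0.6, 1.3, 1.8])
print(d)
```

Two identical samples give a distance of 0, and two samples with no overlap give a distance of 1.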
The probability distribution of the distance $\hat{D}_{M,N}$ is asymptotically known (i.e. as the size of the samples tends to infinity). If $M$ and $N$ are sufficiently large, this means that for a probability $\alpha$, one can calculate the threshold (critical value) $d_\alpha$ such that:
if $\hat{D}_{M,N} > d_\alpha$, we conclude that the two samples are not identically distributed, with a risk of error $\alpha$,
if $\hat{D}_{M,N} \leq d_\alpha$, it is reasonable to say that both samples arise from the same distribution.
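The decision rule above can be sketched with a commonly used closed-form approximation of the critical value, obtained from the leading term of the asymptotic law: the threshold is roughly $\sqrt{-\tfrac{1}{2}\ln(\alpha/2)}\,\sqrt{(N+M)/(NM)}$. This is an approximation valid for large samples, not necessarily the computation performed by any particular library; the function names are hypothetical.

```python
import math


def ks_critical_value(n, m, alpha):
    """Approximate large-sample threshold for the two-sample KS distance."""
    # Leading term of the asymptotic law:
    # P(distance > d) ~ 2 * exp(-2 * d**2 * n * m / (n + m))
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


def reject_same_distribution(distance, n, m, alpha=0.05):
    """True if the samples are declared not identically distributed."""
    return distance > ks_critical_value(n, m, alpha)


# With two samples of size 100 and a risk of 5%, the threshold is about 0.192.
print(ks_critical_value(100, 100, 0.05))
print(reject_same_distribution(0.25, 100, 100))  # distance above the threshold
print(reject_same_distribution(0.10, 100, 100))  # distance below the threshold
```

The threshold shrinks as the sample sizes grow, so a given distance becomes stronger evidence against a common distribution when more data are available.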
An important notion is the so-called “$p$-value” of the test. This quantity is equal to the limit error probability $\alpha_\mathrm{lim}$ under which the “identically-distributed” hypothesis is rejected. Thus, the two samples will be supposed identically distributed if and only if $\alpha_\mathrm{lim}$ is greater than the value $\alpha$ desired by the user. Note that the higher $\alpha_\mathrm{lim} - \alpha$, the more robust the decision.
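For large samples, the p-value can be approximated from the limiting Kolmogorov distribution, whose survival function is the alternating series $2\sum_{k \geq 1} (-1)^{k-1} e^{-2k^2\lambda^2}$ evaluated at the observed distance scaled by $\sqrt{NM/(N+M)}$. The sketch below assumes this asymptotic regime; the function name, the truncation at 100 terms, and the small-argument cutoff are illustrative choices, and libraries may use exact or corrected formulas for small samples.

```python
import math


def ks_asymptotic_pvalue(distance, n, m, terms=100):
    """Approximate p-value of the two-sample KS test for large n and m."""
    lam = math.sqrt(n * m / (n + m)) * distance
    # Below this cutoff the survival function equals 1 to within about 1e-12,
    # and the truncated series is avoided where it converges slowly.
    if lam < 0.2:
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


alpha = 0.05
p_value = ks_asymptotic_pvalue(0.10, 100, 100)
# The samples are supposed identically distributed when p_value > alpha.
print(p_value, p_value > alpha)
```

A larger observed distance gives a smaller p-value, so the margin between the p-value and the chosen risk indicates how comfortably the decision was reached.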
This test is also referred to as the two-sample Kolmogorov-Smirnov test.