Using QQ-plot to compare two samples

Let $X$ be a scalar uncertain variable modeled as a random variable. This method deals with the construction of a dataset prior to the choice of a probability distribution for $X$ . A QQ-plot (where “QQ” stands for “quantile-quantile”) is a tool that may be used to compare two samples $\left\{x_1,\ldots,x_N \right\}$ and $\left\{x'_1,\ldots,x'_M \right\}$ ; the goal is to determine graphically whether these two samples come from the same probability distribution or not. If this is the case, the two samples should be aggregated in order to increase the robustness of further statistical analysis.

A QQ-plot is based on the notion of quantile. The $\alpha$ -quantile $q_{X}(\alpha)$ of $X$ , where $\alpha \in (0, 1)$ , is defined as follows:

$\begin{aligned} \mathbb{P}\left( X \leq q_{X}(\alpha) \right) = \alpha \end{aligned}$
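
As a worked illustration, assume for this example only that $X$ follows the exponential distribution with rate $\lambda > 0$ . Its cumulative distribution function is $1 - e^{-\lambda x}$ for $x \geq 0$ , so solving the defining equation for the quantile gives:

$\begin{aligned} 1 - e^{-\lambda q_{X}(\alpha)} = \alpha \quad \Longleftrightarrow \quad q_{X}(\alpha) = -\frac{\ln(1-\alpha)}{\lambda} \end{aligned}$

For instance, the median ( $\alpha = 0.5$ ) is $\ln(2)/\lambda$ .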

If a sample $\left\{x_1,\ldots,x_N \right\}$ of $X$ is available, the quantile can be estimated empirically:

the sample $\left\{x_1,\ldots,x_N \right\}$ is first placed in ascending order, which gives the sample $\left\{ x_{(1)},\ldots,x_{(N)} \right\}$ ;
then, an estimate of the $\alpha$ -quantile is:

$\begin{aligned} \widehat{q}_{X}(\alpha) = x_{([N\alpha]+1)} \end{aligned}$

where $[N\alpha]$ denotes the integer part of $N\alpha$ .
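
The estimator above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `empirical_quantile` is a hypothetical helper chosen for this example, not part of any library API.

```python
import math


def empirical_quantile(sample, alpha):
    """Estimate the alpha-quantile as the order statistic x_([N*alpha]+1).

    `empirical_quantile` is a hypothetical helper name used for illustration.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(sample)  # x_(1) <= ... <= x_(N)
    n = len(ordered)
    # The 1-based rank is [N*alpha] + 1; with Python's 0-based indexing
    # this is simply index floor(N*alpha).
    return ordered[math.floor(n * alpha)]


sample = [2.3, 0.7, 1.9, 3.1, 1.2]
print(empirical_quantile(sample, 0.5))  # 1.9, the third smallest of five values
```

Because `n * alpha` is computed in floating point, values of $\alpha$ for which $N\alpha$ is an exact integer may be sensitive to rounding; the sketch does not try to guard against that.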

Thus, the $j^\textrm{th}$ smallest value of the sample, $x_{(j)}$ , is an estimate $\widehat{q}_{X}(\alpha)$ of the $\alpha$ -quantile for $\alpha = (j-1)/N$ ( $1 < j \leq N$ ). Now consider the second sample $\left\{x'_1,\ldots,x'_M \right\}$ , also placed in ascending order; it provides another estimate $\widehat{q}'_{X}(\alpha)$ of the same quantile:

$\begin{aligned} \widehat{q}'_{X}(\alpha) = x'_{([M\times(j-1)/N]+1)} \end{aligned}$

If both samples correspond to the same probability distribution, then $\widehat{q}_{X}(\alpha)$ and $\widehat{q}'_{X}(\alpha)$ should be close. Thus, graphically, the points $\left\{ \left( \widehat{q}_{X}(\alpha),\widehat{q}'_{X}(\alpha)\right),\ \alpha = (j-1)/N,\ 1 < j \leq N \right\}$ should be close to the diagonal.
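
The construction of these points can be sketched as follows. This is a minimal standard-library illustration, and `qq_points` is a hypothetical helper name introduced for this example. Integer floor division is used for $[M\times(j-1)/N]$ to avoid floating-point rounding.

```python
def qq_points(sample_x, sample_y):
    """Return the QQ-plot points (q_hat(alpha), q_hat'(alpha)) for
    alpha = (j - 1) / N, 1 < j <= N, where N = len(sample_x).

    `qq_points` is a hypothetical helper name used for illustration.
    """
    xs = sorted(sample_x)  # x_(1) <= ... <= x_(N)
    ys = sorted(sample_y)  # x'_(1) <= ... <= x'_(M)
    n, m = len(xs), len(ys)
    points = []
    for j in range(2, n + 1):
        # 1-based rank in the second sample: [M * (j - 1) / N] + 1
        k = (m * (j - 1)) // n + 1
        # Convert both 1-based ranks to 0-based list indices
        points.append((xs[j - 1], ys[k - 1]))
    return points


print(qq_points([4, 1, 3, 2], [3.5, 1.5]))
# [(2, 1.5), (3, 3.5), (4, 3.5)]
```

Plotting the first coordinate against the second, together with the line $y = x$ , gives the QQ-plot. When the two samples are identical, every point lies exactly on the diagonal.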

The following figure illustrates the principle of a QQ-plot with two samples of size $M=50$ and $N=50$ . Note that both axes are expressed in the unit of the variable $X$ studied. In this example, the points remain close to the diagonal, so the hypothesis “the two samples come from the same distribution” seems plausible, although a more quantitative analysis should be carried out to confirm this.
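
One possible quantitative complement, which the text above does not prescribe, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The sketch below computes only the statistic, not a p-value; `ks_two_sample_statistic` is a hypothetical helper name chosen for this example.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_x, sample_y):
    """Return sup |F_N(v) - F'_M(v)| over all v, where F_N and F'_M are
    the empirical CDFs of the two samples.

    `ks_two_sample_statistic` is a hypothetical helper name used for illustration.
    """
    xs = sorted(sample_x)
    ys = sorted(sample_y)
    n, m = len(xs), len(ys)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at data
    # points, so it is enough to evaluate the gap at every observed value.
    for v in xs + ys:
        f_x = bisect_right(xs, v) / n  # fraction of xs that are <= v
        f_y = bisect_right(ys, v) / m  # fraction of ys that are <= v
        largest_gap = max(largest_gap, abs(f_x - f_y))
    return largest_gap


print(ks_two_sample_statistic([1, 2, 3], [1, 2, 3]))  # 0.0
print(ks_two_sample_statistic([1, 2, 3], [4, 5, 6]))  # 1.0
```

A value near 0 is consistent with a QQ-plot that hugs the diagonal, while a value near 1 indicates samples that barely overlap. Turning the statistic into a formal test requires its reference distribution, which is outside the scope of this sketch.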