Graphical goodness-of-fit tests

This page gathers graphical tools to check whether a given sample is drawn from a given continuous distribution of dimension 1.

We denote by \left\{ x_1,\ldots,x_{\sampleSize} \right\} a sample of \sampleSize independent realizations of the random variable X of dimension 1. Let F be a continuous cumulative distribution function.

We want to validate whether X follows the distribution characterized by F.

QQ-plot

The Quantile-Quantile plot (QQ-plot) compares the quantiles of the tested distribution with the empirical quantiles of the sample. Let q_{X}(\alpha) be the quantile of order \alpha of the distribution F, with \alpha \in (0, 1). It is defined by:

\begin{aligned}
    q_{X}(\alpha) = \inf \{ x \in \Rset \, |\, F(x) \geq \alpha \}
  \end{aligned}

The empirical quantile of order \alpha built on the sample is defined by:

\begin{aligned}
        \widehat{q}_{X}(\alpha) = x_{([\sampleSize \alpha]+1)}
\end{aligned}

where [\sampleSize\alpha] denotes the integer part of \sampleSize \alpha and \left\{ x_{(1)},\ldots,x_{(\sampleSize)} \right\} is the sample sorted in ascending order:

x_{(1)} \leq \dots \leq x_{(\sampleSize)}

Thus, the j^\textrm{th} smallest value of the sample x_{(j)} is an estimate \widehat{q}_{X}(\alpha) of the \alpha-quantile where \alpha = (j-1)/\sampleSize, for 1 < j \leq \sampleSize.
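The definition of the empirical quantile above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `empirical_quantile` and the sample values are hypothetical and not part of any library API.

```python
import math


def empirical_quantile(sample, alpha):
    """Return x_([n*alpha]+1), the empirical quantile of order alpha in (0, 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(sample)          # x_(1) <= ... <= x_(n)
    n = len(ordered)
    rank = math.floor(n * alpha) + 1  # 1-based rank [n*alpha] + 1
    return ordered[rank - 1]          # convert to a 0-based index


# Hypothetical data, chosen only for illustration.
sample = [2.3, 0.7, 1.9, 3.1, 1.2]
q = empirical_quantile(sample, 0.5)   # [5 * 0.5] + 1 = 3, so q is x_(3)
```

Here the sorted sample is 0.7, 1.2, 1.9, 2.3, 3.1, so the empirical median is the third value, 1.9.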

The QQ-plot draws the pairs (x_{(j)}, q_{X}\left(\dfrac{j-1}{\sampleSize}\right))_{1 < j \leq \sampleSize}. If X follows the distribution F, then the points should be close to the diagonal.
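As a minimal sketch of how the QQ-plot points can be computed, the example below tests a sample against an exponential distribution, whose quantile function has the closed form -\ln(1-\alpha)/\lambda. The helper names `exponential_quantile` and `qq_points`, the rate and the seed are assumptions made for illustration; only the Python standard library is used.

```python
import math
import random


def exponential_quantile(alpha, rate=1.0):
    """Quantile of order alpha of the exponential distribution with the given rate."""
    return -math.log(1.0 - alpha) / rate


def qq_points(sample, quantile):
    """Return the pairs (x_(j), q_X((j-1)/n)) for 1 < j <= n."""
    ordered = sorted(sample)
    n = len(ordered)
    # ordered[j - 1] is x_(j); j starts at 2 so that (j-1)/n lies in (0, 1).
    return [(ordered[j - 1], quantile((j - 1) / n)) for j in range(2, n + 1)]


random.seed(0)  # arbitrary seed, for reproducibility only
sample = [random.expovariate(1.0) for _ in range(50)]
points = qq_points(sample, exponential_quantile)
```

Plotting `points` together with the line y = x gives the QQ-plot: the closer the points are to that line, the more plausible the tested distribution.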

The following figure illustrates a QQ-plot with a sample of size \sampleSize=50. In this example, the points remain close to the diagonal and the hypothesis “F is the cumulative distribution function of X” does not seem false, even if a more quantitative analysis should be carried out to confirm this.

../../_images/graphical_fitting_test-1.png

In this second example, the tested continuous distribution clearly does not fit the sample.

../../_images/graphical_fitting_test-2.png

Normal probability plot (Henry’s line)

This test is dedicated to the normal distribution.

The following result is used in the test: if X follows the \cN(\mu,\sigma) distribution, then (X-\mu) / \sigma follows the \cN(0,1) distribution. Furthermore, let q_{\cN(\mu,\sigma)}(\alpha) be the quantile of order \alpha of \cN(\mu,\sigma) and let q_{\cN(0,1)}(\alpha) be the quantile of order \alpha of \cN(0,1). Then we have the relation:

q_{\cN(0,1)}(\alpha) = \dfrac{q_{\cN(\mu,\sigma)}(\alpha) - \mu}{\sigma}

Henry’s line then draws the QQ-plot built from the empirical quantiles of order \dfrac{j-1}{\sampleSize} and the quantiles of the same order of the \cN(0,1) distribution. If the sample comes from the \cN(\mu,\sigma) distribution, then the points should be close to the line of equation y = \dfrac{x-\mu}{\sigma}.
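The construction can be illustrated with a short standard-library sketch. The helper names `henry_points` and `least_squares_line`, the parameter values and the seed are assumptions chosen for the example. A least-squares line is fitted through the points only to show that its slope and intercept are close to 1/\sigma and -\mu/\sigma; this fit is an addition for illustration and not part of the graphical test itself.

```python
import random
from statistics import NormalDist


def henry_points(sample):
    """Return the pairs (x_(j), q_N(0,1)((j-1)/n)) for 1 < j <= n."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist(0.0, 1.0)
    return [
        (ordered[j - 1], standard_normal.inv_cdf((j - 1) / n))
        for j in range(2, n + 1)
    ]


def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


random.seed(1)        # arbitrary seed, for reproducibility only
mu, sigma = 3.0, 2.0  # hypothetical parameters of the sampled distribution
sample = [random.gauss(mu, sigma) for _ in range(2000)]
slope, intercept = least_squares_line(henry_points(sample))
# For a normal sample, slope is close to 1/sigma and intercept to -mu/sigma.
```

For a sample that is not normal, the points bend away from any straight line, typically in the tails.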

The following figure illustrates Henry’s line with a sample of size \sampleSize=50. In this example, the points remain close to a line and the hypothesis “X follows a normal distribution” does not seem false, even if a more quantitative analysis should be carried out to confirm this.

../../_images/graphical_fitting_test-3.png

In this second example, the hypothesis of a normal distribution seems far less plausible because of the behavior for small values of X.

../../_images/graphical_fitting_test-4.png

Kendall plot

In the bivariate case, the Kendall Plot test allows one to validate whether a sample is drawn from a given copula or to check whether two samples share the same copula.

Let \inputRV = (X_1, X_2) be a bivariate random vector with the copula C and the marginal cumulative distribution functions (F_1, F_2). Let (U_1, U_2) = (F_1(X_1), F_2(X_2)) be the random vector with \cU(0,1) marginal distributions and C copula.

Let (\inputReal_i)_{1 \leq i \leq \sampleSize} be a sample drawn from \inputRV. We build the rank sample defined by (\vect{u}_i)_{1 \leq i \leq \sampleSize} where \vect{u}_i =(F_1(x_{1,i}), F_2(x_{2,i})).

We define:

H = C(U,V)

where (U,V) is a bivariate random vector with \cU(0,1) marginal distributions and C copula. We denote by K_0 the cumulative distribution function of H.

We can get a sample of H denoted by (h_i)_{1 \leq i \leq \sampleSize} from the sample (\vect{u}_i)_{1 \leq i \leq \sampleSize} as follows:

\begin{aligned}
h_i & = C(u_{1,i}, u_{2,i}) \\
    & =  \Prob{F_1(X_1) \leq u_{1,i}, F_2(X_2) \leq u_{2,i}}\\
    & = F_{(U_1, U_2)}(u_{1,i}, u_{2,i}) \\
    & \approx \widehat{F}_{(U_1, U_2)}(u_{1,i}, u_{2,i})
\end{aligned}

where \widehat{F}_{(U_1, U_2)} is the empirical cumulative distribution function of the sample (\vect{u}_i)_{1 \leq i \leq \sampleSize}. Then, we have, for all 1 \leq i \leq \sampleSize:

\widehat{h}_i = \frac{1}{\sampleSize-1} \mathrm{Card}
\left\{  j \in \{1,\ldots,\sampleSize\}, \; j  \neq i \, | \, x_{1,j} \leq x_{1,i} \mbox{ and } x_{2,j} \leq x_{2,i}  \right \}
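The counting formula above translates directly into code. The following sketch uses a hypothetical function name, `kendall_pseudo_observations`, and a tiny made-up sample. It runs in O(\sampleSize^2) operations, which is acceptable for illustration but not tuned for large samples.

```python
def kendall_pseudo_observations(sample):
    """Return the values h_hat_i for a bivariate sample given as (x1, x2) pairs."""
    n = len(sample)
    h_hat = []
    for i, (x1_i, x2_i) in enumerate(sample):
        # Count the other points that are below point i in both components.
        count = sum(
            1
            for j, (x1_j, x2_j) in enumerate(sample)
            if j != i and x1_j <= x1_i and x2_j <= x2_i
        )
        h_hat.append(count / (n - 1))
    return h_hat


# Hypothetical data, chosen only for illustration.
sample = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 4.0)]
h_hat = kendall_pseudo_observations(sample)
```

In this example no point lies below (1, 1), exactly one point lies below each of (2, 3) and (3, 2), and all three other points lie below (4, 4), so the values are 0, 1/3, 1/3 and 1.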

From the sample (h_i)_{1 \leq i \leq \sampleSize}, we build the ordered sample (h_{(i)})_{1 \leq i \leq \sampleSize}.

Let (H_{(1)}, \dots, H_{(\sampleSize)}) be the order statistics of (H_1, \dots, H_{\sampleSize}). Then we know that the cumulative distribution function of H_{(i)} is the composition of the cumulative distribution function of the Beta(i, \sampleSize-i+1) distribution with the cumulative distribution function K_0 of H:

F_{H_{(i)}} = F_{Beta(i, \sampleSize-i+1)} \circ K_0

Let w_i be the statistic defined by:

w_i = \Expect{H_{(i)}}

Thus we have:

w_i = \sampleSize C_{\sampleSize-1}^{i-1} \int_0^1 t K_0(t)^{i-1} (1-K_0(t))^{\sampleSize-i} \, dK_0(t) \qquad (1)
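As a sketch of where equation (1) comes from, assume that K_0 is continuous and write f_{Beta(i, \sampleSize-i+1)} for the probability density function of the Beta distribution (a notation introduced here only for this remark). The i^\textrm{th} order statistic of \sampleSize independent \cU(0,1) variables follows the Beta(i, \sampleSize-i+1) distribution, so that:

\begin{aligned}
f_{Beta(i, \sampleSize-i+1)}(u) & = \sampleSize C_{\sampleSize-1}^{i-1} u^{i-1} (1-u)^{\sampleSize-i}, \quad u \in [0,1] \\
w_i & = \int_0^1 t \, dF_{H_{(i)}}(t) \\
    & = \int_0^1 t \, f_{Beta(i, \sampleSize-i+1)}(K_0(t)) \, dK_0(t)
\end{aligned}

Substituting the density into the last integral gives equation (1).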

For a given copula C, equation (1) is evaluated by Monte Carlo sampling: we generate N samples of size \sampleSize of the variable H = C(U,V), which gives N realizations of each statistic H_{(i)}, 1 \leq i \leq \sampleSize. Each w_i is then estimated by the empirical mean of the realizations of H_{(i)}.
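The Monte Carlo procedure can be sketched for the independent copula, for which C(u,v) = uv, so that H = UV with U and V independent and uniform on (0,1). The function name `kendall_w_independence`, the sample size, the number of replications and the seed are assumptions made for the example; for another copula, the line that draws H would have to sample (U,V) from that copula and evaluate C.

```python
import random


def kendall_w_independence(n, n_replications, rng):
    """Estimate w_i = E[H_(i)], i = 1..n, for the independent copula."""
    sums = [0.0] * n
    for _ in range(n_replications):
        # One sample of size n of H = C(U, V) = U * V, sorted to get H_(1..n).
        h_sorted = sorted(rng.random() * rng.random() for _ in range(n))
        for i, value in enumerate(h_sorted):
            sums[i] += value
    # Empirical mean of each order statistic over the replications.
    return [s / n_replications for s in sums]


rng = random.Random(42)  # arbitrary seed, for reproducibility only
w = kendall_w_independence(n=20, n_replications=2000, rng=rng)
```

The estimates are nondecreasing in i by construction, and their average is close to the mean of UV, which is 1/4.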

The Kendall Plot draws the points (w_i, h_{(i)})_{1 \leq i \leq \sampleSize}. If the points are close to the first diagonal, the copula C is validated. In particular, we can use the Kendall plot to test the independence between X_1 and X_2 by using the independent copula to calculate the values (w_i)_{1 \leq i \leq \sampleSize}.

To test whether two samples share the same copula, the Kendall Plot test draws the points (h^1_{(i)}, h^2_{(i)})_{1 \leq i \leq \sampleSize}, where h^1_{(i)} and h^2_{(i)} are computed from the first and the second sample respectively. Note that the two samples must have the same size.
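A minimal sketch of the two-sample version is given below, under the assumption that the values h_{(i)} of each sample are obtained by the same counting estimator as in the one-sample case. The function names `sorted_h_hat` and `two_sample_kendall_points`, the marginal distributions and the seed are hypothetical choices for illustration.

```python
import random


def sorted_h_hat(sample):
    """Return the ordered values h_(1) <= ... <= h_(n) of a bivariate sample."""
    n = len(sample)
    return sorted(
        sum(
            1
            for j, (a, b) in enumerate(sample)
            if j != i and a <= x1 and b <= x2
        ) / (n - 1)
        for i, (x1, x2) in enumerate(sample)
    )


def two_sample_kendall_points(sample_1, sample_2):
    """Return the points (h1_(i), h2_(i)) of the two-sample Kendall plot."""
    if len(sample_1) != len(sample_2):
        raise ValueError("the two samples must have the same size")
    return list(zip(sorted_h_hat(sample_1), sorted_h_hat(sample_2)))


rng = random.Random(7)  # arbitrary seed, for reproducibility only
# Both samples have independent components, so they share the same copula
# even though their marginal distributions differ.
sample_1 = [(rng.random(), rng.random()) for _ in range(100)]
sample_2 = [(rng.gauss(0.0, 1.0), rng.expovariate(1.0)) for _ in range(100)]
points = two_sample_kendall_points(sample_1, sample_2)
```

Since the values h_{(i)} depend only on the ranks within each sample, the differing marginals do not matter and the points are expected to stay near the first diagonal here.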

In the first example, the Kendall Plot test validates the use of the Frank copula for the given sample.

../../_images/graphical_fitting_test-5.png

In the second example, the Kendall Plot test invalidates the use of the Frank copula for the given sample.

../../_images/graphical_fitting_test-6.png

Remark: To test a sample against a specific copula when the sample size is greater than 500, we recommend using the second form of the Kendall plot test: generate a sample of the same size from your copula and then test both samples. Testing this way is more efficient.