Cross validation assessment of PC models

Once a polynomial response surface \widehat{h}(\underline{x}) of the original numerical model h(\underline{x}) has been built up, it is crucial to estimate the approximation error, i.e. the discrepancy between the response surface and the true model response in terms of a suitable norm such as the L_2-norm:

Err \, \, = \, \, \left\| h(\underline{x}) \; - \; \widehat{h}(\underline{x}) \right\|_{L_2}^2\, \, = \, \, \int_{\cD} \; \left( h(\underline{x}) \; - \; \widehat{h}(\underline{x}) \right)^2  \; d\underline{x}

where \cD denotes the support of the input parameters \underline{x}. It is worth emphasizing that one tends to overestimate the performance of a response surface by training and evaluating it on the same data set. For instance, the model might fail to predict on new data whereas the validation on the training data yields satisfactory performance. To avoid such issue, it is important that the model error assessment is conducted on a data set which is independent of the training sample. However, any new model evaluation may be time- and memory-consuming. Therefore, error estimates which are only based on already performed computer experiments are of interest. In this context, the so-called cross validation techniques are utilized to obtain reliable performance assessment of the response surface without additional model evaluations.

Any cross-validation scheme consists in dividing the data sample (i.e. the experimental design) into two sub-samples that are independently and identically distributed. A metamodel \widehat{h}(\underline{x}) is built from one sub-sample, i.e. the training set, and its performance is assessed by comparing its predictions to the other subset, i.e. the test set. A single split will lead to a validation estimate. When several splits are conducted, the cross-validation error estimate is obtained by averaging over the splits.

K-fold cross-validation error estimate

The K-fold cross-validation technique relies on splitting the data set \cX into K mutually exclusive sub-samples \cX_1, \dots, \cX_K of approximately equal size. A sub-sample \cX_i, i \in \{ 1, \dots, K\} is set aside, then the response surface \widehat{h}^{(-\cX_i)} is built on the remaining sub-samples \cX \setminus \cX_i. The approximation error is estimated on the set-aside sample \cX_i:

Err^{(i)}  = \dfrac{1}{ |\cX_i|}  \sum\limits_{\underline{x}^{(j)} \in \cX_i} \left[ h(\underline{x}^{(j)}) - \widehat{h}^{(-\cX_i)} {(\underline{x}^{(j)})} \right]^2

in which h(\underline{x}^{(j)}) - \widehat{h}^{(-\cX_i)} {(\underline{x}^{(j)})} is the predicted residual defined as the difference between the evaluation of h(\cdot) and the prediction with \widehat{h}^{(-\cX_i)}(\cdot) at \underline{x}^{(j)} in the sub-sample \cX_i whose cardinality is |\cX_i|.

The approximation errors Err^{(i)} are estimated with each of the sub-samples \cX_i, i \in \{ 1, \dots, K\} being used as the validation set whereas the remaining sub-samples \cX \setminus \cX_i being used for training. Finally the K-fold cross-validation error estimate is obtained as the average:

Err_{Kfold} = \dfrac{1}{K} \sum\limits_{i=1}^{K} Err^{(i)}

As described above, the K-fold error estimate can be obtained with a single split of the data \cX into K folds. It is worth noting that one can repeat the cross-validation multiple times using different divisions into folds to obtain better Monte Carlo estimate. This comes obviously with an additional computational cost.

Classical leave-one-out error estimate

The leave-one-out (LOO) cross-validation is a special case of K-fold cross-validation where the number of folds K is chosen equal to the cardinality N of the experimental design \cX. It is recalled that a P-term polynomial approximation of the model response reads:

\widehat{h}(\underline{x}) \, \, = \, \,  \sum_{j=0}^{P-1} \; \widehat{a}_{j} \; \psi_{j}(\underline{x})

where the \widehat{a}_{j}’s are estimates of the coefficients obtained by a specific method, e.g. least squares.

Let us denote by \widehat{h}^{(-i)} the approximation that has been built from the experimental design \cX \setminus \{\underline{x}^{(i)}\} with the i-th observation \underline{x}^{(i)} being set aside. The predicted residual is defined as the difference between the model evaluation at \underline{x}^{(i)} and its prediction based on \widehat{h}^{(-i)}:

(1)\Delta^{(i)} \, \, = \, \,  h(\underline{x}^{(i)}) - \widehat{h}^{(-i)}(\underline{x}^{(i)})

By repeating this process for all observations in the experimental design, one obtains the predicted residuals \Delta^{(i)}, i = 1, \dots , N. Finally, the LOO error is estimated as follows:

(2)Err_{LOO} \, \, = \, \, \frac{1}{N} \sum_{i=1}^{N} {\Delta^{(i)}}^{2}

Due to the linear-in-parameters form of the polynomial chaos expansion, the quantity Err_{LOO} may be computed without performing further regression calculations when the PC coefficients have been estimated using the entire experimental design \cX. Indeed, the predicted residuals can be obtained analytically as follows:

(3)\Delta^{(i)} \, = \,
     \frac{h(\underline{x}^{(i)}) - \widehat{h}(\underline{x}^{(i)})}{1 - h_i}

where h_i is the i-th diagonal term of the matrix \underline{\underline{\Psi}} (\underline{\underline{\Psi}}^{\textsf{T}}\underline{\underline{\Psi}})^{-1} \underline{\underline{\Psi}}^{\textsf{T}} with \underline{\underline{\Psi}} being the information matrix:

(4)\underline{\underline{\Psi}} \, \, = \, \, \left\{ \psi_{j}(\underline{x}^{(i)}) \; , \; i=1,\dots,N \; , \; j = 0,\dots,P-1 \right\}

In practice, one often computes the following normalized LOO error:

(5)\varepsilon_{LOO} \, \, \equiv \, \, \frac{Err_{LOO}}{\hat{\Cov{\cY}}}

where \hat{\Cov{\mathcal{Y}}} denotes the empirical covariance of the response sample \cY:

(6)\hat{\Cov{\mathcal{Y}}} \, \, \equiv \, \, \frac{1}{N-1} \; \sum_{i=1}^{N} \; \left( y^{(i)} \; - \;   \bar{\mathcal{Y}}  \right)^{2} \quad  \quad , \quad \quad   \bar{\mathcal{Y}} \, \, \equiv \, \, \frac{1}{N} \; \sum_{i=1}^{N} \; y^{(i)}

Corrected leave-one-out error estimate

A penalized variant of \varepsilon_{LOO} may be used in order to increase its robustness with respect to overfitting, i.e. to penalize a large number of terms in the PC expansion compared to the size of the experimental design:

\varepsilon_{LOO}^{*} \, \, \equiv \, \, \varepsilon_{LOO} \, T(P,N)

The penalty factor is defined by:

T(P,N) \, \, = \, \,   \frac{N}{N-P}  \; \left(1 \; + \; \frac{\mbox{tr} \left( \underline{\underline{C}}_{emp}^{-1}  \right) }{N} \right)

where:

(7)\underline{\underline{C}}_{emp} \, \, \equiv \, \, \frac{1}{N}\underline{\underline{\Psi}}^{\textsf{T}}\; \underline{\underline{\Psi}} \quad \quad , \quad \quad
  \underline{\underline{\Psi}} \, \, = \, \, \left\{ \psi_{j}(\underline{x}^{(i)}) \, \, , \, \, i=1,\dots,N \, \, , \, \, j=0,\dots,P-1 \right\}

Leave-one-out cross validation is also known as jackknife in statistics.

References: