Cross validation assessment of PC models
The accuracy of a response surface $\hat{h}$ that approximates the model $h$ is measured by an error defined over $\mathcal{D}$, where $\mathcal{D}$ denotes the support of the input parameters $\boldsymbol{x}$. It is worth emphasizing that training and evaluating a response surface on the same data set tends to overestimate its performance. For instance, the model might fail to predict new data even though validation on the training data yields satisfactory performance. To avoid this issue, the model error must be assessed on a data set that is independent of the training sample. However, each new model evaluation may be time- and memory-consuming, so error estimates based only on already performed computer experiments are of interest. In this context, cross-validation techniques provide a reliable performance assessment of the response surface without additional model evaluations.
Any cross-validation scheme consists of dividing the data sample (i.e. the experimental design) into two sub-samples that are independently and identically distributed. A metamodel is built from one sub-sample, the training set, and its performance is assessed by comparing its predictions with the model outputs on the other sub-sample, the test set. A single split yields a validation estimate. When several splits are performed, the cross-validation error estimate is obtained by averaging over the splits.
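The single-split validation described above can be sketched in a few lines of NumPy. Everything specific here is an assumption made for illustration: the function `model` is a cheap stand-in for the expensive model, the metamodel is a least-squares fit on a Legendre basis of degree 5, and the 30/10 split is arbitrary.

```python
import numpy as np

def model(x):
    # Hypothetical stand-in for the expensive model h (illustration only)
    return np.exp(x) * np.sin(3.0 * x)

def design_matrix(x, degree):
    # Columns are the Legendre polynomials of degree 0..degree evaluated at x
    return np.polynomial.legendre.legvander(x, degree)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)   # experimental design
y = model(x)                          # already performed model evaluations

# Single random split: 30 training points, 10 test points
perm = rng.permutation(x.size)
train, test = perm[:30], perm[30:]

degree = 5
coef, *_ = np.linalg.lstsq(design_matrix(x[train], degree), y[train], rcond=None)

# Error on the data used for the fit versus error on held-out data
train_error = np.mean((y[train] - design_matrix(x[train], degree) @ coef) ** 2)
validation_error = np.mean((y[test] - design_matrix(x[test], degree) @ coef) ** 2)
print(train_error, validation_error)
```

Only `validation_error` is computed on points that played no part in the fit, so it is the quantity to trust when judging predictive quality.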
K-fold cross-validation error estimate
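The $K$-fold technique splits the experimental design $\mathcal{X}$ into $K$ mutually exclusive sub-samples $\mathcal{X}_1, \dots, \mathcal{X}_K$, called folds. One fold $\mathcal{X}_k$ is set aside, the metamodel $\hat{h}^{(-\mathcal{X}_k)}$ is built from the remaining points, and the approximation error is estimated on the set-aside fold:

$$Err^{(k)} = \frac{1}{|\mathcal{X}_k|} \sum_{\boldsymbol{x}^{(i)} \in \mathcal{X}_k} \left( h(\boldsymbol{x}^{(i)}) - \hat{h}^{(-\mathcal{X}_k)}(\boldsymbol{x}^{(i)}) \right)^2$$

in which $h(\boldsymbol{x}^{(i)}) - \hat{h}^{(-\mathcal{X}_k)}(\boldsymbol{x}^{(i)})$ is the predicted residual, defined as the difference between the evaluation of the model $h$ and the prediction with $\hat{h}^{(-\mathcal{X}_k)}$ at $\boldsymbol{x}^{(i)}$ in the sub-sample $\mathcal{X}_k$, whose cardinality is $|\mathcal{X}_k|$. Each fold is used in turn as the test set, and the $K$-fold error estimate is the average over the folds:

$$Err_{K\text{-fold}} = \frac{1}{K} \sum_{k=1}^{K} Err^{(k)}$$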
As described above, the $K$-fold error estimate can be obtained with a single split of the data into $K$ folds. It is worth noting that one can repeat the cross-validation multiple times using different divisions into folds to obtain a better Monte Carlo estimate. This obviously comes at an additional computational cost.
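A minimal sketch of the $K$-fold estimate and of its repeated variant follows. The helper name `kfold_error`, the Legendre least-squares metamodel, the toy response and the choice of 5 folds and 20 repetitions are all assumptions for illustration.

```python
import numpy as np

def kfold_error(x, y, degree, n_folds, rng):
    """Average over the folds of the mean squared predicted residual."""
    folds = np.array_split(rng.permutation(x.size), n_folds)
    fold_errors = []
    for test in folds:
        # Train on every point that is not in the set-aside fold
        train = np.setdiff1d(np.arange(x.size), test)
        psi_train = np.polynomial.legendre.legvander(x[train], degree)
        coef, *_ = np.linalg.lstsq(psi_train, y[train], rcond=None)
        psi_test = np.polynomial.legendre.legvander(x[test], degree)
        residuals = y[test] - psi_test @ coef
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
y = np.exp(x) * np.sin(3.0 * x)   # hypothetical model response, illustration only

single = kfold_error(x, y, degree=5, n_folds=5, rng=rng)
# Repeating with different fold assignments and averaging reduces the
# dependence on one particular split, at the cost of more regressions.
repeated = np.mean([kfold_error(x, y, 5, 5, rng) for _ in range(20)])
print(single, repeated)
```

Each call performs `n_folds` regressions, so the repeated estimate above costs 100 regressions instead of 5.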
Classical leave-one-out error estimate
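The leave-one-out (LOO) technique is the special case of $K$-fold cross-validation in which each fold contains a single point, i.e. $K = N$, where $N$ is the size of the experimental design $\mathcal{X} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$. Recall that a $P$-term polynomial chaos approximation of the model response reads:

$$\hat{h}(\boldsymbol{x}) = \sum_{j=0}^{P-1} \hat{a}_j \, \psi_j(\boldsymbol{x})$$

where the $\hat{a}_j$'s are estimates of the coefficients obtained by a specific method, e.g. least squares.

Let $\hat{h}^{(-i)}$ denote the approximation built from the experimental design with the $i$-th observation removed. The predicted residual is the difference between the model evaluation at $\boldsymbol{x}^{(i)}$ and its prediction by $\hat{h}^{(-i)}$:

$$\Delta^{(i)} = h(\boldsymbol{x}^{(i)}) - \hat{h}^{(-i)}(\boldsymbol{x}^{(i)})$$

By repeating this process for all observations in the experimental design, one obtains the predicted residuals $\Delta^{(i)}$, $i = 1, \dots, N$. Finally, the LOO error is estimated as follows:

$$Err_{LOO} = \frac{1}{N} \sum_{i=1}^{N} \left( \Delta^{(i)} \right)^2$$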
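Due to the linear-in-parameters form of the polynomial chaos expansion, the quantity $Err_{LOO}$ may be computed without performing further regression calculations when the PC coefficients have been estimated by least squares using the entire experimental design $\mathcal{X}$ of size $N$. Indeed, the predicted residuals $\Delta^{(i)}$ can be obtained analytically from the approximation $\hat{h}$ built on the full design:

$$\Delta^{(i)} = \frac{h(\boldsymbol{x}^{(i)}) - \hat{h}(\boldsymbol{x}^{(i)})}{1 - h_i}$$

where $h_i$ is the $i$-th diagonal term of the matrix $\boldsymbol{\Psi} (\boldsymbol{\Psi}^{T} \boldsymbol{\Psi})^{-1} \boldsymbol{\Psi}^{T}$, with $\boldsymbol{\Psi}$ being the $N \times P$ information matrix:

$$\boldsymbol{\Psi}_{ij} = \psi_j(\boldsymbol{x}^{(i)}), \quad i = 1, \dots, N, \quad j = 0, \dots, P-1$$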
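The analytical shortcut can be checked numerically against the brute-force procedure that refits the expansion once per left-out point. The sketch below assumes an ordinary least-squares fit on a one-dimensional Legendre basis and a toy response. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, degree = 30, 4
x = rng.uniform(-1.0, 1.0, size=n)
y = np.exp(x) * np.sin(3.0 * x)                     # hypothetical model response
psi = np.polynomial.legendre.legvander(x, degree)   # information matrix, N x P

# Brute force: one regression per left-out point
delta_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef_i, *_ = np.linalg.lstsq(psi[keep], y[keep], rcond=None)
    delta_refit[i] = y[i] - psi[i] @ coef_i

# Analytical shortcut: a single regression on the full design
coef, *_ = np.linalg.lstsq(psi, y, rcond=None)
# Diagonal of psi (psi^T psi)^{-1} psi^T, without forming the N x N matrix
leverage = np.einsum("ij,ji->i", psi, np.linalg.solve(psi.T @ psi, psi.T))
delta_analytic = (y - psi @ coef) / (1.0 - leverage)

err_loo = np.mean(delta_analytic ** 2)
print(np.allclose(delta_refit, delta_analytic), err_loo)
```

The brute-force loop solves $N$ least-squares problems, whereas the shortcut needs one fit plus the diagonal of the projection matrix.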
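In practice, one often computes the following normalized LOO error:

$$\varepsilon_{LOO} = \frac{Err_{LOO}}{\widehat{\mathrm{Cov}}[\mathcal{Y}]}$$

where $\widehat{\mathrm{Cov}}[\mathcal{Y}]$ denotes the empirical covariance of the response sample $\mathcal{Y} = \{y^{(i)} = h(\boldsymbol{x}^{(i)}),\ i = 1, \dots, N\}$:

$$\widehat{\mathrm{Cov}}[\mathcal{Y}] = \frac{1}{N-1} \sum_{i=1}^{N} \left( y^{(i)} - \bar{y} \right)^2, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y^{(i)}$$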
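As an illustration, the normalized LOO error can be computed for expansions of increasing degree. The helper `loo_errors` is hypothetical, the fit is ordinary least squares on a Legendre basis, and the normalization uses the unbiased empirical variance (`ddof=1`), which is an assumption of this sketch.

```python
import numpy as np

def loo_errors(psi, y):
    """Return (Err_LOO, normalized LOO error) for a least-squares fit."""
    coef, *_ = np.linalg.lstsq(psi, y, rcond=None)
    leverage = np.einsum("ij,ji->i", psi, np.linalg.solve(psi.T @ psi, psi.T))
    delta = (y - psi @ coef) / (1.0 - leverage)
    err_loo = np.mean(delta ** 2)
    # ddof=1 gives the unbiased empirical variance of the response sample
    return err_loo, err_loo / np.var(y, ddof=1)

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.exp(x) * np.sin(3.0 * x)   # hypothetical model response sample

for degree in (1, 3, 6):
    psi = np.polynomial.legendre.legvander(x, degree)
    err, eps = loo_errors(psi, y)
    print(degree, err, eps)
```

Because it is divided by the spread of the response, the normalized error is dimensionless. Values close to 0 indicate a metamodel that predicts left-out points well, while values near 1 indicate one that does little better than the sample mean.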
Corrected leave-one-out error estimate
A penalized variant of $\varepsilon_{LOO}$ may be used in order to increase its robustness with respect to overfitting, i.e. to penalize a large number of terms $P$ in the PC expansion compared to the size $N$ of the experimental design:

$$\varepsilon^{*}_{LOO} = T(P, N) \, \varepsilon_{LOO}$$
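The penalty factor $T(P, N)$ is defined by:

$$T(P, N) = \frac{N}{N - P} \left( 1 + \frac{\mathrm{tr}\left( \boldsymbol{C}_{emp}^{-1} \right)}{N} \right)$$

where $\boldsymbol{C}_{emp} = \frac{1}{N} \boldsymbol{\Psi}^{T} \boldsymbol{\Psi}$, with $\boldsymbol{\Psi}$ the information matrix, $P$ the number of terms in the PC expansion and $N$ the size of the experimental design.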
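The sketch below shows how such a correction might be applied. It assumes the penalty factor $T(P, N) = \frac{N}{N-P}\left(1 + \mathrm{tr}(\boldsymbol{C}_{emp}^{-1})/N\right)$ with $\boldsymbol{C}_{emp} = \boldsymbol{\Psi}^{T}\boldsymbol{\Psi}/N$, a least-squares fit on a Legendre basis, and a toy response. The helper `corrected_loo` is hypothetical.

```python
import numpy as np

def corrected_loo(psi, y):
    """Return (normalized LOO error, penalized LOO error); requires N > P."""
    n, p = psi.shape
    gram = psi.T @ psi
    coef = np.linalg.solve(gram, psi.T @ y)          # least-squares coefficients
    leverage = np.einsum("ij,ji->i", psi, np.linalg.solve(gram, psi.T))
    delta = (y - psi @ coef) / (1.0 - leverage)      # predicted residuals
    eps_loo = np.mean(delta ** 2) / np.var(y, ddof=1)
    # Assumed penalty factor: N / (N - P) * (1 + tr(C_emp^{-1}) / N)
    c_emp = gram / n
    penalty = n / (n - p) * (1.0 + np.trace(np.linalg.inv(c_emp)) / n)
    return eps_loo, penalty * eps_loo

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=25)
y = np.exp(x) * np.sin(3.0 * x)   # hypothetical model response sample

for degree in (2, 5, 8):
    psi = np.polynomial.legendre.legvander(x, degree)
    eps, eps_star = corrected_loo(psi, y)
    print(degree + 1, eps, eps_star)
```

Under this definition the factor is always greater than 1 when $N > P$, so the corrected error is never smaller than the uncorrected one.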
Leave-one-out cross validation is also known as jackknife in statistics.