Gaussian process regression

Gaussian process regression (also known as Kriging) is a Bayesian technique that aims at approximating a function (most often in order to build a surrogate for it, because it is expensive to evaluate).

Let \model: \Rset^\inputDim \rightarrow \Rset^\outputDim be a model and let (\vect{x}_k, \vect{y}_k)_{1 \leq k \leq \sampleSize} be a data set where \vect{y}_k = \model(\vect{x}_k) for all k.
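For instance, here is a minimal sketch of such a data set with OpenTURNS (the model, the input distribution and the sample size are illustrative choices only):

    import openturns as ot

    # Illustrative model g: R -> R; an expensive model would be wrapped the same way
    model = ot.SymbolicFunction(["x"], ["x * sin(x)"])

    # Data set (x_k, y_k), k = 1, ..., sampleSize
    sampleSize = 10
    inputSample = ot.Uniform(0.0, 8.0).getSample(sampleSize)
    outputSample = model(inputSample)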

The objective is to build a metamodel \metaModel, using a Gaussian process that interpolates the data set. To build this metamodel, we proceed in three steps:

  • Step 1: Gaussian process fitting: we build the Gaussian process \vect{Y}(\omega, \vect{x}) defined by \vect{Y}(\omega, \vect{x}) = \vect{\mu}(\vect{x}) + \vect{W}(\omega, \vect{x}) where \vect{\mu} is the trend function and \vect{W} is a Gaussian process of dimension \outputDim with zero mean and covariance function \mat{C}. We also show how to take into account noise observed on the output values of the function \model;

  • Step 2: Gaussian Process Regression: we condition the Gaussian process \vect{Y} on the data set by considering the Gaussian Process Regression denoted by \vect{Z}(\omega, \vect{x}) = \vect{Y}(\omega, \vect{x})\, | \, \cC where \cC is the condition \vect{Y}(\omega, \vect{x}_k) = \vect{y}_k for 1 \leq k \leq \sampleSize;

  • Step 3: Gaussian Process Regression metamodel and its exploitation: we define the metamodel as \metaModel(\vect{x}) = \Expect{\vect{Y}(\omega, \vect{x})\, | \, \cC}. Note that this metamodel interpolates the data set when the noise parameter is zero. We can use the conditional covariance to quantify the error of the metamodel, that is, the variability of the conditioned Gaussian vector at a given point.

Note that the Gaussian process regression of a function with multivariate output (\outputDim > 1) is performed jointly on all the output components, not by simply looping over each of them.

Step 1: Gaussian process fitting with and without noise

The first step creates the Gaussian process \vect{Y}(\omega, \vect{x}) such that the sample (\vect{y}_k)_{1 \leq k \leq \sampleSize} is regarded as its restriction to the input points (\vect{x}_k)_{1 \leq k \leq \sampleSize}. It is defined by:

\vect{Y}(\omega, \vect{x}) = \vect{\mu}(\vect{x}) + \vect{W}(\omega, \vect{x})

where:

\vect{\mu}(\vect{x}) = \begin{pmatrix}
   \mu_1(\vect{x}) \\
   \vdots \\
   \mu_\outputDim(\vect{x})
 \end{pmatrix}

with \mu_\ell(\vect{x}) = \sum_{j=1}^{b} \beta_j^\ell \varphi_j(\vect{x}), where the functions \varphi_j: \Rset^\inputDim \rightarrow \Rset form the trend function basis, for 1 \leq j \leq b and 1 \leq \ell \leq \outputDim.
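In OpenTURNS, the trend function basis (\varphi_j)_{1 \leq j \leq b} is usually built with a basis factory; a sketch for a one-dimensional input:

    import openturns as ot

    inputDimension = 1
    # Constant trend: b = 1 and varphi_1(x) = 1
    constantBasis = ot.ConstantBasisFactory(inputDimension).build()
    # Linear trend: varphi_1(x) = 1, varphi_2(x) = x
    linearBasis = ot.LinearBasisFactory(inputDimension).build()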

Furthermore, \vect{W} is a Gaussian process of dimension \outputDim with zero mean and covariance function \mat{C} = \mat{C}(\vect{\theta}, \vect{\sigma}, \mat{R}, \vect{\lambda}), where (see Covariance models for more details on the notations):

  • \vect{\theta} \in \Rset^\inputDim is the scale parameter vector,

  • \vect{\sigma} \in \Rset^\outputDim is the standard deviation parameter vector,

  • \mat{R} \in \cS_{\outputDim}^+(\Rset) is the spatial correlation matrix between the components of \vect{W},

  • \vect{\lambda} gathers some additional parameters specific to each covariance model.

Then, we have to estimate the coefficients \beta_j^\ell and \vect{p}, where \vect{p} is the vector of parameters of the covariance model (a subset of \vect{\theta}, \vect{\sigma}, \mat{R}, \vect{\lambda}) that has been declared active: by default, the full vectors \vect{\theta} and \vect{\sigma}. See openturns.CovarianceModel for details on how to activate the estimation of the other parameters.
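As an illustration, a sketch of a covariance model and of its active parameters (the Matérn model and the numerical values are arbitrary choices):

    import openturns as ot

    # Matern covariance model with scale theta = [1.0], amplitude sigma = [1.0] and nu = 2.5
    covarianceModel = ot.MaternModel([1.0], [1.0], 2.5)
    # By default, the scale and the amplitude parameters are active, i.e. estimated
    print(covarianceModel.getActiveParameter())
    # For instance, restrict the estimation to the scale parameter only
    covarianceModel.setActiveParameter([0])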

The estimation is done by maximizing the reduced log-likelihood of the model (see its expression below).

Estimation of the parameters: We want to estimate all the parameters \left(\beta_j^\ell \right) for 1 \leq j \leq b and 1 \leq \ell \leq \outputDim, and the vector of parameters \vect{p}.

We note:

\vect{y} = \begin{pmatrix}
    \vect{y}_1 \\
    \vdots \\
    \vect{y}_{\sampleSize}
\end{pmatrix} \in \Rset^{\outputDim \times \sampleSize},
\quad
\vect{m}_{\vect{\beta}} = \begin{pmatrix}
    \vect{\mu}(\vect{x}_1) \\
    \vdots \\
    \vect{\mu}(\vect{x}_{\sampleSize})
\end{pmatrix} \in \Rset^{\outputDim \times \sampleSize}

and:

(1)\mat{C}_{\vect{p}} = \begin{pmatrix}
  \mat{C}_{11} & \dots & \mat{C}_{1\sampleSize} \\
  \vdots & & \vdots \\
  \mat{C}_{\sampleSize 1} & \dots & \mat{C}_{\sampleSize\sampleSize}
\end{pmatrix} \in \cS_{\outputDim \times \sampleSize}^+(\Rset)

where \mat{C}_{ij} = C_{\vect{p}}(\vect{x}_i, \vect{x}_j) \in \cS_{\outputDim \times \outputDim}^+(\Rset).

The likelihood of the Gaussian process on the data set is defined by:

(2)\cL(\vect{\beta}, \vect{p};(\vect{x}_k, \vect{y}_k)_{1 \leq k \leq \sampleSize})
= \dfrac{1}{(2\pi)^{\outputDim \times \sampleSize/2} \, |\det \left(\mat{C}_{\vect{p}}\right)|^{1/2}}
\exp\left( -\dfrac{1}{2} \Tr{\left( \vect{y}-\vect{m}_{\vect{\beta}} \right)} \, \mat{C}_{\vect{p}}^{-1} \left( \vect{y}-\vect{m}_{\vect{\beta}} \right) \right)

Let \mat{L}_{\vect{p}} be the Cholesky factor of \mat{C}_{\vect{p}}: it means that \mat{L}_{\vect{p}} is the lower triangular matrix with positive diagonal such that \mat{L}_{\vect{p}} \,\Tr{\mat{L}_{\vect{p}}} = \mat{C}_{\vect{p}}. Therefore:

(3)\log \cL(\vect{\beta}, \vect{p};(\vect{x}_k, \vect{y}_k)_{1 \leq k \leq \sampleSize})
= \alpha - \log \left( \det \left(\mat{L}_{\vect{p}}\right) \right)
-\dfrac{1}{2}  \left\| \mat{L}_{\vect{p}}^{-1}(\vect{y}-\vect{m}_{\vect{\beta}}) \right\|^2_2

where \alpha \in \Rset is a constant independent of \vect{\beta} and \vect{p}. The maximization of (3) leads to the following optimality condition for \vect{\beta}:

\vect{\beta}^*(\vect{p}^*)
= \argmin_{\vect{\beta}} \left\| \mat{L}_{\vect{p}^*}^{-1} \left(\vect{y} - \vect{m}_{\vect{\beta}} \right) \right\|^2_2

This expression of \vect{\beta}^* as a function of \vect{p}^* is taken as a general relation between \vect{\beta} and \vect{p} and is substituted into (3), leading to a reduced log-likelihood function depending solely on \vect{p}.
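Writing the trend as \vect{m}_{\vect{\beta}} = \mat{F} \vect{\beta}, where \mat{F} denotes the matrix of the trend functions evaluated at the input points (a notation introduced here for illustration), this minimization is a generalized least-squares problem whose solution, assuming \mat{F} has full column rank, is:

\vect{\beta}^*(\vect{p}) = \left( \Tr{\mat{F}} \, \mat{C}_{\vect{p}}^{-1} \, \mat{F} \right)^{-1} \Tr{\mat{F}} \, \mat{C}_{\vect{p}}^{-1} \, \vect{y}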

In the particular case where \outputDim = \dim(\vect{\sigma}) = 1 and \sigma is part of \vect{p}, a further reduction is possible. In this case, if \vect{q} is the vector \vect{p} in which \sigma has been replaced by 1, then:

\left\| \mat{L}_{\vect{p}}^{-1}(\vect{y}-\vect{m}_{\vect{\beta}}) \right\|^2_2
= \frac{1}{\sigma^2} \left\| \mat{L}_{\vect{q}}^{-1} \left(\vect{y} - \vect{m}_{\vect{\beta}} \right) \right\|^2_2

showing that \vect{\beta}^* is a function of \vect{q}^* only. The optimality condition for \sigma then reads:

\left(\sigma^*\right)^2(\vect{q}^*)
= \dfrac{1}{\sampleSize}
\left\| \mat{L}_{\vect{q}^*}^{-1} \left(\vect{y} - \vect{m}_{\vect{\beta}^*(\vect{q}^*)}\right) \right\|^2_2

which leads to a further reduction of the log-likelihood function, where both \vect{\beta} and \sigma are replaced by their expressions in terms of \vect{q}.

This step is performed by the class GaussianProcessFitter.
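A minimal fitting sketch follows; it assumes that GaussianProcessFitter is available from the openturns.experimental module (as in recent OpenTURNS versions) and that its result exposes the trend coefficients and the estimated covariance model through the accessors shown below:

    import openturns as ot
    import openturns.experimental as otexp

    # Illustrative data set (same choices as above)
    model = ot.SymbolicFunction(["x"], ["x * sin(x)"])
    inputSample = ot.Uniform(0.0, 8.0).getSample(10)
    outputSample = model(inputSample)

    # Trend basis and covariance model of W
    basis = ot.LinearBasisFactory(1).build()
    covarianceModel = ot.SquaredExponential([1.0], [1.0])

    # Step 1: estimate the trend coefficients beta and the covariance parameters p
    fitter = otexp.GaussianProcessFitter(inputSample, outputSample, covarianceModel, basis)
    fitter.run()  # maximizes the reduced log-likelihood
    fitterResult = fitter.getResult()
    print(fitterResult.getTrendCoefficients())
    print(fitterResult.getCovarianceModel())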

Adding noise to the output values: We show how to take into account the fact that the output values of the function are not known precisely. This noise is modeled by a normal distribution with zero mean and a covariance matrix \mat{\Sigma}_i^{\text{noise}} \in \cM_{\outputDim \times \outputDim}(\Rset) that can be specific to the output \vect{y}_i. This means that each output \vect{y}_i is considered as a realization of the random vector \vect{Y}_i defined by:

\vect{Y}_i = \vect{y}_i^{\text{true}} + \vect{\varepsilon}, \quad
\vect{\varepsilon} \sim \cN \left(\vect{0}, \mat{\Sigma}_i^{\text{noise}}\right)

where \vect{y}_i^{\text{true}} is the true (and unknown) value of the model at \vect{x}_i.

If the covariance matrices \mat{\Sigma}_i^{\text{noise}} are different, the noise is said to be heteroscedastic. Otherwise, the noise is homoscedastic.

The noise is introduced during the parameter estimation step: in the likelihood expression defined in (2), the covariance matrix of the process defined in (1) is replaced by the covariance matrix \mat{C}^{\text{noise}}_{\vect{p}} defined by:

\mat{C}^{\text{noise}}_{\vect{p}} =
  \mat{C}_{\vect{p}} + \text{diag}\left(\mat{\Sigma}_1^{\text{noise}}, \dots, \mat{\Sigma}_{\sampleSize}^{\text{noise}}\right)
  \in \cS_{\outputDim \times \sampleSize}^+(\Rset)

Thus the covariance matrices of the noise are added to the block diagonal of the initial covariance matrix.

Note that the noise is taken into account only to estimate the parameters. Once the parameters have been estimated, the final covariance model of the process is still defined by the initial covariance function \mat{C}.

Use the method setNoise of the class GaussianProcessFitter.
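A possible continuation of the fitting sketch above; it assumes, by analogy with the former Kriging classes, that setNoise takes one noise variance per output value (here a homoscedastic scalar noise with an arbitrary value):

    # Homoscedastic scalar noise: one (identical) noise variance per observed output
    noiseVariances = [0.05] * outputSample.getSize()
    fitter.setNoise(noiseVariances)
    fitter.run()  # the noise enters the likelihood (2) during the estimation only
    fitterResult = fitter.getResult()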

Note that the noise is different from the nugget effect: refer to Covariance models for more details on covariance models and the introduction of a nugget factor, and in particular see equation (5) there. In short, the nugget effect modifies the diagonal of the covariance matrix, and this modification remains once the parameters of the covariance model have been estimated. The resulting process is less smooth than the initial one without the nugget effect.

Step 2: Gaussian Process Regression

Once the Gaussian process \vect{Y} has been estimated, the Gaussian process regression conditions it on the data set, so that the resulting Gaussian process approximation interpolates the data set.

The final Gaussian process regression denoted by \vect{Z} is defined by:

(4)\vect{Z}(\omega, \vect{x}) = \vect{Y}(\omega, \vect{x})\, | \,  \cC

where \cC is the condition \vect{Y}(\omega, \vect{x}_k) = \vect{y}_k for 1 \leq k \leq \sampleSize.

Then \vect{Z} is a Gaussian process whose mean is defined by:

\Expect{\vect{Z}(\omega, \vect{x})} & = \Expect{\vect{Y}(\omega, \vect{x})\, | \, \cC} \\
& = \vect{\mu}(\vect{x}) + \Cov{\vect{Y}(\omega, \vect{x}), (\vect{Y}(\omega, \vect{x}_1), \dots, \vect{Y}(\omega, \vect{x}_{\sampleSize}))} \, \vect{\gamma}

where:

\Cov{\vect{Y}(\omega, \vect{x}), (\vect{Y}(\omega, \vect{x}_1), \dots, \vect{Y}(\omega, \vect{x}_{\sampleSize}))} = \left( \mat{C}(\vect{x}, \vect{x}_1) | \dots | \mat{C}(\vect{x}, \vect{x}_{\sampleSize}) \right) \in \cM_{\outputDim, \sampleSize \times \outputDim}(\Rset)

and \vect{\gamma} is defined by:

(5)\vect{\gamma} = \mat{C}^{-1} \left( \vect{y}_1 - \vect{\mu}(\vect{x}_1), \dots, \vect{y}_{\sampleSize} - \vect{\mu}(\vect{x}_{\sampleSize}) \right) \in \cM_{\sampleSize \times \outputDim, 1}(\Rset)

where:

\mat{C} = \Cov{\vect{Y}(\omega, \vect{x}_1), \dots, \vect{Y}(\omega, \vect{x}_{\sampleSize})} \in \cM_{\sampleSize \times \outputDim, \sampleSize \times \outputDim}(\Rset)

Finally, we get the following mean of the Gaussian process regression at the point \vect{x}:

(6)\Expect{\vect{Z}(\omega, \vect{x})} = \vect{\mu}(\vect{x}) + \sum_{i=1}^{\sampleSize} \gamma_i \mat{C}(\vect{x}, \vect{x}_i) \in \Rset^{\outputDim}

The covariance matrix of \vect{Z} at the point \vect{x} is defined by:

(7)\Cov{\vect{Z}(\omega, \vect{x})} & = \Cov{\vect{Y}(\omega, \vect{x}), \vect{Y}(\omega, \vect{x})} \\
& \quad - \Cov{\vect{Y}(\omega, \vect{x}), (\vect{Y}(\omega, \vect{x}_1), \dots, \vect{Y}(\omega, \vect{x}_{\sampleSize}))} \, \mat{C}^{-1} \, \Cov{(\vect{Y}(\omega, \vect{x}_1), \dots, \vect{Y}(\omega, \vect{x}_{\sampleSize})), \vect{Y}(\omega, \vect{x})}

with \Cov{\vect{Z}(\omega, \vect{x})} \in \cM_{\outputDim \times \outputDim}(\Rset).

When computed on any sample (\vect{\xi}_1, \dots, \vect{\xi}_\sampleSize), the covariance matrix is defined by:

(8)\Cov{(\vect{Z}(\omega, \vect{\xi}_1), \dots, \vect{Z}(\omega, \vect{\xi}_\sampleSize))} =
    \begin{pmatrix}
        \Sigma_{11} & \dots & \Sigma_{1 \sampleSize} \\
        \vdots & & \vdots \\
        \Sigma_{\sampleSize 1} & \dots & \Sigma_{\sampleSize \sampleSize}
      \end{pmatrix}

where \Sigma_{ij} = \Cov{\vect{Z}(\omega, \vect{\xi}_i), \vect{Z}(\omega, \vect{\xi}_j)}.

This step is performed by the class GaussianProcessRegression.
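Continuing the sketch above, Step 2 may look as follows (assuming GaussianProcessRegression can be built from the fitter result, as in openturns.experimental):

    # Step 2: condition the fitted Gaussian process Y on the data set
    gpr = otexp.GaussianProcessRegression(fitterResult)
    gpr.run()
    gprResult = gpr.getResult()
    metamodel = gprResult.getMetaModel()  # x -> E[Z(omega, x) | C], see (9)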

Step 3: Gaussian Process Regression metamodel and its exploitation

The Gaussian Process Regression metamodel \metaModel is defined by:

(9)\metaModel(\vect{x}) = \Expect{\vect{Z}(\omega, \vect{x})} = \vect{\mu}(\vect{x}) + \sum_{i=1}^{\sampleSize} \gamma_i \mat{C}(\vect{x}, \vect{x}_i).

We can use the conditional covariance of \vect{Y} to quantify the predictive uncertainty of the metamodel. The class GaussianProcessConditionalCovariance provides all the services needed to quantify the predictive error at any point.
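A sketch of this exploitation step, continuing the example above (the evaluation point and the accessor names are assumptions based on the class interface):

    # Step 3: evaluate the metamodel and quantify its predictive uncertainty
    xNew = ot.Point([3.0])
    print(metamodel(xNew))                       # prediction, equation (9)

    gpcc = otexp.GaussianProcessConditionalCovariance(gprResult)
    print(gpcc.getConditionalCovariance(xNew))   # covariance of Z at xNew, equation (7)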