.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_meta_modeling/general_purpose_metamodels/plot_overfitting_model_selection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_meta_modeling_general_purpose_metamodels_plot_overfitting_model_selection.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_meta_modeling_general_purpose_metamodels_plot_overfitting_model_selection.py:


Over-fitting and model selection
================================

.. GENERATED FROM PYTHON SOURCE LINES 9-22

Introduction
------------

In this notebook, we present the problem of over-fitting a model to data.
We consider noisy observations of the `sine` function.
We estimate the coefficients of the univariate polynomial based on linear
least squares and show that, when the degree of the polynomial becomes too
large, the overall prediction quality decreases.

This shows why and how model selection can come into play in order to select
the degree of the polynomial: there is a trade-off between fitting the
data and preserving the quality of future predictions.
In this example, we use cross validation as a model selection method.

.. GENERATED FROM PYTHON SOURCE LINES 24-29

References
----------

* Bishop Christopher M., 1995, Neural networks for pattern recognition. Figure 1.4, page 7


.. GENERATED FROM PYTHON SOURCE LINES 31-35

Compute the data
----------------

In this section, we generate noisy observations from the `sine` function.

.. GENERATED FROM PYTHON SOURCE LINES 37-42

.. code-block:: Python

    import openturns as ot
    from matplotlib import pyplot as plt
    import openturns.viewer as otv


.. GENERATED FROM PYTHON SOURCE LINES 43-44

We define the function that we are going to approximate.

.. GENERATED FROM PYTHON SOURCE LINES 44-46

.. code-block:: Python

    g = ot.SymbolicFunction(["x"], ["sin(2*pi_*x)"])


.. GENERATED FROM PYTHON SOURCE LINES 47-55

.. code-block:: Python

    graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "upper right")
    # The "unknown" function
    curve = g.draw(0, 1)
    curve.setLegends(['"Unknown" function'])
    graph.add(curve)
    view = otv.View(graph)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_001.svg
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_001.svg
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 56-57

This seems a smooth function to approximate with polynomials.

.. GENERATED FROM PYTHON SOURCE LINES 60-69

.. code-block:: Python

    def linearSample(xmin, xmax, npoints):
        """Returns a sample created from a regular grid
        from xmin to xmax with npoints points."""
        step = (xmax - xmin) / (npoints - 1)
        rg = ot.RegularGrid(xmin, step, npoints)
        vertices = rg.getVertices()
        return vertices


.. GENERATED FROM PYTHON SOURCE LINES 70-71

We consider 10 observation points in the interval [0,1].

.. GENERATED FROM PYTHON SOURCE LINES 73-76

.. code-block:: Python

    n_train = 10
    x_train = linearSample(0, 1, n_train)


.. GENERATED FROM PYTHON SOURCE LINES 77-78

Assume that the observations are noisy and that the noise follows a Normal distribution with zero mean and small standard deviation.

.. GENERATED FROM PYTHON SOURCE LINES 80-83

.. code-block:: Python

    noise = ot.Normal(0, 0.1)
    noiseSample = noise.getSample(n_train)


.. GENERATED FROM PYTHON SOURCE LINES 84-86

The following computes the observation as the sum of the function value and of the noise.
The couple (`x_train` , `y_train`) is the training set: it is used to compute the coefficients of the polynomial model.

.. GENERATED FROM PYTHON SOURCE LINES 88-90

.. code-block:: Python

    y_train = g(x_train) + noiseSample


.. GENERATED FROM PYTHON SOURCE LINES 91-103

.. code-block:: Python

    graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "upper right")
    # The "unknown" function
    curve = g.draw(0, 1)
    curve.setLegends(['"Unknown" function'])
    graph.add(curve)
    # Training set
    cloud = ot.Cloud(x_train, y_train)
    cloud.setPointStyle("circle")
    cloud.setLegend("Observations")
    graph.add(cloud)
    view = otv.View(graph)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_002.svg
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_002.svg
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 104-126

Compute the coefficients of the polynomial decomposition
--------------------------------------------------------

Let :math:`\vect{y} \in \Rset^n` be a vector of observations.
The polynomial model is

.. math::
   P(x) = \beta_0 + \beta_1 x + ... + \beta_p x^p,

for any :math:`x \in \Rset`, where :math:`p` is the polynomial degree and :math:`\vect{\beta} \in \Rset^{p+1}` is the vector of the coefficients of the model.
Let :math:`\sampleSize` be the training sample size and let :math:`x_1,...,x_\sampleSize \in \Rset` be the abscissas of the training set.
The design matrix :math:`\mat{X} \in \Rset^{n \times (p+1)}` is

.. math::
   x_{i,j} = x^j_i,

for :math:`i=1,...,n` and :math:`j=0,...,p`.
The least squares solution is:

.. math::
   \beta^\star = \argmin_{\vect{\beta} \in \Rset^{p+1}} \| \mat{X} \vect{\beta} - \vect{y}\|_2^2.


.. GENERATED FROM PYTHON SOURCE LINES 128-133

In order to approximate the function with polynomials up to degree 4,
we create a list of strings containing the associated monomials.
We do not include a constant in the polynomial basis, as this constant term
is automatically included in the model by the :class:`~openturns.LinearLeastSquares` class.
We perform the loop from 1 up to `total_degree` (but the `range` function takes `total_degree + 1` as its second input argument).

.. GENERATED FROM PYTHON SOURCE LINES 135-139

.. code-block:: Python

    total_degree = 4
    polynomialCollection = ["x^%d" % (degree) for degree in range(1, total_degree + 1)]
    polynomialCollection


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    ['x^1', 'x^2', 'x^3', 'x^4']


.. GENERATED FROM PYTHON SOURCE LINES 140-141

Given the list of strings, we create a symbolic function which computes the values of the monomials.

.. GENERATED FROM PYTHON SOURCE LINES 143-146

.. code-block:: Python

    basis = ot.SymbolicFunction(["x"], polynomialCollection)
    basis


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    [x]->[x^1,x^2,x^3,x^4]
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 147-150

.. code-block:: Python

    designMatrix = basis(x_train)
    designMatrix


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <table>
      <tr><td></td><th>y0</th><th>y1</th><th>y2</th><th>y3</th></tr>
      <tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>
      <tr><th>1</th><td>0.1111111</td><td>0.01234568</td><td>0.001371742</td><td>0.0001524158</td></tr>
      <tr><th>2</th><td>0.2222222</td><td>0.04938272</td><td>0.01097394</td><td>0.002438653</td></tr>
      <tr><th>3</th><td>0.3333333</td><td>0.1111111</td><td>0.03703704</td><td>0.01234568</td></tr>
      <tr><th>4</th><td>0.4444444</td><td>0.1975309</td><td>0.0877915</td><td>0.03901844</td></tr>
      <tr><th>5</th><td>0.5555556</td><td>0.308642</td><td>0.1714678</td><td>0.09525987</td></tr>
      <tr><th>6</th><td>0.6666667</td><td>0.4444444</td><td>0.2962963</td><td>0.1975309</td></tr>
      <tr><th>7</th><td>0.7777778</td><td>0.6049383</td><td>0.4705075</td><td>0.3659503</td></tr>
      <tr><th>8</th><td>0.8888889</td><td>0.7901235</td><td>0.702332</td><td>0.6242951</td></tr>
      <tr><th>9</th><td>1</td><td>1</td><td>1</td><td>1</td></tr>
    </table>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 151-154

.. code-block:: Python

    myLeastSquares = ot.LinearLeastSquares(designMatrix, y_train)
    myLeastSquares.run()


.. GENERATED FROM PYTHON SOURCE LINES 155-157

.. code-block:: Python

    responseSurface = myLeastSquares.getMetaModel()


.. GENERATED FROM PYTHON SOURCE LINES 158-160

The test set
------------

.. GENERATED FROM PYTHON SOURCE LINES 162-163

The couple (`x_test` , `y_test`) is the test set: it is used to assess the quality of the polynomial model with points that were not used for training.

.. GENERATED FROM PYTHON SOURCE LINES 165-169

.. code-block:: Python

    n_test = 50
    x_test = linearSample(0, 1, n_test)
    y_test = responseSurface(basis(x_test))


.. GENERATED FROM PYTHON SOURCE LINES 170-186

.. code-block:: Python

    graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "upper right")
    # The "unknown" function
    curve = g.draw(0, 1)
    curve.setLegends(['"Unknown" function'])
    graph.add(curve)
    # Training set
    cloud = ot.Cloud(x_train, y_train)
    cloud.setPointStyle("circle")
    cloud.setLegend("Observations")
    graph.add(cloud)
    # Predictions
    curve = ot.Curve(x_test, y_test)
    curve.setLegend("Polynomial Degree = %d" % (total_degree))
    graph.add(curve)
    view = otv.View(graph)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_003.svg
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_003.svg
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 187-189

Compute the residuals
---------------------

.. GENERATED FROM PYTHON SOURCE LINES 191-192

For each observation in the training set, the residual is the vertical distance between the model and the observation.

.. GENERATED FROM PYTHON SOURCE LINES 194-195

sphinx_gallery_thumbnail_number = 4

.. GENERATED FROM PYTHON SOURCE LINES 195-224

.. code-block:: Python

    graph = ot.Graph(
        "Least squares minimizes the sum of the squares of the vertical bars",
        "x",
        "y",
        True,
        "upper right",
    )
    residualsColor = ot.Drawable.BuildDefaultPalette(3)[2]
    # Predictions
    curve = ot.Curve(x_test, y_test)
    curve.setLegend("Polynomial Degree = %d" % (total_degree))
    graph.add(curve)
    # Training set observations
    cloud = ot.Cloud(x_train, y_train)
    cloud.setPointStyle("circle")
    cloud.setLegend("Observations")
    graph.add(cloud)
    # Errors
    ypredicted_train = responseSurface(basis(x_train))
    for i in range(n_train):
        curve = ot.Curve([x_train[i], x_train[i]], [y_train[i], ypredicted_train[i]])
        curve.setColor(residualsColor)
        curve.setLineWidth(2)
        if i == 0:
            curve.setLegend("Residual")
        graph.add(curve)
    view = otv.View(graph)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_004.svg
   :alt: Least squares minimizes the sum of the squares of the vertical bars
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_004.svg
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 225-226

The least squares method minimizes the sum of the squared errors i.e. the sum of the squares of the lengths of the vertical segments.

.. GENERATED FROM PYTHON SOURCE LINES 228-230

We gather the previous computation in two different functions. The `myPolynomialDataFitting` function computes
the least squares solution and `myPolynomialCurveFittingGraph` plots the results.

.. GENERATED FROM PYTHON SOURCE LINES 233-247

.. code-block:: Python

    def myPolynomialDataFitting(total_degree, x_train, y_train):
        """Computes the polynomial curve fitting
        with given total degree.
        This is for learning purposes only: please consider a serious metamodel
        for real applications, e.g. polynomial chaos or kriging."""
        polynomialCollection = ["x^%d" % (degree) for degree in range(1, total_degree + 1)]
        basis = ot.SymbolicFunction(["x"], polynomialCollection)
        designMatrix = basis(x_train)
        myLeastSquares = ot.LinearLeastSquares(designMatrix, y_train)
        myLeastSquares.run()
        responseSurface = myLeastSquares.getMetaModel()
        return responseSurface, basis


.. GENERATED FROM PYTHON SOURCE LINES 248-273

.. code-block:: Python

    def myPolynomialCurveFittingGraph(total_degree, x_train, y_train):
        """Returns the graphics for a polynomial curve fitting
        with given total degree"""
        responseSurface, basis = myPolynomialDataFitting(total_degree, x_train, y_train)
        # Graphics
        n_test = 100
        x_test = linearSample(0, 1, n_test)
        ypredicted_test = responseSurface(basis(x_test))
        # Graphics
        graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "upper right")
        # The "unknown" function
        curve = g.draw(0, 1)
        graph.add(curve)
        # Training set
        cloud = ot.Cloud(x_train, y_train)
        cloud.setPointStyle("circle")
        cloud.setLegend("N=%d" % (x_train.getSize()))
        graph.add(cloud)
        # Predictions
        curve = ot.Curve(x_test, ypredicted_test)
        curve.setLegend("Degree = %d" % (total_degree))
        graph.add(curve)
        return graph


.. GENERATED FROM PYTHON SOURCE LINES 274-275

In order to see the effect of the polynomial degree, we compare the polynomial fit with degrees equal to 0 (constant), 1 (linear), 3 (cubic) and 9.

.. GENERATED FROM PYTHON SOURCE LINES 277-285

.. code-block:: Python

    grid = ot.GridLayout(2, 2)
    grid.setGraph(0, 0, myPolynomialCurveFittingGraph(0, x_train, y_train))
    grid.setGraph(0, 1, myPolynomialCurveFittingGraph(1, x_train, y_train))
    grid.setGraph(1, 0, myPolynomialCurveFittingGraph(3, x_train, y_train))
    grid.setGraph(1, 1, myPolynomialCurveFittingGraph(9, x_train, y_train))
    view = otv.View(grid, figure_kw={"figsize": (8.0, 5.0)})
    plt.subplots_adjust(hspace=0.5, wspace=0.5)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_005.svg
   :alt: , Polynomial curve fitting, Polynomial curve fitting, Polynomial curve fitting, Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_005.svg
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 286-298

When the polynomial degree is low, the fit is satisfying.
The polynomial is close to the observations, although there is still some residual error.

However, when the polynomial degree is high, it produces large oscillations
which significantly deviate from the true function.
This is *overfitting*. This is a pity, given the fact that the polynomial
*exactly* interpolates the observations: the residuals are zeroed.

If the locations of the x abscissas could be changed, then the oscillations could be made smaller.
This is the method used in Gaussian quadrature, where the nodes of interpolation are made closer on the left and right bounds.
In our situation, we make the asssumption that these abscissas cannot be changed: the most obvious choice is to limit the degree of the polynomial.
Another possibility is to include a regularization into the least squares solution.

.. GENERATED FROM PYTHON SOURCE LINES 300-304

Root mean squared error
-----------------------

In order to assess the quality of the polynomial fit, we create a second dataset, the *test set* and compare the value of the polynomial with the test observations.

.. GENERATED FROM PYTHON SOURCE LINES 306-308

.. code-block:: Python

    sqrt = ot.SymbolicFunction(["x"], ["sqrt(x)"])


.. GENERATED FROM PYTHON SOURCE LINES 309-312

In order to see how close the model is to the observations, we compute the root mean square error.

First, we create a degree 4 polynomial which fits the data.

.. GENERATED FROM PYTHON SOURCE LINES 314-318

.. code-block:: Python

    total_degree = 4
    responseSurface, basis = myPolynomialDataFitting(total_degree, x_train, y_train)


.. GENERATED FROM PYTHON SOURCE LINES 319-320

Then we create a test set, with the same method as before.

.. GENERATED FROM PYTHON SOURCE LINES 323-330

.. code-block:: Python

    def createDataset(n):
        x = linearSample(0, 1, n)
        noiseSample = noise.getSample(n)
        y = g(x) + noiseSample
        return x, y


.. GENERATED FROM PYTHON SOURCE LINES 331-334

.. code-block:: Python

    n_test = 100
    x_test, y_test = createDataset(n_test)


.. GENERATED FROM PYTHON SOURCE LINES 335-336

On this test set, we evaluate the polynomial.

.. GENERATED FROM PYTHON SOURCE LINES 338-340

.. code-block:: Python

    ypredicted_test = responseSurface(basis(x_test))


.. GENERATED FROM PYTHON SOURCE LINES 341-342

The vector of residuals is the vector of the differences between the observations and the predictions.

.. GENERATED FROM PYTHON SOURCE LINES 344-346

.. code-block:: Python

    residuals = y_test.asPoint() - ypredicted_test.asPoint()


.. GENERATED FROM PYTHON SOURCE LINES 347-350

The `normSquare` method computes the square of the Euclidian norm (i.e., the 2-norm).
We divide this by the test sample size (so as to compare the error for different sample sizes)
and compute the square root of the result (so that the result has the same unit as `y` ).

.. GENERATED FROM PYTHON SOURCE LINES 352-356

.. code-block:: Python

    RMSE = sqrt([residuals.normSquare() / n_test])[0]
    RMSE


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0.1446476675291098


.. GENERATED FROM PYTHON SOURCE LINES 357-358

The following function gathers the RMSE computation to make the experiment easier.

.. GENERATED FROM PYTHON SOURCE LINES 361-368

.. code-block:: Python

    def computeRMSE(responseSurface, basis, x, y):
        ypredicted = responseSurface(basis(x))
        residuals = y.asPoint() - ypredicted.asPoint()
        RMSE = sqrt([residuals.normSquare() / n_test])[0]
        return RMSE


.. GENERATED FROM PYTHON SOURCE LINES 369-377

.. code-block:: Python

    maximum_degree = 10
    RMSE_train = ot.Sample(maximum_degree, 1)
    RMSE_test = ot.Sample(maximum_degree, 1)
    for total_degree in range(maximum_degree):
        responseSurface, basis = myPolynomialDataFitting(total_degree, x_train, y_train)
        RMSE_train[total_degree, 0] = computeRMSE(responseSurface, basis, x_train, y_train)
        RMSE_test[total_degree, 0] = computeRMSE(responseSurface, basis, x_test, y_test)


.. GENERATED FROM PYTHON SOURCE LINES 378-392

.. code-block:: Python

    degreeSample = ot.Sample([[i] for i in range(maximum_degree)])
    graph = ot.Graph("Root mean square error", "Degree", "RMSE", True, "upper right")
    # Train
    cloud = ot.Curve(degreeSample, RMSE_train)
    cloud.setLegend("Train")
    cloud.setPointStyle("circle")
    graph.add(cloud)
    # Test
    cloud = ot.Curve(degreeSample, RMSE_test)
    cloud.setLegend("Test")
    cloud.setPointStyle("circle")
    graph.add(cloud)
    view = otv.View(graph)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_006.svg
   :alt: Root mean square error
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_006.svg
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 393-401

We see that the RMSE on the train set continuously decreases, reaching zero
when the polynomial degree is so that the number of coefficients is equal to
the train dataset sample size. In this extreme situation, the least squares
solution is equivalent to solving a linear system of equations: this leads to a zero residual.

On the test set however, the RMSE decreases, reaches a flat region,
then increases dramatically when the degree is equal to 9.
Hence, limiting the polynomial degree limits overfitting.

.. GENERATED FROM PYTHON SOURCE LINES 403-407

Increasing the training set
---------------------------

We wonder what happens when the training dataset size is increased.

.. GENERATED FROM PYTHON SOURCE LINES 409-420

.. code-block:: Python

    total_degree = 9
    grid = ot.GridLayout(1, 2)
    n_train = 11
    x_train, y_train = createDataset(n_train)
    grid.setGraph(0, 0, myPolynomialCurveFittingGraph(total_degree, x_train, y_train))
    n_train = 100
    x_train, y_train = createDataset(n_train)
    grid.setGraph(0, 1, myPolynomialCurveFittingGraph(total_degree, x_train, y_train))
    view = otv.View(grid, figure_kw={"figsize": (8.0, 4.0)})
    plt.subplots_adjust(wspace=0.3)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_007.svg
   :alt: , Polynomial curve fitting, Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_007.svg
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 421-422

We see that the polynomial oscillates with a dataset with size 11, but does not with the larger dataset: increasing the training dataset mitigates the oscillations.

.. GENERATED FROM PYTHON SOURCE LINES 422-423

.. code-block:: Python

    otv.View.ShowAll()


.. _sphx_glr_download_auto_meta_modeling_general_purpose_metamodels_plot_overfitting_model_selection.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_overfitting_model_selection.ipynb <plot_overfitting_model_selection.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_overfitting_model_selection.py <plot_overfitting_model_selection.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_overfitting_model_selection.zip <plot_overfitting_model_selection.zip>`