.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_meta_modeling/general_purpose_metamodels/plot_overfitting_model_selection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_meta_modeling_general_purpose_metamodels_plot_overfitting_model_selection.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_meta_modeling_general_purpose_metamodels_plot_overfitting_model_selection.py:


Over-fitting and model selection
================================

.. GENERATED FROM PYTHON SOURCE LINES 8-14

Introduction
------------

In this notebook, we present the problem of over-fitting a model to data.
We consider noisy observations of the sine function.
We estimate the coefficients of the univariate polynomial based on linear least squares
and show that, when the degree of the polynomial becomes too large, the overall prediction quality decreases.

This shows why and how model selection can come into play in order to select the degree of the polynomial:
there is a trade-off between fitting the data and preserving the quality of future predictions.
In this example, we use a simple form of cross validation as the model selection method:
the error is measured on a test set that was not used for training.

.. GENERATED FROM PYTHON SOURCE LINES 16-21

References
----------

* Bishop Christopher M., 1995, Neural networks for pattern recognition. Figure 1.4, page 7

.. GENERATED FROM PYTHON SOURCE LINES 23-27

Compute the data
----------------

In this section, we generate noisy observations from the sine function.

.. GENERATED FROM PYTHON SOURCE LINES 29-33

.. code-block:: default

    import openturns as ot
    import pylab as pl
    import openturns.viewer as otv

.. GENERATED FROM PYTHON SOURCE LINES 34-36

.. code-block:: default

    ot.RandomGenerator.SetSeed(0)

.. GENERATED FROM PYTHON SOURCE LINES 37-38

We define the function that we are going to approximate.

.. GENERATED FROM PYTHON SOURCE LINES 40-42

.. code-block:: default

    g = ot.SymbolicFunction(["x"], ["sin(2*pi_*x)"])

.. GENERATED FROM PYTHON SOURCE LINES 43-52

.. code-block:: default

    graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "topright")
    # The "unknown" function
    curve = g.draw(0, 1)
    curve.setColors(["green"])
    curve.setLegends(['"Unknown" function'])
    graph.add(curve)
    view = otv.View(graph)

.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_001.png
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 53-54

This seems a nice, smooth function to approximate with polynomials.

.. GENERATED FROM PYTHON SOURCE LINES 56-65

.. code-block:: default

    def linearSample(xmin, xmax, npoints):
        """Returns a sample created from a regular grid
        from xmin to xmax with npoints points."""
        step = (xmax - xmin) / (npoints - 1)
        rg = ot.RegularGrid(xmin, step, npoints)
        vertices = rg.getVertices()
        return vertices

.. GENERATED FROM PYTHON SOURCE LINES 66-67

We consider 10 observation points in the interval [0,1].

.. GENERATED FROM PYTHON SOURCE LINES 69-72

.. code-block:: default

    n_train = 10
    x_train = linearSample(0, 1, n_train)

.. GENERATED FROM PYTHON SOURCE LINES 73-74

Assume that the observations are noisy and that the noise follows a Normal distribution
with zero mean and small standard deviation (here 0.1).

.. GENERATED FROM PYTHON SOURCE LINES 76-79

.. code-block:: default

    noise = ot.Normal(0, 0.1)
    noiseSample = noise.getSample(n_train)

.. GENERATED FROM PYTHON SOURCE LINES 80-81

The following computes the observations as the sum of the function value and of the noise.
The couple (`x_train`,`y_train`) is the training set: it is used to compute the coefficients of the polynomial model.

.. GENERATED FROM PYTHON SOURCE LINES 83-85

.. code-block:: default

    y_train = g(x_train) + noiseSample

.. GENERATED FROM PYTHON SOURCE LINES 86-98

.. code-block:: default

    graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "topright")
    # The "unknown" function
    curve = g.draw(0, 1)
    curve.setColors(["green"])
    graph.add(curve)
    # Training set
    cloud = ot.Cloud(x_train, y_train)
    cloud.setPointStyle("circle")
    cloud.setLegend("Observations")
    graph.add(cloud)
    view = otv.View(graph)

.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_002.png
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 99-124

Compute the coefficients of the polynomial decomposition
--------------------------------------------------------

Let :math:`y \in \mathbb{R}^n` be a vector of observations.
The polynomial model is

.. math::
   P(x) = \beta_0 + \beta_1 x + \ldots + \beta_p x^p,

for any :math:`x\in\mathbb{R}`, where :math:`p` is the polynomial degree and
:math:`\beta\in\mathbb{R}^{p+1}` is the vector of the coefficients of the model.
Let :math:`n` be the training sample size and let :math:`x_1,\ldots,x_n \in \mathbb{R}`
be the abscissas of the training set.
The design matrix :math:`X \in \mathbb{R}^{n \times (p+1)}` is

.. math::
   x_{i,j} = x^j_i,

for :math:`i=1,\ldots,n` and :math:`j=0,\ldots,p`.
The least squares solution is:

.. math::
   \beta^\star = \textrm{argmin}_{\beta \in \mathbb{R}^{p+1}} \| X\beta - y\|_2^2.

.. GENERATED FROM PYTHON SOURCE LINES 126-127

In order to approximate the function with polynomials up to degree 4,
we create a list of strings containing the associated monomials.
We do not include a constant in the polynomial basis, as this constant term
is automatically included in the model by the `LinearLeastSquares` class.
We perform the loop from 1 up to `total_degree`
(since the upper bound of `range` is excluded, we pass `total_degree + 1` as its second input argument).

.. GENERATED FROM PYTHON SOURCE LINES 129-134

.. code-block:: default

    total_degree = 4
    polynomialCollection = ["x^%d" % (degree) for degree in range(1, total_degree + 1)]
    polynomialCollection

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    ['x^1', 'x^2', 'x^3', 'x^4']

.. GENERATED FROM PYTHON SOURCE LINES 135-136

Given the list of strings, we create a symbolic function which computes the values of the monomials.

.. GENERATED FROM PYTHON SOURCE LINES 138-141

.. code-block:: default

    basis = ot.SymbolicFunction(["x"], polynomialCollection)
    basis

.. raw:: html

    [x]->[x^1,x^2,x^3,x^4]

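
To make the role of this symbolic function concrete, here is a minimal pure Python sketch of what it computes for a single scalar input.
The name ``monomial_basis`` is hypothetical and is not part of OpenTURNS; it only illustrates the mapping from ``x`` to the monomials of degrees 1 to ``total_degree``, without a constant term.

.. code-block:: python

    def monomial_basis(x, total_degree):
        """Return [x^1, ..., x^total_degree] for a scalar x.

        Hypothetical helper for illustration only: the constant term x^0
        is deliberately left out, as in the basis defined above.
        """
        return [x ** degree for degree in range(1, total_degree + 1)]


    # Powers of 0.5 are exactly representable in floating point
    print(monomial_basis(0.5, 4))  # [0.5, 0.25, 0.125, 0.0625]

Applying such a function to each training abscissa gives one row of monomial values per observation.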



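.. GENERATED FROM PYTHON SOURCE LINES 142-145

.. code-block:: default

    designMatrix = basis(x_train)
    designMatrix

.. raw:: html

    <table>
      <tr><th></th><th>y0</th><th>y1</th><th>y2</th><th>y3</th></tr>
      <tr><th>0</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>
      <tr><th>1</th><td>0.1111111</td><td>0.01234568</td><td>0.001371742</td><td>0.0001524158</td></tr>
      <tr><th>2</th><td>0.2222222</td><td>0.04938272</td><td>0.01097394</td><td>0.002438653</td></tr>
      <tr><th>3</th><td>0.3333333</td><td>0.1111111</td><td>0.03703704</td><td>0.01234568</td></tr>
      <tr><th>4</th><td>0.4444444</td><td>0.1975309</td><td>0.0877915</td><td>0.03901844</td></tr>
      <tr><th>5</th><td>0.5555556</td><td>0.308642</td><td>0.1714678</td><td>0.09525987</td></tr>
      <tr><th>6</th><td>0.6666667</td><td>0.4444444</td><td>0.2962963</td><td>0.1975309</td></tr>
      <tr><th>7</th><td>0.7777778</td><td>0.6049383</td><td>0.4705075</td><td>0.3659503</td></tr>
      <tr><th>8</th><td>0.8888889</td><td>0.7901235</td><td>0.702332</td><td>0.6242951</td></tr>
      <tr><th>9</th><td>1</td><td>1</td><td>1</td><td>1</td></tr>
    </table>
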
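
The least squares problem stated earlier can also be sketched independently of OpenTURNS.
The following is a minimal illustration with NumPy, not the method used in this example: the function name ``fit_polynomial`` and the noise-free quadratic data are assumptions made for the sketch.
Unlike the matrix above, the constant column :math:`x^0` is included explicitly, so that the matrix matches the definition :math:`x_{i,j} = x_i^j` for :math:`j=0,\ldots,p`.

.. code-block:: python

    import numpy as np


    def fit_polynomial(x, y, degree):
        """Return the least squares coefficients beta_0, ..., beta_degree.

        Hypothetical helper for illustration only.
        """
        # Design matrix with columns x^0, x^1, ..., x^degree
        X = np.vander(x, degree + 1, increasing=True)
        # Minimize ||X beta - y||_2^2
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        return beta


    # Noise-free observations of a known quadratic (assumed data)
    x = np.linspace(0.0, 1.0, 10)
    y = 1.0 + 2.0 * x - 3.0 * x**2
    beta = fit_polynomial(x, y, 2)
    print(beta)

Since the data are generated without noise from a degree 2 polynomial, the computed coefficients should be close to 1, 2 and -3, up to rounding errors.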


.. GENERATED FROM PYTHON SOURCE LINES 146-149

.. code-block:: default

    myLeastSquares = ot.LinearLeastSquares(designMatrix, y_train)
    myLeastSquares.run()

.. GENERATED FROM PYTHON SOURCE LINES 150-152

.. code-block:: default

    responseSurface = myLeastSquares.getMetaModel()

.. GENERATED FROM PYTHON SOURCE LINES 153-154

The couple (`x_test`,`y_test`) is the test set: it is used to assess the quality of the polynomial model
at points that were not used for training.
Here, `y_test` contains the predictions of the model on a regular grid of 50 points.

.. GENERATED FROM PYTHON SOURCE LINES 156-160

.. code-block:: default

    n_test = 50
    x_test = linearSample(0, 1, n_test)
    y_test = responseSurface(basis(x_test))

.. GENERATED FROM PYTHON SOURCE LINES 161-177

.. code-block:: default

    graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "topright")
    # The "unknown" function
    curve = g.draw(0, 1)
    curve.setColors(["green"])
    graph.add(curve)
    # Training set
    cloud = ot.Cloud(x_train, y_train)
    cloud.setPointStyle("circle")
    graph.add(cloud)
    # Predictions
    curve = ot.Curve(x_test, y_test)
    curve.setLegend("Polynomial Degree = %d" % (total_degree))
    curve.setColor("red")
    graph.add(curve)
    view = otv.View(graph)

.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_003.png
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 178-179

For each observation in the training set, the error is the vertical distance between the model and the observation.

.. GENERATED FROM PYTHON SOURCE LINES 181-208

.. code-block:: default

    graph = ot.Graph(
        "Least squares minimizes the sum of the squares of the vertical bars",
        "x",
        "y",
        True,
        "topright",
    )
    # Training set observations
    cloud = ot.Cloud(x_train, y_train)
    cloud.setPointStyle("circle")
    graph.add(cloud)
    # Predictions
    curve = ot.Curve(x_test, y_test)
    curve.setLegend("Polynomial Degree = %d" % (total_degree))
    curve.setColor("red")
    graph.add(curve)
    # Errors
    ypredicted_train = responseSurface(basis(x_train))
    for i in range(n_train):
        curve = ot.Curve([x_train[i], x_train[i]], [
                         y_train[i], ypredicted_train[i]])
        curve.setColor("green")
        curve.setLineWidth(2)
        graph.add(curve)
    view = otv.View(graph)

.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_004.png
   :alt: Least squares minimizes the sum of the squares of the vertical bars
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 209-210

The least squares method minimizes the sum of the squared errors, i.e. the sum of the squares of the lengths of the vertical segments.

.. GENERATED FROM PYTHON SOURCE LINES 212-213

We gather the previous computations in two different functions.
The `myPolynomialDataFitting` function computes the least squares solution and `myPolynomialCurveFittingGraph` plots the results.

.. GENERATED FROM PYTHON SOURCE LINES 215-230

.. code-block:: default

    def myPolynomialDataFitting(total_degree, x_train, y_train):
        """Computes the polynomial curve fitting
        with given total degree.
        This is for learning purposes only: please consider a serious metamodel
        for real applications, e.g. polynomial chaos or kriging."""
        polynomialCollection = ["x^%d" % (degree) for degree in range(1, total_degree + 1)]
        basis = ot.SymbolicFunction(["x"], polynomialCollection)
        designMatrix = basis(x_train)
        myLeastSquares = ot.LinearLeastSquares(designMatrix, y_train)
        myLeastSquares.run()
        responseSurface = myLeastSquares.getMetaModel()
        return responseSurface, basis

.. GENERATED FROM PYTHON SOURCE LINES 231-259

.. code-block:: default

    def myPolynomialCurveFittingGraph(total_degree, x_train, y_train):
        """Returns the graphics for a polynomial curve fitting
        with given total degree"""
        responseSurface, basis = myPolynomialDataFitting(
            total_degree, x_train, y_train)
        # Test grid and predictions
        n_test = 100
        x_test = linearSample(0, 1, n_test)
        ypredicted_test = responseSurface(basis(x_test))
        # Graphics
        graph = ot.Graph("Polynomial curve fitting", "x", "y", True, "topright")
        # The "unknown" function
        curve = g.draw(0, 1)
        curve.setColors(["green"])
        graph.add(curve)
        # Training set
        cloud = ot.Cloud(x_train, y_train)
        cloud.setPointStyle("circle")
        cloud.setLegend("N=%d" % (x_train.getSize()))
        graph.add(cloud)
        # Predictions
        curve = ot.Curve(x_test, ypredicted_test)
        curve.setLegend("Polynomial Degree = %d" % (total_degree))
        curve.setColor("red")
        graph.add(curve)
        return graph

.. GENERATED FROM PYTHON SOURCE LINES 260-261

In order to see the effect of the polynomial degree, we compare the polynomial fit
with degrees equal to 0 (constant), 1 (linear), 3 (cubic) and 9 (nonic).

.. GENERATED FROM PYTHON SOURCE LINES 263-282

.. code-block:: default

    fig = pl.figure(figsize=(12, 9))
    _ = fig.suptitle("Polynomial curve fitting")
    ax_1 = fig.add_subplot(2, 2, 1)
    _ = ot.viewer.View(
        myPolynomialCurveFittingGraph(0, x_train, y_train), figure=fig, axes=[ax_1]
    )
    ax_2 = fig.add_subplot(2, 2, 2)
    _ = ot.viewer.View(
        myPolynomialCurveFittingGraph(1, x_train, y_train), figure=fig, axes=[ax_2]
    )
    ax_3 = fig.add_subplot(2, 2, 3)
    _ = ot.viewer.View(
        myPolynomialCurveFittingGraph(3, x_train, y_train), figure=fig, axes=[ax_3]
    )
    ax_4 = fig.add_subplot(2, 2, 4)
    _ = ot.viewer.View(
        myPolynomialCurveFittingGraph(9, x_train, y_train), figure=fig, axes=[ax_4]
    )

.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_005.png
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 283-288

When the polynomial degree is too low (0 or 1), the model is too rigid to follow the sine function.
With degree 3, the fit is satisfactory: the polynomial is close to the observations, although there is still some residual error.

However, when the polynomial degree is high (here 9), the polynomial produces large oscillations which significantly deviate from the true function.
This is *overfitting*.
This is a pity, given the fact that the polynomial *exactly* interpolates the observations: the residuals are zero.

If the locations of the x abscissas could be changed, then the oscillations could be made smaller.
This is the method used in Gaussian quadrature, where the interpolation nodes are made closer near the left and right bounds.
In our situation, we make the assumption that these abscissas cannot be changed: the most obvious choice is to limit the degree of the polynomial.
Another possibility is to add a regularization term to the least squares problem.

.. GENERATED FROM PYTHON SOURCE LINES 290-294

Root mean squared error
-----------------------

In order to assess the quality of the polynomial fit, we create a second dataset, the *test set*,
and compare the value of the polynomial with the test observations.

.. GENERATED FROM PYTHON SOURCE LINES 296-298

.. code-block:: default

    sqrt = ot.SymbolicFunction(["x"], ["sqrt(x)"])

.. GENERATED FROM PYTHON SOURCE LINES 299-302

In order to see how close the model is to the observations, we compute the root mean squared error (RMSE).

First, we create a degree 4 polynomial which fits the data.

.. GENERATED FROM PYTHON SOURCE LINES 304-309

.. code-block:: default

    total_degree = 4
    responseSurface, basis = myPolynomialDataFitting(
        total_degree, x_train, y_train)

.. GENERATED FROM PYTHON SOURCE LINES 310-311

Then we create a test set, with the same method as before.

.. GENERATED FROM PYTHON SOURCE LINES 313-320

.. code-block:: default

    def createDataset(n):
        x = linearSample(0, 1, n)
        noiseSample = noise.getSample(n)
        y = g(x) + noiseSample
        return x, y

.. GENERATED FROM PYTHON SOURCE LINES 321-324

.. code-block:: default

    n_test = 100
    x_test, y_test = createDataset(n_test)

.. GENERATED FROM PYTHON SOURCE LINES 325-326

On this test set, we evaluate the polynomial.

.. GENERATED FROM PYTHON SOURCE LINES 328-330

.. code-block:: default

    ypredicted_test = responseSurface(basis(x_test))

.. GENERATED FROM PYTHON SOURCE LINES 331-332

The vector of residuals is the vector of the differences between the observations and the predictions.

.. GENERATED FROM PYTHON SOURCE LINES 334-336

.. code-block:: default

    residuals = y_test.asPoint() - ypredicted_test.asPoint()

.. GENERATED FROM PYTHON SOURCE LINES 337-338

The `normSquare` method computes the square of the Euclidean norm (i.e. the 2-norm).
We divide this by the test sample size (so as to compare the error for different sample sizes)
and compute the square root of the result (so that the result has the same unit as y).

.. GENERATED FROM PYTHON SOURCE LINES 340-344

.. code-block:: default

    RMSE = sqrt([residuals.normSquare() / n_test])[0]
    RMSE

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    0.14464766752910935

.. GENERATED FROM PYTHON SOURCE LINES 345-346

The following function gathers the RMSE computation to make the experiment easier.

.. GENERATED FROM PYTHON SOURCE LINES 348-355

.. code-block:: default

    def computeRMSE(responseSurface, basis, x, y):
        ypredicted = responseSurface(basis(x))
        residuals = y.asPoint() - ypredicted.asPoint()
        # Divide by the size of the sample at hand (train or test),
        # not by the global n_test
        n = y.getSize()
        RMSE = sqrt([residuals.normSquare() / n])[0]
        return RMSE

.. GENERATED FROM PYTHON SOURCE LINES 356-367

.. code-block:: default

    maximum_degree = 10
    RMSE_train = ot.Sample(maximum_degree, 1)
    RMSE_test = ot.Sample(maximum_degree, 1)
    for total_degree in range(maximum_degree):
        responseSurface, basis = myPolynomialDataFitting(
            total_degree, x_train, y_train)
        RMSE_train[total_degree, 0] = computeRMSE(
            responseSurface, basis, x_train, y_train)
        RMSE_test[total_degree, 0] = computeRMSE(
            responseSurface, basis, x_test, y_test)

.. GENERATED FROM PYTHON SOURCE LINES 368-384

.. code-block:: default

    degreeSample = ot.Sample([[i] for i in range(maximum_degree)])
    graph = ot.Graph("Root mean square error", "Degree", "RMSE", True, "topright")
    # Train
    cloud = ot.Curve(degreeSample, RMSE_train)
    cloud.setColor("blue")
    cloud.setLegend("Train")
    cloud.setPointStyle("circle")
    graph.add(cloud)
    # Test
    cloud = ot.Curve(degreeSample, RMSE_test)
    cloud.setColor("red")
    cloud.setLegend("Test")
    cloud.setPointStyle("circle")
    graph.add(cloud)
    view = otv.View(graph)

.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_006.png
   :alt: Root mean square error
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_006.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 385-388

We see that the RMSE on the train set continuously decreases, reaching zero when the polynomial degree
is such that the number of coefficients is equal to the size of the training sample.
In this extreme situation, the least squares problem reduces to solving a square linear system of equations: this leads to a zero residual.

On the test set however, the RMSE decreases, reaches a flat region, then increases dramatically when the degree is equal to 9.
Hence, limiting the polynomial degree limits overfitting.

.. GENERATED FROM PYTHON SOURCE LINES 390-394

Increasing the training set
---------------------------

We wonder what happens when the training dataset size is increased.

.. GENERATED FROM PYTHON SOURCE LINES 396-419

.. code-block:: default

    total_degree = 9
    fig = pl.figure(figsize=(12, 9))
    _ = fig.suptitle("Polynomial curve fitting")
    # Small training set
    ax_1 = fig.add_subplot(2, 2, 1)
    n_train = 11
    x_train, y_train = createDataset(n_train)
    _ = ot.viewer.View(
        myPolynomialCurveFittingGraph(total_degree, x_train, y_train),
        figure=fig,
        axes=[ax_1],
    )
    # Large training set
    n_train = 100
    x_train, y_train = createDataset(n_train)
    ax_2 = fig.add_subplot(2, 2, 2)
    _ = ot.viewer.View(
        myPolynomialCurveFittingGraph(total_degree, x_train, y_train),
        figure=fig,
        axes=[ax_2],
    )
    pl.show()

.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_007.png
   :alt: Polynomial curve fitting
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_overfitting_model_selection_007.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 420-421

We see that the degree 9 polynomial oscillates when the training set has size 11, but not when it has size 100:
increasing the size of the training set mitigates the oscillations.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.172 seconds)


.. _sphx_glr_download_auto_meta_modeling_general_purpose_metamodels_plot_overfitting_model_selection.py:

.. only:: html

  .. container:: sphx-glr-footer
     :class: sphx-glr-footer-example

     .. container:: sphx-glr-download sphx-glr-download-python

        :download:`Download Python source code: plot_overfitting_model_selection.py <plot_overfitting_model_selection.py>`

     .. container:: sphx-glr-download sphx-glr-download-jupyter

        :download:`Download Jupyter notebook: plot_overfitting_model_selection.ipynb <plot_overfitting_model_selection.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_