.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_meta_modeling/general_purpose_metamodels/plot_distribution_linear_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_meta_modeling_general_purpose_metamodels_plot_distribution_linear_regression.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_meta_modeling_general_purpose_metamodels_plot_distribution_linear_regression.py:


Distribution of estimators in linear regression
===============================================

.. GENERATED FROM PYTHON SOURCE LINES 7-32

Introduction
~~~~~~~~~~~~

In this example, we are interested in the distribution of the estimator of the variance
of the observation error in linear regression.
We are also interested in the estimator of the standard deviation of the
observation error.
We show how to use the :class:`~openturns.PythonRandomVector` class in order to
perform a study of the sample distribution of these estimators.

In the general linear regression model, the observation error :math:`\epsilon` has the
normal distribution :math:`\cN(0, \sigma^2)` where :math:`\sigma > 0`
is the standard deviation.
We are interested in the estimators of the variance :math:`\sigma^2`
and the standard deviation :math:`\sigma`:

- the variance of the residuals, :math:`\sigma^2`, is estimated from :meth:`~openturns.LinearModelResult.getResidualsVariance`;
- the standard deviation :math:`\sigma` is estimated from :meth:`~openturns.LinearModelAnalysis.getResidualsStandardError`.

The asymptotic distribution of these estimators is known (see [vaart2000]_)
but we want to perform an empirical study, based on simulation.
In order to see the distribution of the estimator, we simulate an observation of the estimator,
and repeat that experiment :math:`r \in \Nset` times, where :math:`r`
is a large integer.
Then we approximate the distribution using :class:`~openturns.KernelSmoothing`.

.. GENERATED FROM PYTHON SOURCE LINES 32-38

.. code-block:: Python


    import openturns as ot
    import openturns.viewer as otv
    import pylab as pl


.. GENERATED FROM PYTHON SOURCE LINES 39-46

The simulation engine
~~~~~~~~~~~~~~~~~~~~~
The simulation is based on the :class:`~openturns.PythonRandomVector` class,
which simulates independent observations of a random vector.
The `getRealization()` method implements the simulation of the observation
of the estimator.


.. GENERATED FROM PYTHON SOURCE LINES 46-157

.. code-block:: Python


    class SampleEstimatorLinearRegression(ot.PythonRandomVector):
        def __init__(
            self, sample_size, true_standard_deviation, coefficients, estimator="variance"
        ):
            """
            Create a RandomVector for an estimator from a linear regression model.

            Parameters
            ----------
            sample_size : int
                The sample size n.
            true_standard_deviation : float
                The standard deviation of the Gaussian observation error.
            coefficients: sequence of p floats
                The coefficients of the linear model.
            estimator: str
                The estimator.
                Available estimators are "variance" or "standard-deviation".
            """
            super(SampleEstimatorLinearRegression, self).__init__(1)
            self.sample_size = sample_size
            self.numberOfParameters = coefficients.getDimension()
            input_dimension = self.numberOfParameters - 1  # Because of the intercept
            centerCoefficient = [0.0] * input_dimension
            constantCoefficient = [coefficients[0]]
            linearCoefficient = ot.Matrix([coefficients[1:]])
            self.linearModel = ot.LinearFunction(
                centerCoefficient, constantCoefficient, linearCoefficient
            )
            self.distribution = ot.Normal(input_dimension)
            self.distribution.setDescription(["x%d" % (i) for i in range(input_dimension)])
            self.errorDistribution = ot.Normal(0.0, true_standard_deviation)
            self.estimator = estimator
            # In classical linear regression, the input is deterministic.
            # Here, we set it once and for all.
            self.input_sample = self.distribution.getSample(self.sample_size)
            self.output_sample = self.linearModel(self.input_sample)

        def getRealization(self):
            errorSample = self.errorDistribution.getSample(self.sample_size)
            noisy_output_sample = self.output_sample + errorSample
            algo = ot.LinearModelAlgorithm(self.input_sample, noisy_output_sample)
            lmResult = algo.getResult()
            if self.estimator == "variance":
                output = lmResult.getResidualsVariance()
            elif self.estimator == "standard-deviation":
                lmAnalysis = ot.LinearModelAnalysis(lmResult)
                output = lmAnalysis.getResidualsStandardError()
            else:
                raise ValueError("Unknown estimator %s" % (self.estimator))
            return [output]


    def plot_sample_by_kernel_smoothing(
        sample_size, true_standard_deviation, coefficients, estimator, repetitions_size
    ):
        """
        Plot the estimated distribution of the biased sample variance from a sample

        The method uses Kernel Smoothing with default kernel.

        Parameters
        ----------
        repetitions_size : int
            The number of repetitions of the experiments.
            This is the (children) size of the sample of the biased
            sample variance

        Returns
        -------
        graph : ot.Graph
            The plot of the PDF of the estimated distribution.

        """
        myRV = ot.RandomVector(
            SampleEstimatorLinearRegression(
                sample_size, true_standard_deviation, coefficients, estimator
            )
        )
        sample_estimator = myRV.getSample(repetitions_size)
        if estimator == "variance":
            sample_estimator.setDescription([r"$\hat{\sigma}^2$"])
        elif estimator == "standard-deviation":
            sample_estimator.setDescription([r"$\hat{\sigma}$"])
        else:
            raise ValueError("Unknown estimator %s" % (estimator))
        mean_sample_estimator = sample_estimator.computeMean()

        graph = ot.KernelSmoothing().build(sample_estimator).drawPDF()
        graph.setLegends(["Sample"])
        bb = graph.getBoundingBox()
        ylb = bb.getLowerBound()[1]
        yub = bb.getUpperBound()[1]
        curve = ot.Curve([true_standard_deviation**2] * 2, [ylb, yub])
        curve.setLegend("Exact")
        curve.setLineWidth(2.0)
        graph.add(curve)
        graph.setTitle(
            "Size = %d, True S.D. = %.4f, Mean = %.4f, Rep. = %d"
            % (
                sample_size,
                true_standard_deviation,
                mean_sample_estimator[0],
                repetitions_size,
            )
        )
        return graph


.. GENERATED FROM PYTHON SOURCE LINES 158-164

Distribution of the variance estimator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We first consider the estimation of the variance :math:`\sigma^2`.
In the next cell, we consider a sample size equal to :math:`n = 6` with
:math:`p = 3` parameters.
We use :math:`r = 1000` repetitions.

.. GENERATED FROM PYTHON SOURCE LINES 164-180

.. code-block:: Python


    repetitions_size = 1000
    true_standard_deviation = 0.1
    sample_size = 6
    coefficients = ot.Point([3.0, 2.0, -1.0])
    estimator = "variance"
    view = otv.View(
        plot_sample_by_kernel_smoothing(
            sample_size, true_standard_deviation, coefficients, estimator, repetitions_size
        ),
        figure_kw={"figsize": (6.0, 3.5)},
    )
    pl.subplots_adjust(bottom=0.25)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_001.png
   :alt: Size = 6, True S.D. = 0.1000, Mean = 0.0100, Rep. = 1000
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 181-185

If we use a sample size equal to :math:`n = 6` with
:math:`p = 3` parameters, the distribution is not symmetric.
The mean of the observations of the sample variances remains close to
the true value :math:`0.1^2 = 0.01`.

.. GENERATED FROM PYTHON SOURCE LINES 187-188

Then we increase the sample size :math:`n`.

.. GENERATED FROM PYTHON SOURCE LINES 188-202

.. code-block:: Python


    repetitions_size = 1000
    true_standard_deviation = 0.1
    sample_size = 100
    coefficients = ot.Point([3.0, 2.0, -1.0])
    view = otv.View(
        plot_sample_by_kernel_smoothing(
            sample_size, true_standard_deviation, coefficients, estimator, repetitions_size
        ),
        figure_kw={"figsize": (6.0, 3.5)},
    )
    pl.subplots_adjust(bottom=0.25)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_002.png
   :alt: Size = 100, True S.D. = 0.1000, Mean = 0.0101, Rep. = 1000
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 203-206

If we use a sample size equal to :math:`n = 6` with
:math:`p = 3` parameters, the distribution is almost symmetric and
almost normal.

.. GENERATED FROM PYTHON SOURCE LINES 208-211

Distribution of the standard deviation estimator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We now consider the estimation of the standard deviation :math:`\sigma`.

.. GENERATED FROM PYTHON SOURCE LINES 211-227

.. code-block:: Python


    repetitions_size = 1000
    true_standard_deviation = 0.1
    sample_size = 6
    coefficients = ot.Point([3.0, 2.0, -1.0])
    estimator = "standard-deviation"
    view = otv.View(
        plot_sample_by_kernel_smoothing(
            sample_size, true_standard_deviation, coefficients, estimator, repetitions_size
        ),
        figure_kw={"figsize": (6.0, 3.5)},
    )
    pl.subplots_adjust(bottom=0.25)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_003.png
   :alt: Size = 6, True S.D. = 0.1000, Mean = 0.0938, Rep. = 1000
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 228-233

If we use a sample size equal to :math:`n = 6` with
:math:`p = 3` parameters, we see that the distribution is almost symmetric.
We notice a slight bias, since the mean of the observations of the
standard deviation is not as close to the true value 0.1
as we could expect.

.. GENERATED FROM PYTHON SOURCE LINES 233-249

.. code-block:: Python


    repetitions_size = 1000
    true_standard_deviation = 0.1
    sample_size = 100
    coefficients = ot.Point([3.0, 2.0, -1.0])
    estimator = "standard-deviation"
    view = otv.View(
        plot_sample_by_kernel_smoothing(
            sample_size, true_standard_deviation, coefficients, estimator, repetitions_size
        ),
        figure_kw={"figsize": (6.0, 3.5)},
    )
    pl.subplots_adjust(bottom=0.25)


.. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_004.png
   :alt: Size = 100, True S.D. = 0.1000, Mean = 0.0994, Rep. = 1000
   :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 250-253

If we use a sample size equal to :math:`n = 100` with
:math:`p = 3` parameters, we see that the distribution is almost normal.
We notice that the bias disappeared.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 5.452 seconds)


.. _sphx_glr_download_auto_meta_modeling_general_purpose_metamodels_plot_distribution_linear_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_distribution_linear_regression.ipynb <plot_distribution_linear_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_distribution_linear_regression.py <plot_distribution_linear_regression.py>`