.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_meta_modeling/general_purpose_metamodels/plot_distribution_linear_regression.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_meta_modeling_general_purpose_metamodels_plot_distribution_linear_regression.py: Distribution of estimators in linear regression =============================================== .. GENERATED FROM PYTHON SOURCE LINES 7-32 Introduction ~~~~~~~~~~~~ In this example, we are interested in the distribution of the estimator of the variance of the observation error in linear regression. We are also interested in the estimator of the standard deviation of the observation error. We show how to use the :class:`~openturns.PythonRandomVector` class in order to perform a study of the sample distribution of these estimators. In the general linear regression model, the observation error :math:`\epsilon` has the normal distribution :math:`\cN(0, \sigma^2)` where :math:`\sigma > 0` is the standard deviation. We are interested in the estimators of the variance :math:`\sigma^2` and the standard deviation :math:`\sigma`: - the variance of the residuals, :math:`\sigma^2`, is estimated from :meth:`~openturns.LinearModelResult.getResidualsVariance`; - the standard deviation :math:`\sigma` is estimated from :meth:`~openturns.LinearModelAnalysis.getResidualsStandardError`. The asymptotic distribution of these estimators is known (see [vaart2000]_) but we want to perform an empirical study, based on simulation. In order to see the distribution of the estimator, we simulate an observation of the estimator, and repeat that experiment :math:`r \in \Nset` times, where :math:`r` is a large integer. Then we approximate the distribution using :class:`~openturns.KernelSmoothing`. .. GENERATED FROM PYTHON SOURCE LINES 32-38 .. code-block:: Python import openturns as ot import openturns.viewer as otv import pylab as pl .. GENERATED FROM PYTHON SOURCE LINES 39-46 The simulation engine ~~~~~~~~~~~~~~~~~~~~~ The simulation is based on the :class:`~openturns.PythonRandomVector` class, which simulates independent observations of a random vector. The `getRealization()` method implements the simulation of the observation of the estimator. .. GENERATED FROM PYTHON SOURCE LINES 46-157 .. code-block:: Python class SampleEstimatorLinearRegression(ot.PythonRandomVector): def __init__( self, sample_size, true_standard_deviation, coefficients, estimator="variance" ): """ Create a RandomVector for an estimator from a linear regression model. Parameters ---------- sample_size : int The sample size n. true_standard_deviation : float The standard deviation of the Gaussian observation error. coefficients: sequence of p floats The coefficients of the linear model. estimator: str The estimator. Available estimators are "variance" or "standard-deviation". """ super(SampleEstimatorLinearRegression, self).__init__(1) self.sample_size = sample_size self.numberOfParameters = coefficients.getDimension() input_dimension = self.numberOfParameters - 1 # Because of the intercept centerCoefficient = [0.0] * input_dimension constantCoefficient = [coefficients[0]] linearCoefficient = ot.Matrix([coefficients[1:]]) self.linearModel = ot.LinearFunction( centerCoefficient, constantCoefficient, linearCoefficient ) self.distribution = ot.Normal(input_dimension) self.distribution.setDescription(["x%d" % (i) for i in range(input_dimension)]) self.errorDistribution = ot.Normal(0.0, true_standard_deviation) self.estimator = estimator # In classical linear regression, the input is deterministic. # Here, we set it once and for all. self.input_sample = self.distribution.getSample(self.sample_size) self.output_sample = self.linearModel(self.input_sample) def getRealization(self): errorSample = self.errorDistribution.getSample(self.sample_size) noisy_output_sample = self.output_sample + errorSample algo = ot.LinearModelAlgorithm(self.input_sample, noisy_output_sample) lmResult = algo.getResult() if self.estimator == "variance": output = lmResult.getResidualsVariance() elif self.estimator == "standard-deviation": lmAnalysis = ot.LinearModelAnalysis(lmResult) output = lmAnalysis.getResidualsStandardError() else: raise ValueError("Unknown estimator %s" % (self.estimator)) return [output] def plot_sample_by_kernel_smoothing( sample_size, true_standard_deviation, coefficients, estimator, repetitions_size ): """ Plot the estimated distribution of the biased sample variance from a sample The method uses Kernel Smoothing with default kernel. Parameters ---------- repetitions_size : int The number of repetitions of the experiments. This is the (children) size of the sample of the biased sample variance Returns ------- graph : ot.Graph The plot of the PDF of the estimated distribution. """ myRV = ot.RandomVector( SampleEstimatorLinearRegression( sample_size, true_standard_deviation, coefficients, estimator ) ) sample_estimator = myRV.getSample(repetitions_size) if estimator == "variance": sample_estimator.setDescription([r"$\hat{\sigma}^2$"]) elif estimator == "standard-deviation": sample_estimator.setDescription([r"$\hat{\sigma}$"]) else: raise ValueError("Unknown estimator %s" % (estimator)) mean_sample_estimator = sample_estimator.computeMean() graph = ot.KernelSmoothing().build(sample_estimator).drawPDF() graph.setLegends(["Sample"]) bb = graph.getBoundingBox() ylb = bb.getLowerBound()[1] yub = bb.getUpperBound()[1] curve = ot.Curve([true_standard_deviation**2] * 2, [ylb, yub]) curve.setLegend("Exact") curve.setLineWidth(2.0) graph.add(curve) graph.setTitle( "Size = %d, True S.D. = %.4f, Mean = %.4f, Rep. = %d" % ( sample_size, true_standard_deviation, mean_sample_estimator[0], repetitions_size, ) ) return graph .. GENERATED FROM PYTHON SOURCE LINES 158-164 Distribution of the variance estimator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We first consider the estimation of the variance :math:`\sigma^2`. In the next cell, we consider a sample size equal to :math:`n = 6` with :math:`p = 3` parameters. We use :math:`r = 1000` repetitions. .. GENERATED FROM PYTHON SOURCE LINES 164-180 .. code-block:: Python repetitions_size = 1000 true_standard_deviation = 0.1 sample_size = 6 coefficients = ot.Point([3.0, 2.0, -1.0]) estimator = "variance" view = otv.View( plot_sample_by_kernel_smoothing( sample_size, true_standard_deviation, coefficients, estimator, repetitions_size ), figure_kw={"figsize": (6.0, 3.5)}, ) pl.subplots_adjust(bottom=0.25) .. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_001.png :alt: Size = 6, True S.D. = 0.1000, Mean = 0.0100, Rep. = 1000 :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 181-185 If we use a sample size equal to :math:`n = 6` with :math:`p = 3` parameters, the distribution is not symmetric. The mean of the observations of the sample variances remains close to the true value :math:`0.1^2 = 0.01`. .. GENERATED FROM PYTHON SOURCE LINES 187-188 Then we increase the sample size :math:`n`. .. GENERATED FROM PYTHON SOURCE LINES 188-202 .. code-block:: Python repetitions_size = 1000 true_standard_deviation = 0.1 sample_size = 100 coefficients = ot.Point([3.0, 2.0, -1.0]) view = otv.View( plot_sample_by_kernel_smoothing( sample_size, true_standard_deviation, coefficients, estimator, repetitions_size ), figure_kw={"figsize": (6.0, 3.5)}, ) pl.subplots_adjust(bottom=0.25) .. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_002.png :alt: Size = 100, True S.D. = 0.1000, Mean = 0.0101, Rep. = 1000 :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 203-206 If we use a sample size equal to :math:`n = 6` with :math:`p = 3` parameters, the distribution is almost symmetric and almost normal. .. GENERATED FROM PYTHON SOURCE LINES 208-211 Distribution of the standard deviation estimator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We now consider the estimation of the standard deviation :math:`\sigma`. .. GENERATED FROM PYTHON SOURCE LINES 211-227 .. code-block:: Python repetitions_size = 1000 true_standard_deviation = 0.1 sample_size = 6 coefficients = ot.Point([3.0, 2.0, -1.0]) estimator = "standard-deviation" view = otv.View( plot_sample_by_kernel_smoothing( sample_size, true_standard_deviation, coefficients, estimator, repetitions_size ), figure_kw={"figsize": (6.0, 3.5)}, ) pl.subplots_adjust(bottom=0.25) .. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_003.png :alt: Size = 6, True S.D. = 0.1000, Mean = 0.0938, Rep. = 1000 :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 228-233 If we use a sample size equal to :math:`n = 6` with :math:`p = 3` parameters, we see that the distribution is almost symmetric. We notice a slight bias, since the mean of the observations of the standard deviation is not as close to the true value 0.1 as we could expect. .. GENERATED FROM PYTHON SOURCE LINES 233-249 .. code-block:: Python repetitions_size = 1000 true_standard_deviation = 0.1 sample_size = 100 coefficients = ot.Point([3.0, 2.0, -1.0]) estimator = "standard-deviation" view = otv.View( plot_sample_by_kernel_smoothing( sample_size, true_standard_deviation, coefficients, estimator, repetitions_size ), figure_kw={"figsize": (6.0, 3.5)}, ) pl.subplots_adjust(bottom=0.25) .. image-sg:: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_004.png :alt: Size = 100, True S.D. = 0.1000, Mean = 0.0994, Rep. = 1000 :srcset: /auto_meta_modeling/general_purpose_metamodels/images/sphx_glr_plot_distribution_linear_regression_004.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 250-253 If we use a sample size equal to :math:`n = 100` with :math:`p = 3` parameters, we see that the distribution is almost normal. We notice that the bias disappeared. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 5.452 seconds) .. _sphx_glr_download_auto_meta_modeling_general_purpose_metamodels_plot_distribution_linear_regression.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_distribution_linear_regression.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_distribution_linear_regression.py `