.. only:: html .. note:: :class: sphx-glr-download-link-note Click :ref:`here ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_data_analysis_distribution_fitting_plot_estimate_non_parametric_distribution.py: Fit a non parametric distribution ================================= In this example we are going to estimate a non parametric distribution using the kernel smoothing method. After a short introductory example we focus on a few basic features of the API : - kernel selection - bandwidth estimation - an advanced feature such as boundary corrections .. code-block:: default from __future__ import print_function import openturns as ot import openturns.viewer as viewer from matplotlib import pylab as plt ot.Log.Show(ot.Log.NONE) An introductory example ----------------------- We create the data from a Gamma distribution : .. code-block:: default ot.RandomGenerator.SetSeed(0) distribution = ot.Gamma(6.0, 1.0) sample = distribution.getSample(800) We define the kernel smoother and build the smoothed estimate. .. code-block:: default kernel = ot.KernelSmoothing() estimated = kernel.build(sample) We can draw the original distribution vs the kernel smoothing. .. code-block:: default graph = distribution.drawPDF() graph.setTitle("Kernel smoothing vs original") kernel_plot = estimated.drawPDF().getDrawable(0) kernel_plot.setColor("blue") graph.add(kernel_plot) graph.setLegends(["original", "KS"]) graph.setLegendPosition("topright") view = viewer.View(graph) .. image:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_non_parametric_distribution_001.png :alt: Kernel smoothing vs original :class: sphx-glr-single-img We can obtain the bandwdth parameter : .. code-block:: default print(kernel.getBandwidth()) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none [0.529581] We now compute a better bandwitdh with the Silverman rule. .. code-block:: default bandwidth = kernel.computeSilvermanBandwidth(sample) print(bandwidth) .. rst-class:: sphx-glr-script-out Out: .. code-block:: none [0.639633] The new bandwidth is used to regenerate another fitted distribution : .. code-block:: default estimated = kernel.build(sample, bandwidth) .. code-block:: default graph = distribution.drawPDF() graph.setTitle("Kernel smoothing vs original") kernel_plot = estimated.drawPDF().getDrawable(0) kernel_plot.setColor("blue") graph.add(kernel_plot) graph.setLegends(["original", "KS-Silverman"]) graph.setLegendPosition("topright") view = viewer.View(graph) .. image:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_non_parametric_distribution_002.png :alt: Kernel smoothing vs original :class: sphx-glr-single-img The Silverman rule of thumb to estimate the bandwidth provides a better estimate for the distribution. We can also study the impact of the kernel selection. Choosing a kernel ----------------- We experiment with several kernels to perform the smoothing : - the standard normal kernel - the triangular kernel - the Epanechnikov kernel - the uniform kernel We first create the data from a Gamma distribution : .. code-block:: default distribution = ot.Gamma(6.0, 1.0) sample = distribution.getSample(800) The definition of the Normal kernel : .. code-block:: default kernelNormal = ot.KernelSmoothing(ot.Normal()) estimatedNormal = kernelNormal.build(sample) The definition of the Triangular kernel : .. code-block:: default kernelTriangular = ot.KernelSmoothing(ot.Triangular()) estimatedTriangular = kernelTriangular.build(sample) The definition of the Epanechnikov kernel : .. code-block:: default kernelEpanechnikov = ot.KernelSmoothing(ot.Epanechnikov()) estimatedEpanechnikov = kernelEpanechnikov.build(sample) The definition of the Uniform kernel : .. code-block:: default kernelUniform = ot.KernelSmoothing(ot.Uniform()) estimatedUniform = kernelUniform.build(sample) We finally compare all the distributions : .. code-block:: default graph = distribution.drawPDF() graph.setTitle("Different kernel smoothings vs original distribution") graph.setGrid(True) kernel_estimatedNormal_plot = estimatedNormal.drawPDF().getDrawable(0) kernel_estimatedNormal_plot.setColor("blue") graph.add(kernel_estimatedNormal_plot) kernel_estimatedTriangular_plot = estimatedTriangular.drawPDF().getDrawable(0) kernel_estimatedTriangular_plot.setColor("green") graph.add(kernel_estimatedTriangular_plot) kernel_estimatedEpanechnikov_plot = estimatedEpanechnikov.drawPDF().getDrawable(0) kernel_estimatedEpanechnikov_plot.setColor("orange") graph.add(kernel_estimatedEpanechnikov_plot) kernel_estimatedUniform_plot = estimatedUniform.drawPDF().getDrawable(0) kernel_estimatedUniform_plot.setColor("black") kernel_estimatedUniform_plot.setLineStyle("dashed") graph.add(kernel_estimatedUniform_plot) graph.setLegends( ["original", "KS-Normal", "KS-Triangular", "KS-Epanechnikov", "KS-Uniform"] ) view = viewer.View(graph) .. image:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_non_parametric_distribution_003.png :alt: Different kernel smoothings vs original distribution :class: sphx-glr-single-img We observe that all the kernels produce very similar results in practice. The Uniform kernel may be seen as the worst of them all while the Epanechnikov one is said to be a good theoritical choice. In practice the standard normal kernel is a fine choice. The most important aspect of kernel smoothing is the choice of the bandwith. Bandwidth selection ------------------- We reproduce a classical example of the literature : the fitting of a bimodal distribution. We will show the result of a kernel smoothing with different bandwidth computation : - the Silverman rule - the Plugin bandwidth - the Mixed bandwidth We define the bimodal distribution and generate a sample out of it. .. code-block:: default X1 = ot.Normal(10.0, 1.0) X2 = ot.Normal(-10.0, 1.0) myDist = ot.Mixture([X1, X2]) sample = myDist.getSample(2000) We now compare the fitted distribution : .. code-block:: default graph = myDist.drawPDF() graph.setTitle("Kernel smoothing vs original") With the Silverman rule : .. code-block:: default kernelSB = ot.KernelSmoothing() bandwidthSB = kernelSB.computeSilvermanBandwidth(sample) estimatedSB = kernelSB.build(sample, bandwidthSB) kernelSB_plot = estimatedSB.drawPDF().getDrawable(0) graph.add(kernelSB_plot) With the Plugin bandwidth : .. code-block:: default kernelPB = ot.KernelSmoothing() bandwidthPB = kernelPB.computePluginBandwidth(sample) estimatedPB = kernelPB.build(sample, bandwidthPB) kernelPB_plot = estimatedPB.drawPDF().getDrawable(0) graph.add(kernelPB_plot) With the Mixed bandwidth : .. code-block:: default kernelMB = ot.KernelSmoothing() bandwidthMB = kernelMB.computeMixedBandwidth(sample) estimatedMB = kernelMB.build(sample, bandwidthMB) kernelMB_plot = estimatedMB.drawPDF().getDrawable(0) kernelMB_plot.setLineStyle("dashed") graph.add(kernelMB_plot) .. code-block:: default graph.setLegends(["original", "KS-Silverman", "KS-Plugin", "KS-Mixed"]) graph.setColors(["red", "blue", "green", "black"]) graph.setLegendPosition("topright") view = viewer.View(graph) .. image:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_non_parametric_distribution_004.png :alt: Kernel smoothing vs original :class: sphx-glr-single-img As expected the Silverman seriously overfit the data and the other rules are more to the point. Boundary corrections -------------------- We finish this example on an advanced feature of the kernel smoothing, the boundary corrections. We consider a Weibull distribution : .. code-block:: default myDist = ot.WeibullMin(2.0, 1.5, 1.0) We generate a sample from the defined distribution : .. code-block:: default sample = myDist.getSample(2000) We draw the exact Weibull distribution : .. code-block:: default graph = myDist.drawPDF() We use two different kernels : - a standard normal kernel - the same kernel with a boundary correction The first kernel without the boundary corrections. .. code-block:: default kernel1 = ot.KernelSmoothing() estimated1 = kernel1.build(sample) The second kernel with the boundary corrections. .. code-block:: default kernel2 = ot.KernelSmoothing() kernel2.setBoundaryCorrection(True) estimated2 = kernel2.build(sample) We compare the estimated PDFs : .. code-block:: default graph.setTitle("Kernel smoothing vs original") kernel1_plot = estimated1.drawPDF().getDrawable(0) kernel1_plot.setColor("blue") graph.add(kernel1_plot) kernel2_plot = estimated2.drawPDF().getDrawable(0) kernel2_plot.setColor("green") graph.add(kernel2_plot) graph.setLegends(["original", "KS", "KS with boundary correction"]) graph.setLegendPosition("topright") view = viewer.View(graph) .. image:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_non_parametric_distribution_005.png :alt: Kernel smoothing vs original :class: sphx-glr-single-img The boundary correction made has a remarkable impact on the quality of the estimate for the small values. .. code-block:: default plt.show() .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 0 minutes 1.118 seconds) .. _sphx_glr_download_auto_data_analysis_distribution_fitting_plot_estimate_non_parametric_distribution.py: .. only :: html .. container:: sphx-glr-footer :class: sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_estimate_non_parametric_distribution.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_estimate_non_parametric_distribution.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_