.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_data_analysis/distribution_fitting/plot_estimate_gpd_dowjones.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_data_analysis_distribution_fitting_plot_estimate_gpd_dowjones.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_data_analysis_distribution_fitting_plot_estimate_gpd_dowjones.py:


Estimate a GPD on the Dow Jones Index data
==========================================

.. GENERATED FROM PYTHON SOURCE LINES 7-10

In this example, we illustrate various techniques of extreme value modeling applied
to the 5-year series of daily Dow Jones Index closing prices.
Readers should refer to [coles2001]_ example 1.8 to get more details.

.. GENERATED FROM PYTHON SOURCE LINES 10-17

.. code-block:: Python

    import math as m
    import openturns as ot
    import openturns.experimental as otexp
    import openturns.viewer as otv
    from openturns.usecases import coles
    import pandas as pd


.. GENERATED FROM PYTHON SOURCE LINES 18-20

First, we load the Dow Jones dataset and plot it through time.
We can see that the process is non-stationary.

.. GENERATED FROM PYTHON SOURCE LINES 20-35

.. code-block:: Python

    full = pd.read_csv(coles.Coles().dowjones, index_col=0, parse_dates=True)
    print(full[:10])
    graph = ot.Graph(
        "Daily closing prices of the Dow Jones Index", "Day index", "Index", True, ""
    )
    # Care: to get the real period range, multiply by a factor to account for non-working days missing values!
    size = len(full)
    days = ot.Sample([[i] for i in range(size)])
    dataDowJones = ot.Sample.BuildFromDataFrame(full)
    curve = ot.Curve(days, dataDowJones)
    curve.setColor("red")
    graph.add(curve)
    graph.setIntegerXTick(True)
    view = otv.View(graph)


.. image-sg:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_001.png
   :alt: Daily closing prices of the Dow Jones Index
   :srcset: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                           Index
    Date                        
    1995-09-11 02:00:00  4704.94
    1995-09-12 02:00:00  4747.21
    1995-09-13 02:00:00  4765.52
    1995-09-14 02:00:00  4801.80
    1995-09-15 02:00:00  4797.57
    1995-09-18 02:00:00  4780.41
    1995-09-19 02:00:00  4767.04
    1995-09-20 02:00:00  4792.69
    1995-09-21 02:00:00  4767.40
    1995-09-22 02:00:00  4764.15


.. GENERATED FROM PYTHON SOURCE LINES 36-46

In that example, the time dependence can
not be explained by trends or seasonal cycles. Many empirical
studies have advised to consider the logarithms of ratios of
successive observations to get an approximation to stationarity.
We apply that transformation:

.. math::
    \tilde{X}_i = \log X_i - \log X_{i-1}.

The resulting time series appears to be reasonably close to stationarity.

.. GENERATED FROM PYTHON SOURCE LINES 46-60

.. code-block:: Python

    transfDataDJ = ot.Sample(
        [
            [m.log(dataDowJones[i, 0]) - m.log(dataDowJones[i - 1, 0])]
            for i in range(1, size)
        ]
    )
    curve = ot.Curve(days[:-1], transfDataDJ)
    graph = ot.Graph(
        "Log-daily returns of the Dow Jones Index", "Day index", "Index", True, ""
    )
    graph.add(curve)
    view = otv.View(graph)


.. image-sg:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_002.png
   :alt: Log-daily returns of the Dow Jones Index
   :srcset: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 61-66

For convenience of presentation, we rescale the data:

.. math::
   \tilde{X}_i \rightarrow 100\tilde{X}_i.


.. GENERATED FROM PYTHON SOURCE LINES 66-69

.. code-block:: Python

    scalTransfDataDJ = transfDataDJ * 100.0
    size = len(scalTransfDataDJ)


.. GENERATED FROM PYTHON SOURCE LINES 70-76

In order to select a threshold upon which the GPD model will be fitted, we draw
the mean residual life plot with approximate :math:`95\%` confidence interval.
The curve is initially linear and shows significant curvature for
:math:`u \in [-1, 2]`. Then for :math:`u \geq 2`, the curve is considered as
reasonably linear when judged to confidence intervals. Hence, we
choose the threshold :math:`u=2`. There are 37 exceedances of :math:`u`.

.. GENERATED FROM PYTHON SOURCE LINES 76-81

.. code-block:: Python

    factory = ot.GeneralizedParetoFactory()
    graph = factory.drawMeanResidualLife(scalTransfDataDJ)
    view = otv.View(graph)
    u = 2.0


.. image-sg:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_003.png
   :alt: Mean residual life plot
   :srcset: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 82-89

**Stationary GPD modeling assuming independence in data**

We first assume that the dependence between the transformed data
is negligible, so we first consider the data as
independent observations over the observation period.
We estimate the parameters of the GPD distribution modeling the excesses
above :math:`u=2` by maximizing the log-likelihood of the excesses.

.. GENERATED FROM PYTHON SOURCE LINES 89-91

.. code-block:: Python

    result_LL = factory.buildMethodOfLikelihoodMaximizationEstimator(scalTransfDataDJ, u)


.. GENERATED FROM PYTHON SOURCE LINES 92-93

We get the fitted GPD and the estimated parameters :math:`(\hat{\sigma}, \hat{\xi})`.

.. GENERATED FROM PYTHON SOURCE LINES 93-99

.. code-block:: Python

    fitted_GPD = result_LL.getDistribution()
    desc = fitted_GPD.getParameterDescription()
    param = fitted_GPD.getParameter()
    print(", ".join([f"{p}: {value:.3f}" for p, value in zip(desc, param)]))
    print("log-likelihood = ", result_LL.getLogLikelihood())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    sigma: 0.626, xi: 0.066, u: 2.000
    log-likelihood =  -22.103513585430218


.. GENERATED FROM PYTHON SOURCE LINES 100-107

We get the asymptotic distribution of the estimator :math:`(\hat{\sigma}, \hat{\xi})`.
The threshold :math:`u` has not been estimated to ensure the regularity
of the model and then the asymptotic properties of the maximum likelihood
estimators. This is the reason why it appears as a Dirac distribution centered on
the chosen threshold.
In that case, the asymptotic distribution of :math:`(\hat{\sigma}, \hat{\xi})`
is normal.

.. GENERATED FROM PYTHON SOURCE LINES 107-111

.. code-block:: Python

    parameterEstimate = result_LL.getParameterDistribution()
    print("Asymptotic distribution of the estimator : ")
    print(parameterEstimate)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Asymptotic distribution of the estimator : 
    BlockIndependentDistribution(Normal(mu = [0.625758,0.0661835], sigma = [0.1838,0.193486], R = [[  1        -0.795905 ]
     [ -0.795905  1        ]]), Dirac(point = [2]))


.. GENERATED FROM PYTHON SOURCE LINES 112-114

We get the covariance matrix and the standard deviation of
:math:`(\hat{\sigma}, \hat{\xi}, u)` where :math:`u` is deterministic.

.. GENERATED FROM PYTHON SOURCE LINES 114-117

.. code-block:: Python

    print("Cov matrix = \n", parameterEstimate.getCovariance())
    print("Standard dev = ", parameterEstimate.getStandardDeviation())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Cov matrix = 
     [[  0.0337823 -0.0283045  0         ]
     [ -0.0283045  0.0374369  0         ]
     [  0          0          0         ]]
    Standard dev =  [0.1838,0.193486,0]


.. GENERATED FROM PYTHON SOURCE LINES 118-120

We get the marginal confidence intervals of order 0.95 of
:math:`(\hat{\sigma}, \hat{\xi})`.

.. GENERATED FROM PYTHON SOURCE LINES 120-125

.. code-block:: Python

    order = 0.95
    for i in range(2):  # exclude u parameter (fixed)
        ci = parameterEstimate.getMarginal(i).computeBilateralConfidenceInterval(order)
        print(desc[i] + ":", ci)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    sigma: [0.265518, 0.985999]
    xi: [-0.313043, 0.44541]


.. GENERATED FROM PYTHON SOURCE LINES 126-138

At last, we can check the quality of the inference thanks to the 4 usual
diagnostic plots:

- the probability-probability pot,
- the quantile-quantile pot,
- the return level plot,
- the data histogram and the density of the fitted model.

We conclude that the goodness-of-fit in the quantile plots seems unconvincing,
even if the other plots appear to be reasonable. This is due to the fact that
the excesses can not be considered as independent: the transformed series
:math:`\tilde{X}_i` has a rich structure of temporal dependence.

.. GENERATED FROM PYTHON SOURCE LINES 138-142

.. code-block:: Python

    validation = otexp.GeneralizedParetoValidation(result_LL, scalTransfDataDJ)
    graph = validation.drawDiagnosticPlot()
    view = otv.View(graph)


.. image-sg:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_004.png
   :alt: , Sample versus model PP-plot, Sample versus model QQ-plot, Return level plot, Density
   :srcset: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 143-161

**Stationary GPD modeling taking into account the dependence in data**

We illustrate the fact that the excesses of the transformed series happen in
groups. Hence we use the declustering method
which filters the dependent observations exceeding a given threshold to obtain a
set of threshold excesses that can be assumed as independent.

Consecutive exceedances of :math:`u` belong to the same cluster. Two distinct
clusters are separated by :math:`r` consecutive observations under the
threshold. Within each cluster, we select the maximum value that will be used to
infer the GPD distribution. The cluster maxima are assumed to be independent.

On the graph, we show the clusters over the threshold :math:`u=2`
and all the maxima selected within each cluster.
It is possible to extract the data belonging to the same cluster and the
cluster maximum series.
We denote by :math:`n_c` the number of clusters and
by :math:`n_u` the number of exceedances above :math:`u`.

.. GENERATED FROM PYTHON SOURCE LINES 161-173

.. code-block:: Python

    part = otexp.SamplePartition(scalTransfDataDJ)
    r = 3
    peaks, clusters = part.getPeakOverThreshold(u, r)
    nc = len(peaks)
    nu = sum([1 if scalTransfDataDJ[i, 0] > u else 0 for i in range(size)])
    print(f"nc={nc} nu={u} theta={nc / nu:.3f}")
    graph = clusters.draw(u)
    graph.setTitle(
        "Threshold exceedances and clusters by transformed Dow Jones Index series"
    )
    view = otv.View(graph)


.. image-sg:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_005.png
   :alt: Threshold exceedances and clusters by transformed Dow Jones Index series
   :srcset: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_005.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    nc=32 nu=2.0 theta=0.865


.. GENERATED FROM PYTHON SOURCE LINES 174-176

We estimate a stationary GPD on the clusters maxima which are independent
with the :math:`95\%` confidence interval of each parameter.

.. GENERATED FROM PYTHON SOURCE LINES 176-184

.. code-block:: Python

    result_LL = factory.buildMethodOfLikelihoodMaximizationEstimator(peaks, u)
    sigma, xi, _ = result_LL.getParameterDistribution().getMean()
    sigma_stddev, xi_stddev, _ = result_LL.getParameterDistribution().getStandardDeviation()
    print(
        f"u={u} r={r} nc={nc} sigma={sigma:.2f} ({sigma_stddev:.2f}) xi={xi:.2f} ({xi_stddev:.2f})",
        end=" ",
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    u=2.0 r=3 nc=32 sigma=0.54 (0.18) xi=0.27 (0.28) 


.. GENERATED FROM PYTHON SOURCE LINES 185-192

We evaluate the :math:`T=100`-year return level which corresponds to the
:math:`m`-observation return level, where :math:`m = T*n_y` with :math:`n_y`
the number of observations per year. Here, we have daily observations, hence
:math:`n_y = 365`. To calculate it, we evaluate the extremal index
:math:`\theta` which is the inverse of the mean length of the clusters,
estimated by the ratio between the number of clusters and the number
of exceedances of :math:`u`.

.. GENERATED FROM PYTHON SOURCE LINES 192-198

.. code-block:: Python

    theta = nc / nu
    ny = 365
    T = 100
    xm_100 = factory.buildReturnLevelEstimator(result_LL, scalTransfDataDJ, T * ny, theta)
    print(f"x100={xm_100.getMean()} ({xm_100.getStandardDeviation()}) theta={theta:.3f}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    x100=[12.5097] ([10.6691]) theta=0.865


.. GENERATED FROM PYTHON SOURCE LINES 199-204

We plot the return level for the new fitted model that takes into account
dependence between excesses.
We can see the fitted model works well. However, the large return level confidence intervals
obtained for extreme return levels makes it difficult to make reliable
predictions with any degree of certainty.

.. GENERATED FROM PYTHON SOURCE LINES 204-210

.. code-block:: Python

    validation = otexp.GeneralizedParetoValidation(result_LL, peaks)
    grid = validation.drawDiagnosticPlot()
    rlPlot = grid.getGraph(1, 0)
    rlPlot.setTitle(rlPlot.getTitle() + f" (u={u} r={r})")
    view = otv.View(rlPlot)


.. image-sg:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_006.png
   :alt: Return level plot (u=2.0 r=3)
   :srcset: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_006.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 211-212

We plot the whole series of validation graphs of the new fitted model.

.. GENERATED FROM PYTHON SOURCE LINES 212-214

.. code-block:: Python

    view = otv.View(grid)


.. image-sg:: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_007.png
   :alt: , Sample versus model PP-plot, Sample versus model QQ-plot, Return level plot, Density
   :srcset: /auto_data_analysis/distribution_fitting/images/sphx_glr_plot_estimate_gpd_dowjones_007.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 215-216

.. code-block:: Python

    otv.View.ShowAll()


.. _sphx_glr_download_auto_data_analysis_distribution_fitting_plot_estimate_gpd_dowjones.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_estimate_gpd_dowjones.ipynb <plot_estimate_gpd_dowjones.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_estimate_gpd_dowjones.py <plot_estimate_gpd_dowjones.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_estimate_gpd_dowjones.zip <plot_estimate_gpd_dowjones.zip>`