.. _empirical_cdf:

Empirical cumulative distribution function
------------------------------------------

The empirical cumulative distribution function provides a graphical
representation of the probability distribution of a random vector
without making any prior assumption about the form of this
distribution. It is a non-parametric approach, which makes it possible
to describe complex behavior that parametric approaches may not
detect.

In general notation, we look for an
estimator :math:`\widehat{F}_N` of the cumulative distribution function
:math:`F_{X}` of the random vector
:math:`\vect{X} = \left( X^1,\ldots,X^{n_X} \right)`:

.. math::

   \begin{aligned}
       \widehat{F}_N \leftrightarrow F_{X}
     \end{aligned}

Let us first consider the one-dimensional case and denote
:math:`\vect{X} = X^1 = X`. The empirical probability distribution is
the distribution built from a sample of observed values
:math:`\left\{x_1, x_2, \ldots, x_N\right\}`. It corresponds to a
discrete uniform distribution on
:math:`\left\{x_1, x_2, \ldots, x_N\right\}`: if :math:`X'` follows
this distribution, then

.. math::

   \begin{aligned}
       \forall \; i \in \left\{1,\ldots, N\right\} ,\ \textrm{Pr}\left(X'=x_i\right) = \frac{1}{N}
     \end{aligned}

The empirical cumulative distribution function :math:`\widehat{F}_N`
of this distribution is constructed as follows:

.. math::

   \begin{aligned}
       \widehat{F}_N(x) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{ \left\{ x_i \leq x \right\} }
     \end{aligned}

The empirical cumulative distribution function :math:`\widehat{F}_N(x)` is
the proportion of observations that are less than or equal to
:math:`x`. It is thus an approximation of the cumulative distribution
function :math:`F_X(x)`, which is the probability that an observation is
less than or equal to :math:`x`:

.. math::

   \begin{aligned}
       F_X(x) = \textrm{Pr} \left( X \leq x \right)
     \end{aligned}
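
As a minimal sketch of the definition above, the following pure-Python
function counts the observations that are less than or equal to
:math:`x` and divides by the sample size. The function name
``empirical_cdf`` and the sample values are illustrative assumptions and
are not part of the library; note that tied observations each contribute
:math:`1/N`.

.. code-block:: python

    def empirical_cdf(sample, x):
        """Proportion of observations in `sample` that are <= x."""
        n = len(sample)
        # Sum of the indicator functions 1_{x_i <= x}, divided by N
        return sum(1 for xi in sample if xi <= x) / n

    # Hypothetical unordered sample with one tied value
    sample = [3.0, 1.0, 2.0, 2.0]
    print(empirical_cdf(sample, 0.5))  # below all observations
    print(empirical_cdf(sample, 2.0))  # three of four observations are <= 2
    print(empirical_cdf(sample, 3.0))  # at the largest observation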

The diagram below illustrates the empirical cumulative distribution
function of the ordered sample :math:`\left\{5,6,10,22,27\right\}`.

.. plot::

    import openturns as ot
    from matplotlib import pyplot as plt
    import openturns.viewer as otv

    sample = [[5.0], [6.0], [10.0], [22.0], [27.0]]
    xmin = 0.0
    xmax = 30.0
    graph = ot.UserDefined(sample).drawCDF(xmin, xmax)
    graph.setTitle('Empirical CDF')
    otv.View(graph)
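
The heights of the steps in such a diagram can be checked numerically.
The sketch below is independent of the library and only assumes the
Python standard library: it sorts the sample once and uses a binary
search to count the observations that are less than or equal to
:math:`x`. The helper name ``ecdf`` is hypothetical.

.. code-block:: python

    from bisect import bisect_right

    sample = [5.0, 6.0, 10.0, 22.0, 27.0]
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        """Empirical CDF of `sample` evaluated at x."""
        # bisect_right returns the number of sorted values that are <= x
        return bisect_right(ordered, x) / n

    # Value of the empirical CDF at each observation: it jumps by 1/N there
    steps = [(xi, ecdf(xi)) for xi in ordered]
    print(steps)

Between two consecutive observations the function is constant, for
instance ``ecdf(8.0)`` equals ``ecdf(6.0)``.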

The method is similar for the case :math:`n_X>1`. The empirical
probability distribution is the distribution built from a sample
:math:`\left\{\vect{x}_1, \vect{x}_2, \ldots, \vect{x}_N\right\}`. It
corresponds to a discrete uniform distribution on
:math:`\left\{\vect{x}_1, \vect{x}_2, \ldots, \vect{x}_N\right\}`: if
:math:`\vect{X}'` follows this distribution, then

.. math::

   \begin{aligned}
       \forall \; i \in \left\{1,\ldots, N\right\} ,\ \textrm{Pr}\left(\vect{X}'=\vect{x}_i\right) = \frac{1}{N}
     \end{aligned}

Thus we have:

.. math::

   \begin{aligned}
       \widehat{F}_N(\vect{x}) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{ \left\{ x^1_i \leq x^1,\ldots,x^{n_X}_i \leq x^{n_X} \right\} }
     \end{aligned}

in comparison with the theoretical cumulative distribution function :math:`F_X`:

.. math::

   \begin{aligned}
       F_X(\vect{x}) = \Prob{X^1 \leq x^1,\ldots,X^{n_X} \leq x^{n_X}}
     \end{aligned}
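
The multivariate definition can be sketched in the same way: an
observation is counted only if every one of its components is less than
or equal to the corresponding component of :math:`\vect{x}`. The
function name ``empirical_cdf_nd`` and the two-dimensional sample below
are illustrative assumptions, not part of the library.

.. code-block:: python

    def empirical_cdf_nd(sample, x):
        """Proportion of points in `sample` that are componentwise <= x."""
        n = len(sample)
        # The indicator is 1 only when all components satisfy x_i^j <= x^j
        count = sum(
            1 for xi in sample if all(c <= t for c, t in zip(xi, x))
        )
        return count / n

    # Hypothetical sample of N = 4 points in dimension n_X = 2
    sample = [(1.0, 4.0), (2.0, 1.0), (3.0, 3.0), (4.0, 2.0)]
    print(empirical_cdf_nd(sample, (3.0, 3.0)))  # 0.5

Here only the points ``(2.0, 1.0)`` and ``(3.0, 3.0)`` satisfy both
inequalities, so the value is 2/4.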

The empirical cumulative distribution function is also referred to in
the literature as the empirical distribution function.

.. topic:: API:

    - See :class:`~openturns.UserDefined` for the empirical distribution

.. topic:: Examples:

    - See :doc:`/auto_data_analysis/sample_analysis/plot_visualize_empirical_cdf`


.. topic:: References:

    - [saporta1990]_
    - [dixon1983]_