.. _pearson_coefficient:

Pearson correlation coefficient
-------------------------------

This method deals with the parametric modelling of a probability
distribution for a random vector
:math:`\vect{X} = \left( X^1,\ldots,X^{n_X} \right)`. It aims to measure
a type of dependence (here a linear correlation) which may exist between
two components :math:`X^i` and :math:`X^j`.

The Pearson’s correlation coefficient :math:`\rho_{U,V}` aims to measure
the strength of a linear relationship between two random variables
:math:`U` and :math:`V`. It is defined as follows:

.. math::

   \begin{aligned}
       \rho_{U,V} = \frac{\displaystyle \Cov{U,V}}{\sigma_U \sigma_V}
     \end{aligned}

where
:math:`\Cov{U,V} = \Expect{ \left( U - m_U \right) \left( V - m_V \right) }`,
:math:`m_U= \Expect{U}`, :math:`m_V= \Expect{V}`,
:math:`\sigma_U= \sqrt{\Var{U}}` and :math:`\sigma_V= \sqrt{\Var{V}}`.
If we have a sample made up of a set of :math:`N` pairs
:math:`\left\{ (u_1,v_1),(u_2,v_2),\ldots,(u_N,v_N) \right\}`, Pearson’s
correlation coefficient can be estimated using the formula:

.. math::

   \begin{aligned}
       \widehat{\rho}_{U,V} = \frac{ \displaystyle \sum_{i=1}^N \left( u_i - \overline{u} \right) \left( v_i - \overline{v} \right) }{ \sqrt{\displaystyle \sum_{i=1}^N \left( u_i - \overline{u} \right)^2 \left( v_i - \overline{v} \right)^2} }
     \end{aligned}

where :math:`\overline{u}` and :math:`\overline{v}` represent the
empirical means of the samples :math:`(u_1,\ldots,u_N)` and
:math:`(v_1,\ldots,v_N)`.

Pearson’s correlation coefficient takes values between -1 and 1. The
closer its absolute value is to 1, the stronger the indication is that a
linear relationship exists between variables :math:`U` and :math:`V`.
The sign of Pearson’s coefficient indicates if the two variables
increase or decrease in the same direction (positive coefficient) or in
opposite directions (negative coefficient). We note that a correlation
coefficient equal to 0 does not necessarily imply the independence of
variables :math:`U` and :math:`V`: this property is in fact
theoretically guaranteed only if :math:`U` and :math:`V` both follow a
Normal distribution. In all other cases, there are two possible
situations in the event of a zero Pearson’s correlation coefficient:

-  the variables :math:`U` and :math:`V` are in fact independent,

-  or a non-linear relationship exists between :math:`U` and :math:`V`.

.. plot::

    import openturns as ot
    from openturns.viewer import View

    N = 20
    ot.RandomGenerator.SetSeed(10)
    x = ot.Uniform(0.0, 10.0).getSample(N)
    f = ot.SymbolicFunction(['x'], ['5*x+10'])
    y = f(x) + ot.Normal(0.0, 5.0).getSample(N)
    graph = f.draw(0.0, 10.0)
    graph.setTitle('A linear relationship exists between U and V:\n Pearson\'s coefficient is a relevant measure of dependency')
    graph.setXTitle('u')
    graph.setYTitle('v')
    cloud = ot.Cloud(x, y)
    cloud.setPointStyle('circle')
    cloud.setColor('orange')
    graph.add(cloud)
    View(graph)

.. plot::

    import openturns as ot
    from openturns.viewer import View

    N = 20
    ot.RandomGenerator.SetSeed(10)
    x = ot.Uniform(0.0, 10.0).getSample(N)
    f = ot.SymbolicFunction(['x'], ['x^2'])
    y = f(x) + ot.Normal(0.0, 5.0).getSample(N)
    graph = f.draw(0.0, 10.0)
    graph.setTitle('There is a strong, non-linear relationship between U and V:\n Pearson\'s coefficient is not a relevant measure of dependency')
    graph.setXTitle('u')
    graph.setYTitle('v')
    cloud = ot.Cloud(x, y)
    cloud.setPointStyle('circle')
    cloud.setColor('orange')
    graph.add(cloud)
    View(graph)

.. plot::

    import openturns as ot
    from openturns.viewer import View

    N = 20
    ot.RandomGenerator.SetSeed(10)
    x = ot.Uniform(0.0, 10.0).getSample(N)
    f = ot.SymbolicFunction(['x'], ['5'])
    y = ot.Uniform(0.0, 10.0).getSample(N)
    graph = f.draw(0.0, 10.0)
    graph.setTitle('Pearson\'s coefficient estimate is quite close to zero\nbecause U and V are independent')
    graph.setXTitle('u')
    graph.setYTitle('v')
    cloud = ot.Cloud(x, y)
    cloud.setPointStyle('circle')
    cloud.setColor('orange')
    graph.add(cloud)
    View(graph)

.. plot::

    import openturns as ot
    from openturns.viewer import View

    N = 20
    ot.RandomGenerator.SetSeed(10)
    x = ot.Uniform(0.0, 10.0).getSample(N)
    f = ot.SymbolicFunction(['x'], ['30*sin(x)'])
    y = f(x) + ot.Normal(0.0, 5.0).getSample(N)
    graph = f.draw(0.0, 10.0)
    graph.setTitle('Pearson\'s coefficient estimate is quite close to zero\neven though U and V are not independent')
    graph.setXTitle('u')
    graph.setYTitle('v')
    cloud = ot.Cloud(x, y)
    cloud.setPointStyle('circle')
    cloud.setColor('orange')
    graph.add(cloud)
    View(graph)

The estimate :math:`\widehat{\rho}` of Pearson’s correlation
coefficient is sometimes denoted by :math:`r`.

.. topic:: API:

    - See :py:func:`~openturns.CorrelationAnalysis_PearsonCorrelation`
    - See :py:meth:`~openturns.Sample.computePearsonCorrelation`

.. topic:: Examples:

    - See :doc:`/auto_data_analysis/manage_data_and_samples/plot_sample_correlation`

.. topic:: References:

    - [saporta1990]_
    - [dixon1983]_
    - [nisthandbook]_
    - [dagostino1986]_
    - [bhattacharyya1997]_
    - [sprent2001]_
    - [burnham2002]_