Compare unconditional and conditional histograms

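In this example, we compare the unconditional and conditional histograms of an input of a simulation, using the flooding model. Let $g$ be the model function. In the version used here, it takes eight inputs ($Q$, $K_s$, $Z_v$, $Z_m$, $B$, $L$, $Z_b$ and $H_d$) and returns three outputs ($H$, $S$ and $C$); we focus on the input $Q$ and the output $S$.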

We first consider the (unconditional) distribution of the input $Q$.

Let $t$ be a given threshold on the output $S$: we consider the event $S > t$. We then consider the conditional distribution of the input $Q$ given that $S > t$, that is, $Q | S > t$.

If these two distributions differ significantly, we conclude that the input $Q$ has an impact on the event $S > t$.

To approximate the distribution of the output $S$, we perform a Monte Carlo simulation of size 500. The threshold $t$ is chosen as the 90% quantile of the empirical distribution of $S$. In this example, each distribution is approximated by its empirical histogram (another approximation, such as kernel smoothing, could be used instead).
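The idea can be sketched on a toy problem before turning to the flooding model. The snippet below is only an illustration using the Python standard library: the linear model, its coefficients and the names `q_values`, `s_values` and `q_given_exceed` are hypothetical and are not part of the flooding model or of OpenTURNS.

```python
import random
import statistics

random.seed(0)
size = 500

# Hypothetical toy model: the output s increases with the input q, plus noise.
q_values = [random.gauss(1000.0, 200.0) for _ in range(size)]
s_values = [0.01 * q + random.gauss(0.0, 1.0) for q in q_values]

# Threshold t: empirical 90% quantile of the output,
# taken as the last of the nine decile cut points.
t = statistics.quantiles(s_values, n=10)[-1]

# Conditional sample of q given that s > t.
q_given_exceed = [q for q, s in zip(q_values, s_values) if s > t]

mean_q = statistics.mean(q_values)
mean_q_cond = statistics.mean(q_given_exceed)
print(len(q_given_exceed), mean_q, mean_q_cond)
```

Because the toy output depends on `q`, the conditional mean is shifted upwards relative to the unconditional mean, which is the kind of difference the histograms are meant to reveal.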

import numpy as np
from openturns.usecases import flood_model
import openturns as ot
import openturns.viewer as viewer
from matplotlib import pylab as plt

ot.Log.Show(ot.Log.NONE)

We use the FloodModel data class that contains all the case parameters.

fm = flood_model.FloodModel()

Create an input sample from the joint distribution defined in the data class. Then build the output sample by evaluating the model on the input sample.

size = 500
inputSample = fm.distribution.getSample(size)
inputSample[:5]

	Q (m3/s)	Ks	Zv (m)	Zm (m)	B (m)	L (m)	Zb (m)	Hd (m)
0	2032.978	28.16431	49.81823	54.44882	298.0983	4997.511	55.27675	3.987806
1	831.1784	32.06598	49.8578	54.29531	298.3157	4997.297	55.18741	2.030507
2	1741.776	19.36681	49.08975	55.0745	299.1433	4999.432	55.66693	2.719918
3	800.476	40.00743	49.16216	55.03673	299.5998	5002.712	55.55715	3.080748
4	917.9835	38.23018	49.19878	54.97124	302.2765	5008.607	55.36659	3.816204

outputSample = fm.model(inputSample)
outputSample[:5]

	H	S	C
0	3.470401	-5.975926	1.034781
1	1.900478	-5.459643	1.140406
2	3.659279	-5.637822	1.102684
3	1.492342	-7.983398	0.7745734
4	1.66541	-8.3186	0.7507753

Merge the input and output samples into a single sample.

sample = ot.Sample(inputSample)
sample.stack(outputSample)
sample[0:5]

	Q (m3/s)	Ks	Zv (m)	Zm (m)	B (m)	L (m)	Zb (m)	Hd (m)	H	S	C
0	2032.978	28.16431	49.81823	54.44882	298.0983	4997.511	55.27675	3.987806	3.470401	-5.975926	1.034781
1	831.1784	32.06598	49.8578	54.29531	298.3157	4997.297	55.18741	2.030507	1.900478	-5.459643	1.140406
2	1741.776	19.36681	49.08975	55.0745	299.1433	4999.432	55.66693	2.719918	3.659279	-5.637822	1.102684
3	800.476	40.00743	49.16216	55.03673	299.5998	5002.712	55.55715	3.080748	1.492342	-7.983398	0.7745734
4	917.9835	38.23018	49.19878	54.97124	302.2765	5008.607	55.36659	3.816204	1.66541	-8.3186	0.7507753

Extract the first column of inputSample, which contains the flowrates $Q$.

sampleQ = inputSample[:, 0]

The next cell defines a function that computes the conditional sample of a component, given that another marginal (identified by its index criteriaComponent) exceeds a threshold defined by its quantile level.

def computeConditionnedSample(
    sample, alpha=0.9, criteriaComponent=None, selectedComponent=0
):
    """
    Return the values of the selectedComponent-th component of the sample
    for which the criteriaComponent-th component exceeds its
    alpha-level empirical quantile.
    """
    dim = sample.getDimension()
    # By default, use the last component as the criterion.
    if criteriaComponent is None:
        criteriaComponent = dim - 1
    # Sort the rows by increasing value of the criterion component.
    sortedSample = sample.sortAccordingToAComponent(criteriaComponent)
    # Threshold: alpha-level quantile of the criterion component.
    quantiles = sortedSample.computeQuantilePerComponent(alpha)
    quantileValue = quantiles[criteriaComponent]
    # Indices of the sorted rows whose criterion exceeds the threshold.
    sortedSampleCriteria = sortedSample[:, criteriaComponent]
    indices = np.where(np.array(sortedSampleCriteria.asPoint()) > quantileValue)[0]
    # Keep the selected component of all rows from the first such index onwards.
    conditionnedSortedSample = sortedSample[int(indices[0]):, selectedComponent]
    return conditionnedSortedSample
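As a point of comparison, the same selection can be sketched without sorting, using a boolean mask on a plain NumPy array. This is an illustrative alternative, not the method used in this example: the function name `conditional_values` and its arguments are hypothetical, and `np.quantile` interpolates linearly by default, so its threshold may differ slightly from the empirical quantile computed by OpenTURNS.

```python
import numpy as np


def conditional_values(data, alpha, criteria_column, selected_column):
    """Values of one column for the rows where another column exceeds its alpha-quantile."""
    data = np.asarray(data, dtype=float)
    # Threshold: alpha-level quantile of the criterion column.
    threshold = np.quantile(data[:, criteria_column], alpha)
    # Boolean mask of the rows whose criterion exceeds the threshold.
    mask = data[:, criteria_column] > threshold
    return data[mask, selected_column]


# Small check: column 0 counts up from 0 to 9, column 1 counts down from 9 to 0.
data = np.column_stack([np.arange(10.0), np.arange(10.0)[::-1]])
result = conditional_values(data, 0.5, 1, 0)
print(result)
```

Unlike the sorted version, this variant returns the selected values in their original row order.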

Create a histogram of the unconditional flowrates.

numberOfBins = 10
histogram = ot.HistogramFactory().buildAsHistogram(sampleQ, numberOfBins)

Extract the sub-sample of the input flowrates Q that leads to large values of the output S.

Find the index of the marginal S among the columns of the sample.

criteriaComponent = list(sample.getDescription()).index("S")
criteriaComponent

alpha = 0.9
selectedComponent = 0
conditionnedSampleQ = computeConditionnedSample(
    sample, alpha, criteriaComponent, selectedComponent
)

We could also use:

# conditionnedHistogram = ot.HistogramFactory().buildAsHistogram(conditionnedSampleQ)

but this creates a histogram with new bins, computed from conditionnedSampleQ. We want to use exactly the same bins as for the full sample, so that the two histograms can be compared directly.

first = histogram.getFirst()
width = histogram.getWidth()
conditionnedHistogram = ot.HistogramFactory().buildAsHistogram(
    conditionnedSampleQ, first, width
)
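The effect of sharing the bins can be illustrated outside OpenTURNS with `numpy.histogram`, which accepts an explicit array of bin edges. The sketch below uses synthetic data: the names `full` and `subset` are hypothetical, and the sub-sample is simply taken as the 50 largest values rather than being conditioned on a model output.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(1000.0, 200.0, size=500)
# Hypothetical sub-sample: the 50 largest values of the full sample.
subset = np.sort(full)[-50:]

# Bin edges are computed once, from the full sample.
density_full, edges = np.histogram(full, bins=10, density=True)
# The sub-sample is binned on the same edges, so the bars line up.
density_subset, edges_subset = np.histogram(subset, bins=edges, density=True)

print(edges)
print(density_subset)
```

With shared edges, bins that contain no point of the sub-sample simply get a zero density, which makes the shift of the conditional histogram easy to read.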

Then create a graph with the unconditional and the conditional histograms.

graph = histogram.drawPDF()
graph.setLegends(["Q"])
#
graphConditionnalQ = conditionnedHistogram.drawPDF()
graphConditionnalQ.setColors(["blue"])
graphConditionnalQ.setLegends([r"$Q | S > S_{%s}$" % (alpha)])
graph.add(graphConditionnalQ)
view = viewer.View(graph)

plt.show()

We see that the two histograms are very different. High values of the input $Q$ often seem to lead to high values of the output $S$.

We could explore this situation further by comparing the unconditional distribution of $Q$ (which is known in this case) with the conditional distribution of $Q | S > t$, estimated by kernel smoothing. For a smooth density, kernel smoothing typically gives a more accurate approximation of the distribution than a histogram.
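To give an idea of what kernel smoothing computes, here is a minimal hand-written Gaussian kernel density estimate in NumPy. It is a sketch under stated assumptions, not the OpenTURNS kernel smoothing implementation: the function `gaussian_kde_pdf` is hypothetical, the bandwidth follows the normal-reference rule of thumb $h = 1.06 \, \hat{\sigma} \, n^{-1/5}$, and the sample is synthetic rather than the conditional sample of $Q$.

```python
import numpy as np


def gaussian_kde_pdf(sample, x):
    """Gaussian kernel density estimate of a 1D sample, evaluated at the points x."""
    sample = np.asarray(sample, dtype=float)
    x = np.asarray(x, dtype=float)
    n = sample.size
    # Normal-reference rule-of-thumb bandwidth (an assumption of this sketch).
    h = 1.06 * sample.std(ddof=1) * n ** (-1.0 / 5.0)
    # Standardized distance between each evaluation point and each sample point.
    z = (x[:, None] - sample[None, :]) / h
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average of the kernels, rescaled by the bandwidth.
    return kernel.sum(axis=1) / (n * h)


rng = np.random.default_rng(0)
# Synthetic stand-in for a small conditional sample.
sample = rng.normal(loc=2000.0, scale=300.0, size=50)
grid = np.linspace(sample.min() - 1500.0, sample.max() + 1500.0, 2001)
pdf = gaussian_kde_pdf(sample, grid)

# Trapezoidal rule: the estimated density should integrate to about 1.
area = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(grid)))
print(area)
```

The resulting curve is smooth and does not depend on a choice of bin edges, although it does depend on the bandwidth.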
