Sensitivity analysis using Hilbert-Schmidt Independence Criterion (HSIC)
Introduction
The Hilbert-Schmidt Independence Criterion deals with analyzing the influence that the random vector
$\mathbf{X} = (X^1, \ldots, X^d)$ has on a random variable $Y$, which is being studied for uncertainty.
Here, we attempt to evaluate the influence through the dependence between the two random variables
$Y$ and $X^i$.
In practice, we compute the dependence between $Y$ and $X^i$ as the distance between the joint
distribution $P_{Y,X^i}$ and the product of the marginal distributions $P_{X^i} P_Y$.
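In the following paragraphs, we consider an independent and identically distributed
learning sample of size $n$, which can for instance be obtained through
Monte Carlo sampling or real-life observations:
$$(\mathbf{X}_j, Y_j)_{1 \leq j \leq n} = (X_j^1, X_j^2, \ldots, X_j^d, Y_j)_{1 \leq j \leq n}$$
where $\mathbf{X}$ and $Y$ respectively follow $P_{\mathbf{X}}$ and $P_Y$.
In many cases, only $\mathbf{X}$ is sampled, while $Y$ is obtained
as the output of a computer code:
$$Y = \mathcal{M}(\mathbf{X})$$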
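As a minimal sketch of how such a learning sample might be built, the snippet below draws the inputs by Monte Carlo sampling and evaluates a computer code on each draw. The function `model`, the uniform input distribution, the dimension and the sample size are illustrative assumptions and are not prescribed by the method.

```python
import random

def model(x):
    """Hypothetical computer code: the output depends on x[0] and x[1] only."""
    return x[0] + x[1] ** 2

def draw_learning_sample(n, d, seed=0):
    """Draw n i.i.d. input points in [-1, 1]^d and evaluate the model on each."""
    rng = random.Random(seed)
    inputs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    outputs = [model(x) for x in inputs]
    return inputs, outputs

X, Y = draw_learning_sample(n=100, d=3)
print(len(X), len(X[0]), len(Y))
```

Fixing the seed keeps the sample reproducible, which is convenient when comparing estimators on the same data.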
HSIC definition
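Suppose $\mathcal{X}^i$ and $\mathcal{Y}$ are measurable spaces.
Let $\mathcal{F}_i$ and $\mathcal{G}$ be two (universal) Reproducing Kernel Hilbert Spaces (RKHS)
of real-valued functions defined on $\mathcal{X}^i$ and $\mathcal{Y}$, respectively.
These functional spaces are equipped with their characteristic kernels, respectively
$\kappa_i(\cdot, \cdot)$ and $\kappa(\cdot, \cdot)$, and the associated
scalar products are denoted by $\langle \cdot, \cdot \rangle_{\mathcal{F}_i}$ and
$\langle \cdot, \cdot \rangle_{\mathcal{G}}$. This allows us to define the evaluation
operators:
$$\langle f, \kappa_i(x, \cdot) \rangle_{\mathcal{F}_i} = f(x), \qquad \langle g, \kappa(y, \cdot) \rangle_{\mathcal{G}} = g(y)$$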
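Let us now consider $\mathcal{V}$, an RKHS over $\mathcal{X}^i \times \mathcal{Y}$
with kernel $v(\cdot, \cdot)$. We can define the mean embedding of $P_{Y,X^i}$ and
$P_{X^i} P_Y$ in $\mathcal{V}$ as:
$$\mu[P_{Y,X^i}] = \mathbb{E}_{Y,X^i}\left[ v((X^i, Y), \cdot) \right], \qquad \mu[P_{X^i} P_Y] = \mathbb{E}_{X^i} \mathbb{E}_{Y}\left[ v((X^i, Y), \cdot) \right]$$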
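We can then define a dependence measure between $Y$ and $X^i$, under
the form of HSIC, as the squared distance between the mean embeddings of
$P_{Y,X^i}$ and $P_{X^i} P_Y$:
$$\mathrm{HSIC}(X^i, Y) = \left\| \mu[P_{Y,X^i}] - \mu[P_{X^i} P_Y] \right\|_{\mathcal{V}}^2$$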
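Assuming $v((x, y), (x', y')) = \kappa_i(x, x') \, \kappa(y, y')$, it can be shown that:
$$\mathrm{HSIC}(X^i, Y) = \mathbb{E}\left[ \kappa_i(X^i, X^{i\prime}) \kappa(Y, Y') \right] + \mathbb{E}\left[ \kappa_i(X^i, X^{i\prime}) \right] \mathbb{E}\left[ \kappa(Y, Y') \right] - 2 \, \mathbb{E}\left[ \mathbb{E}\left[ \kappa_i(X^i, X^{i\prime}) \mid X^i \right] \mathbb{E}\left[ \kappa(Y, Y') \mid Y \right] \right]$$
where $(X^{i\prime}, Y')$ is an independent and identically distributed copy of
$(X^i, Y)$.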
HSIC estimators
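Two alternative estimators can be used to compute the HSIC value. The first one is a biased, but asymptotically unbiased, estimator based on V-statistics:
$$\widehat{\mathrm{HSIC}}(X^i, Y) = \frac{1}{n^2} \mathrm{Tr}\left( L_i H L H \right)$$
where $L_i$ and $L$ are Gram matrices computed with the respective kernels
$\kappa_i$ and $\kappa$,
while $H$ is a shift (centering) matrix defined as:
$H = \left( \delta_{jk} - \frac{1}{n} \right)_{1 \leq j, k \leq n}$.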
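The V-statistics estimator, i.e. the trace of the product of one Gram matrix with the double-centered other one, divided by n squared, can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the helper names, the squared exponential kernel, its scales and the test data are assumptions. The centering matrix is never built explicitly, since multiplying a matrix by it on both sides amounts to subtracting row and column means and adding back the grand mean.

```python
import math
import random

def gaussian_gram(sample, scale):
    """Gram matrix of the squared exponential kernel for a 1-D sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * scale ** 2)) for b in sample]
            for a in sample]

def double_center(gram):
    """Compute H @ gram @ H for H = I - (1/n) * ones, without building H."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[j][k] for j in range(n)) / n for k in range(n)]
    grand_mean = sum(row_means) / n
    return [[gram[j][k] - row_means[j] - col_means[k] + grand_mean
             for k in range(n)] for j in range(n)]

def hsic_vstat(gram_x, gram_y):
    """V-statistics estimator: Tr(Lx H Ly H) / n^2."""
    n = len(gram_x)
    centered_y = double_center(gram_y)
    trace = sum(gram_x[j][k] * centered_y[k][j]
                for j in range(n) for k in range(n))
    return trace / n ** 2

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(60)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(60)]
y = [xi ** 2 for xi in x]  # nonlinear dependence on x

gram_x = gaussian_gram(x, scale=0.5)  # illustrative kernel scales
print("dependent:  ", hsic_vstat(gram_x, gaussian_gram(y, scale=0.3)))
print("independent:", hsic_vstat(gram_x, gaussian_gram(noise, scale=0.5)))
```

A quick sanity check of such a sketch is that a constant output sample, whose Gram matrix contains only ones, yields an estimate of exactly zero up to rounding.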
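The second estimator is an unbiased estimator based on U-statistics:
$$\widehat{\mathrm{HSIC}}(X^i, Y) = \frac{1}{n(n-3)} \left[ \mathrm{Tr}\left( \tilde{L}_i \tilde{L} \right) + \frac{\mathbf{1}^\top \tilde{L}_i \mathbf{1} \, \mathbf{1}^\top \tilde{L} \mathbf{1}}{(n-1)(n-2)} - \frac{2}{n-2} \mathbf{1}^\top \tilde{L}_i \tilde{L} \mathbf{1} \right]$$
where $\tilde{L}_i$ and $\tilde{L}$ are computed as:
$$\tilde{L}_i = L_i - \mathrm{diag}(L_i), \qquad \tilde{L} = L - \mathrm{diag}(L)$$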
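The following sketch implements the usual form of the unbiased U-statistics estimator, built from Gram matrices whose diagonals are set to zero. It assumes the standard three-term expression (a trace term, a product of total sums, and a cross term) and requires at least four points; the helper names, kernel scales and data are illustrative assumptions.

```python
import math
import random

def gaussian_gram(sample, scale):
    """Gram matrix of the squared exponential kernel for a 1-D sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * scale ** 2)) for b in sample]
            for a in sample]

def hsic_ustat(gram_x, gram_y):
    """Unbiased U-statistics HSIC estimator (may be slightly negative)."""
    n = len(gram_x)
    if n < 4:
        raise ValueError("the U-statistics estimator needs at least 4 points")
    # Gram matrices with their diagonals set to zero.
    a = [[0.0 if j == k else gram_x[j][k] for k in range(n)] for j in range(n)]
    b = [[0.0 if j == k else gram_y[j][k] for k in range(n)] for j in range(n)]
    trace_ab = sum(a[j][k] * b[k][j] for j in range(n) for k in range(n))
    sum_a = sum(map(sum, a))
    sum_b = sum(map(sum, b))
    # 1^T A B 1 = sum over k of (column sum of A at k) * (row sum of B at k).
    col_sums_a = [sum(a[j][k] for j in range(n)) for k in range(n)]
    row_sums_b = [sum(row) for row in b]
    cross = sum(ca * rb for ca, rb in zip(col_sums_a, row_sums_b))
    total = (trace_ab
             + sum_a * sum_b / ((n - 1) * (n - 2))
             - 2.0 * cross / (n - 2))
    return total / (n * (n - 3))

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(60)]
y = [xi ** 2 for xi in x]  # nonlinear dependence on x
gram_x = gaussian_gram(x, scale=0.5)  # illustrative kernel scales
gram_y = gaussian_gram(y, scale=0.3)
print(hsic_ustat(gram_x, gram_y))
```

Unlike the V-statistics estimator, this one can return small negative values when the two samples are close to independent, which is the price of unbiasedness.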
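In order to compare the HSIC values associated with various input variables $X^i$,
it is common practice to consider a normalized index (bounded between 0 and 1) called R2-HSIC,
and defined as:
$$R^2_{\mathrm{HSIC}, i} = \frac{\widehat{\mathrm{HSIC}}(X^i, Y)}{\sqrt{\widehat{\mathrm{HSIC}}(X^i, X^i) \, \widehat{\mathrm{HSIC}}(Y, Y)}}$$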
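A hedged sketch of the normalized index is given below: the HSIC estimate between an input and the output is divided by the square root of the product of the two self-HSIC estimates. It assumes the V-statistics estimator and a squared exponential kernel whose scale is the empirical standard deviation; the model `X1 + X2**2` and the helper names are hypothetical.

```python
import math
import random
import statistics

def gaussian_gram(sample):
    """Squared exponential Gram matrix, scale = empirical standard deviation."""
    scale = statistics.stdev(sample)
    return [[math.exp(-((a - b) ** 2) / (2.0 * scale ** 2)) for b in sample]
            for a in sample]

def hsic_vstat(gram_x, gram_y):
    """V-statistics estimator Tr(Lx H Ly H) / n^2, with H Ly H double-centered."""
    n = len(gram_x)
    row = [sum(r) / n for r in gram_y]
    col = [sum(gram_y[j][k] for j in range(n)) / n for k in range(n)]
    grand = sum(row) / n
    return sum(gram_x[j][k] * (gram_y[k][j] - row[k] - col[j] + grand)
               for j in range(n) for k in range(n)) / n ** 2

def r2_hsic(gram_x, gram_y):
    """Normalized index: HSIC(X, Y) / sqrt(HSIC(X, X) * HSIC(Y, Y))."""
    denominator = math.sqrt(hsic_vstat(gram_x, gram_x) * hsic_vstat(gram_y, gram_y))
    return hsic_vstat(gram_x, gram_y) / denominator

rng = random.Random(1)
n = 50
# One list per input variable; the third input does not enter the model.
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(3)]
output = [inputs[0][j] + inputs[1][j] ** 2 for j in range(n)]

gram_y = gaussian_gram(output)
indices = [r2_hsic(gaussian_gram(x), gram_y) for x in inputs]
print(indices)
```

With the V-statistics estimator, the bound between 0 and 1 follows from the Cauchy-Schwarz inequality applied to the centered Gram matrices, and the index of a sample against itself equals 1.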
Please note that, unlike the Sobol' indices typically used in sensitivity analysis, the normalized R2-HSIC indices of a given set of input variables do not necessarily sum to 1.
Most covariance kernels that can be used to compute the Gram matrices of the HSIC estimators are characterized by one or several hyper-parameters. No universal rule exists for determining the optimal value of these hyper-parameters; however, empirical rules have been proposed for some specific kernels. For instance, the squared exponential (or Gaussian) kernel can be parameterized as a function of the sample empirical variance. In this case, we obtain:
$$\kappa_i(x, x') = \exp\left( -\frac{(x - x')^2}{2 \theta^2} \right)$$
with $\theta = \sigma$, where $\sigma$ is the empirical
standard deviation of the sample of $X^i$.
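The empirical rule for the squared exponential kernel can be sketched as follows. The choice of the unbiased (n - 1) standard deviation estimator and the factor 2 in the exponent are assumptions of this sketch, and the sample values are made up for illustration.

```python
import math
import statistics

def squared_exponential(x, x_prime, scale):
    """Squared exponential (Gaussian) kernel with a single scale parameter."""
    return math.exp(-((x - x_prime) ** 2) / (2.0 * scale ** 2))

def gram_matrix(sample):
    """Gram matrix with the kernel scale set to the empirical standard deviation."""
    scale = statistics.stdev(sample)  # unbiased (n - 1) estimator: an assumption
    return [[squared_exponential(a, b, scale) for b in sample] for a in sample]

sample = [0.1, 0.4, 0.35, 0.8, 0.65]  # illustrative values
gram = gram_matrix(sample)
for row in gram:
    print([round(value, 3) for value in row])
```

Tying the scale to the spread of the data makes the kernel insensitive to the units in which the variable is expressed.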
Screening with HSIC-based statistical tests
The HSIC can also be used to perform screening on a set of input variables, i.e. to identify the input variables which have a significant influence on the considered output. Within the framework of HSIC, this can be done by relying on statistical hypothesis tests. In practice, we wish to test the following hypothesis:
$$\mathcal{H}_{0,i}: \mathrm{HSIC}(X^i, Y) = 0$$
which, thanks to the HSIC properties, is equivalent to assessing the hypothesis of
independence between $X^i$ and $Y$.
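We define the test statistic as: $\widehat{S}_T = n \times \widehat{\mathrm{HSIC}}(X^i, Y)$,
and the associated p-value:
$p_{val} = \mathbb{P}\left( \widehat{S}_T \geq \widehat{S}_{T,obs} \mid \mathcal{H}_{0,i} \right)$,
where $\widehat{S}_{T,obs}$ is the statistic observed on the given sample.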
In other words, the p-value represents the probability of obtaining an HSIC
value at least as large as the observed one under the assumption
that $X^i$ and $Y$ are independent. Therefore, the lower the p-value is,
the stronger the evidence that the two considered variables are actually dependent.
In order to discriminate influential inputs from non-influential ones, it is common
practice to fix an acceptance level $\alpha$ (typically equal to 0.05, or 0.1),
and to consider all variables associated with a p-value larger than $\alpha$
as being non-influential, and all variables associated with a p-value lower than $\alpha$
as having a non-negligible influence on the considered output.
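The decision rule above can be sketched in a few lines. The p-values below are hypothetical and only serve the illustration, and treating a p-value exactly equal to the acceptance level as non-influential is an assumption of this sketch.

```python
def screen_inputs(p_values, alpha=0.05):
    """Split input names into (influential, non-influential) at level alpha."""
    influential = [name for name, p in p_values.items() if p < alpha]
    non_influential = [name for name, p in p_values.items() if p >= alpha]
    return influential, non_influential

# Hypothetical p-values, for illustration only.
p_values = {"X1": 0.001, "X2": 0.03, "X3": 0.42}
influential, non_influential = screen_inputs(p_values, alpha=0.05)
print("influential:    ", influential)
print("non-influential:", non_influential)
```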
Depending on the size of the available data set, the p-value of a given input variable
can be computed either with an asymptotic estimator or with a permutation-based estimator.
The asymptotic estimator is used when dealing with sufficiently large data sets,
and stems from the fact that the distribution of the considered test statistic
can be approximated by a Gamma distribution. As a consequence, the p-value can be approximated
as follows:
$$p_{val} \approx 1 - F_{Ga}\left( n \times \widehat{\mathrm{HSIC}}(X^i, Y)_{obs} \right)$$
where $F_{Ga}$ is the cumulative distribution function of
the Gamma distribution. The parameters of this distribution are estimated as a
function of the sample values.
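Alternatively, when dealing with small data sets, a permutation-based estimator of
the p-value can be considered. The underlying idea is that, under the independence
hypothesis $\mathcal{H}_{0,i}$, permuting the considered output sample
should have no impact on the estimated HSIC value. We therefore consider
an initial n-size pair of samples $(X^i_1, \ldots, X^i_n)$ and $(Y_1, \ldots, Y_n)$.
From these samples, we can generate a set of
B independent permutations $\{\tau_1, \ldots, \tau_B\}$ of $Y$
and compute the associated HSIC values:
$\widehat{H}^{*b} = \widehat{\mathrm{HSIC}}(X^i, Y_{\tau_b})$.
We can then finally estimate the p-value (under $\mathcal{H}_{0,i}$) as:
$$p_{val} \approx \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}_{\widehat{H}^{*b} > \widehat{\mathrm{HSIC}}(X^i, Y)}$$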
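The permutation-based estimate can be sketched as below, assuming the V-statistics estimator and counting the proportion of permuted HSIC values that strictly exceed the observed one. Permuting the output sample is done by permuting the rows and columns of its Gram matrix, so the kernel is evaluated only once. Helper names, kernel scales, the number of permutations and the data are illustrative assumptions.

```python
import math
import random

def gaussian_gram(sample, scale):
    """Gram matrix of the squared exponential kernel for a 1-D sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * scale ** 2)) for b in sample]
            for a in sample]

def double_center(gram):
    """Compute H @ gram @ H for H = I - (1/n) * ones, without building H."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[j][k] for j in range(n)) / n for k in range(n)]
    grand_mean = sum(row_means) / n
    return [[gram[j][k] - row_means[j] - col_means[k] + grand_mean
             for k in range(n)] for j in range(n)]

def hsic_with_permutation(centered_x, gram_y, perm):
    """V-statistics HSIC between x and the output sample reordered by perm."""
    n = len(perm)
    # Tr(Lx H Ly H) = Tr((H Lx H) Ly); permuting y permutes rows and columns of Ly.
    return sum(centered_x[j][k] * gram_y[perm[k]][perm[j]]
               for j in range(n) for k in range(n)) / n ** 2

def permutation_p_value(gram_x, gram_y, n_permutations=200, seed=0):
    """Share of permuted HSIC values strictly larger than the observed one."""
    rng = random.Random(seed)
    n = len(gram_y)
    centered_x = double_center(gram_x)
    identity = list(range(n))
    observed = hsic_with_permutation(centered_x, gram_y, identity)
    exceed = 0
    for _ in range(n_permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if hsic_with_permutation(centered_x, gram_y, perm) > observed:
            exceed += 1
    return exceed / n_permutations

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(30)]
y = [xi ** 2 for xi in x]  # nonlinear dependence on x
gram_x = gaussian_gram(x, scale=0.5)  # illustrative kernel scales
gram_y = gaussian_gram(y, scale=0.3)
print(permutation_p_value(gram_x, gram_y))
```

The resolution of the estimated p-value is 1/B, so the number of permutations should be chosen with the targeted acceptance level in mind.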
Target sensitivity analysis using HSIC
On top of the standard screening and global sensitivity analysis described in the
previous paragraphs, HSIC also makes it possible to perform target sensitivity analysis.
The underlying concept is to identify the most influential input parameters which
cause the considered output to cross into a user-defined critical domain
$\mathcal{C}$. In practice, rather than directly computing the HSIC values on a given
set of output values $Y$, we first apply a transformation
through the use of a filter function $w$:
$\tilde{Y} = w(Y)$.
We can then estimate the target HSIC value associated with the input variable $X^i$
as:
$$\widehat{\mathrm{T.HSIC}}(X^i, Y) = \widehat{\mathrm{HSIC}}(X^i, w(Y))$$
Please note that both the U-statistics and the V-statistics estimators described in the previous section can be used.
Depending on the application, different filter functions can be considered. A first common example of filter function is the exponential function:
$$w(Y) = \exp\left( -\frac{d_{\mathcal{C}}(Y)}{s} \right)$$
where $d_{\mathcal{C}}(Y)$ characterizes the minimum distance between $Y$
and any point contained in the critical domain $\mathcal{C}$, while $s$
is a tunable scale parameter.
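As an illustration, the sketch below evaluates an exponential filter, taken here as the exponential of minus the distance to the critical domain divided by the scale parameter. It assumes a scalar output and a critical domain given by a closed interval; the bounds, the scale and the output values are made up.

```python
import math

def distance_to_interval(y, lower, upper):
    """Minimum distance between y and the critical domain [lower, upper]."""
    return max(lower - y, 0.0, y - upper)

def exponential_filter(y, lower, upper, scale):
    """Equals 1 inside the critical domain and decays exponentially outside."""
    return math.exp(-distance_to_interval(y, lower, upper) / scale)

outputs = [0.2, 0.9, 1.5, 3.0]  # illustrative output values
filtered = [exponential_filter(y, lower=1.0, upper=2.0, scale=0.5)
            for y in outputs]
print(filtered)
```

A small scale makes the filter close to a step function, whereas a large scale lets points far from the critical domain keep a noticeable weight.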
Alternatively, we can also consider a step filter function defined as:
$$w(Y) = \begin{cases} 1 & \text{if } Y \in \mathcal{C} \\ 0 & \text{otherwise} \end{cases}$$
This filter function has the advantage of being simpler and requiring no parameterization. However, it makes no distinction between points which are very close to the critical domain and points which are far from it. This may partially limit the performance of the sensitivity analysis, especially when dealing with small data sets. It is important to note that when considering this step filter function, it is advisable to rely on a covariance kernel adapted to binary variables (for the considered output), such as:
$$\kappa(y, y') = \begin{cases} 1 / n_y & \text{if } y = y' \\ 0 & \text{otherwise} \end{cases}$$
where $n_y$ is the number of samples in the available data set belonging to
the same category as $y$.
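A minimal sketch of the step filter combined with a kernel for binary (or, more generally, categorical) outputs follows. The kernel is taken as the inverse of the category count when both points share a category and zero otherwise; the interval used as critical domain and the output values are illustrative assumptions.

```python
from collections import Counter

def step_filter(y, lower, upper):
    """Return 1 if y lies in the critical domain [lower, upper], 0 otherwise."""
    return 1.0 if lower <= y <= upper else 0.0

def categorical_gram(labels):
    """Gram matrix of k(y, y') = 1 / n_y if y == y', and 0 otherwise."""
    counts = Counter(labels)  # n_y: number of samples in each category
    return [[1.0 / counts[a] if a == b else 0.0 for b in labels]
            for a in labels]

outputs = [0.2, 1.1, 1.7, 3.0, 1.4]  # illustrative output values
labels = [step_filter(y, lower=1.0, upper=2.0) for y in outputs]
gram = categorical_gram(labels)
print(labels)
for row in gram:
    print([round(value, 3) for value in row])
```

The resulting Gram matrix can then be used for the output in place of the Gram matrix of a continuous kernel.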
Please note that this specific kernel can also be used when performing sensitivity
analysis on discrete variables.
Conditional sensitivity analysis using HSIC
Similarly to the target sensitivity analysis discussed in the previous paragraph,
HSIC also makes it possible to perform conditional sensitivity analysis.
In this case, the objective is to identify the most influential input variables under
the condition that the considered output variable is within a user-defined critical domain.
In other words, we are interested in identifying the variables that drive the
output variability within the critical domain.
This analysis can be achieved by relying on a diagonal weight matrix computed through
the use of a weight function $w$ on the considered data set:
$W = \mathrm{diag}\left( w(Y_k) \right)_{1 \leq k \leq n}$. The underlying purpose of this matrix is to associate with
each sample in the data set a weight characterizing its distance from the critical domain.
Different definitions of the weight function can be considered. For instance, the exponential
and step weight functions defined in the previous paragraph can be used.
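Having defined a proper weight function, the conditional HSIC values can be computed by relying on an adapted V-statistics estimator:
$$\widehat{\mathrm{C.HSIC}}(X^i, Y) = \frac{1}{n^2} \mathrm{Tr}\left( W L_i W H_1 L H_2 \right)$$
where $H_1 = I_n - \frac{1}{n} U W$,
and
$H_2 = I_n - \frac{1}{n} W U$, with $U$ the $n \times n$ matrix of ones.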
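The sketch below illustrates one way a weighted V-statistics estimator of this kind can be computed. It assumes the form Tr(W Lx W H1 Ly H2) / n^2 with H1 = I - (1/n) U W and H2 = I - (1/n) W U, where U is the matrix of ones, and it further assumes that the weights are normalized by their mean; both points are assumptions of this sketch and should be checked against the reference implementation. With unit weights, it reduces to the standard V-statistics estimator.

```python
import math
import random

def gaussian_gram(sample, scale):
    """Gram matrix of the squared exponential kernel for a 1-D sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * scale ** 2)) for b in sample]
            for a in sample]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[j][m] * b[m][k] for m in range(len(b)))
             for k in range(len(b[0]))] for j in range(len(a))]

def conditional_hsic(gram_x, gram_y, weights):
    """Weighted V-statistics estimator Tr(W Lx W H1 Ly H2) / n^2."""
    n = len(weights)
    mean_weight = sum(weights) / n
    if mean_weight <= 0.0:
        raise ValueError("at least one weight must be positive")
    w = [wk / mean_weight for wk in weights]  # normalization: an assumption
    # U W has every row equal to w, and W U has every column equal to w.
    h1 = [[(1.0 if j == k else 0.0) - w[k] / n for k in range(n)]
          for j in range(n)]
    h2 = [[(1.0 if j == k else 0.0) - w[j] / n for k in range(n)]
          for j in range(n)]
    # W Lx W, computed element-wise since W is diagonal.
    left = [[w[j] * gram_x[j][k] * w[k] for k in range(n)] for j in range(n)]
    right = matmul(matmul(h1, gram_y), h2)
    trace = sum(left[j][k] * right[k][j] for j in range(n) for k in range(n))
    return trace / n ** 2

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(25)]
y = [xi ** 2 for xi in x]  # hypothetical model output
# Exponential weights for the hypothetical critical domain [0.5, +inf).
weights = [math.exp(-max(0.5 - yk, 0.0) / 0.2) for yk in y]

gram_x = gaussian_gram(x, scale=0.5)  # illustrative kernel scales
gram_y = gaussian_gram(y, scale=0.3)
print(conditional_hsic(gram_x, gram_y, weights))
```

Because the weights are normalized by their mean in this sketch, multiplying all of them by the same positive constant leaves the estimate unchanged.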
Please note that no U-statistics estimator exists for the conditional HSIC. Furthermore, unlike in the target analysis case, standard continuous covariance kernels can be used, regardless of the type of weight function that is being considered.
In most applications, it may be worth performing all three types of sensitivity analysis presented in the previous paragraphs, i.e., global, target and conditional, in order to gain a more precise understanding of the degree and type of influence of every input variable.
API:
See HSICEstimatorGlobalSensitivity for global sensitivity analysis HSIC estimators
See HSICEstimatorTargetSensitivity for target sensitivity analysis HSIC estimators
See HSICEstimatorConditionalSensitivity for conditional sensitivity analysis HSIC estimators
See HSICUStat for U-statistic specific HSIC computations
See HSICVStat for V-statistic specific HSIC computations
Examples:
References: