Sensitivity analysis using Hilbert-Schmidt Independence Criterion (HSIC)
Introduction
The Hilbert-Schmidt Independence Criterion (HSIC) is used to analyze the influence that a random vector $\mathbf{X} = (X_1, \dots, X_d)$ has on a random variable $Y$, which is being studied for uncertainty. Here, the influence of each input is evaluated through the dependence between the two random variables $Y$ and $X_i$. In practice, the dependence between $Y$ and $X_i$ is computed as the distance between the joint distribution $P_{Y, X_i}$ and the product of the marginal distributions $P_{X_i} P_Y$.
In the following paragraphs, we consider an independent and identically distributed learning sample of size $n$:
$$\left(X_1^{(j)}, \dots, X_d^{(j)}, Y^{(j)}\right)_{1 \leq j \leq n}$$
where $\mathbf{X}$ and $Y$ respectively follow $P_{\mathbf{X}}$ and $P_Y$. In many cases, only $\mathbf{X}$ is sampled, while $Y$ is obtained as the output of a computer code: $Y = \mathcal{M}(\mathbf{X})$.
HSIC definition
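Suppose $\mathcal{X}_i$ and $\mathcal{Y}$ are measurable spaces. Let $\mathcal{F}_i$ and $\mathcal{G}$ be two (universal) Reproducing Kernel Hilbert Spaces (RKHS) of real-valued functions defined on $\mathcal{X}_i$ and $\mathcal{Y}$. These functional spaces are equipped with their characteristic kernels, respectively $\kappa_i(\cdot, \cdot)$ and $\kappa(\cdot, \cdot)$, and the associated scalar products are denoted by $\langle \cdot, \cdot \rangle_{\mathcal{F}_i}$ and $\langle \cdot, \cdot \rangle_{\mathcal{G}}$. This allows one to define the evaluation operator:
$$f(x) = \langle f, \kappa_i(x, \cdot) \rangle_{\mathcal{F}_i} \quad \forall f \in \mathcal{F}_i, \; x \in \mathcal{X}_i, \qquad g(y) = \langle g, \kappa(y, \cdot) \rangle_{\mathcal{G}} \quad \forall g \in \mathcal{G}, \; y \in \mathcal{Y}$$
Let us now consider $\mathcal{H}$, an RKHS over $\mathcal{X}_i \times \mathcal{Y}$ with kernel $v_i(\cdot, \cdot)$. We can define the mean embeddings of $P_{Y, X_i}$ and $P_{X_i} P_Y$ in $\mathcal{H}$ as:
$$\mu[P_{Y, X_i}] = \mathbb{E}_{Y, X_i}\left[v_i\left((X_i, Y), \cdot\right)\right], \qquad \mu[P_{X_i} P_Y] = \mathbb{E}_{X_i} \mathbb{E}_{Y}\left[v_i\left((X_i, Y), \cdot\right)\right]$$
We can then define a dependence measure between $Y$ and $X_i$, in the form of the HSIC, as the squared distance between the mean embeddings of $P_{Y, X_i}$ and $P_{X_i} P_Y$:
$$HSIC(X_i, Y) := \left\| \mu[P_{Y, X_i}] - \mu[P_{X_i} P_Y] \right\|^2_{\mathcal{H}}$$
Assuming $v_i\left((x_i, y), (x_i', y')\right) = \kappa_i(x_i, x_i') \, \kappa(y, y')$, it can be shown that:
$$HSIC(X_i, Y) = \mathbb{E}\left[\kappa_i(X_i, X_i') \, \kappa(Y, Y')\right] + \mathbb{E}\left[\kappa_i(X_i, X_i')\right] \mathbb{E}\left[\kappa(Y, Y')\right] - 2 \, \mathbb{E}\left[\mathbb{E}\left[\kappa_i(X_i, X_i') \mid X_i\right] \mathbb{E}\left[\kappa(Y, Y') \mid Y\right]\right]$$
where $(X_i', Y')$ is an independent and identically distributed copy of $(X_i, Y)$.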
HSIC estimators
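Two alternative estimators are available to compute the HSIC value. The first one is a biased, but asymptotically unbiased, estimator based on V-statistics:
$$\widehat{HSIC}_v(X_i, Y) = \frac{1}{n^2} \mathrm{Tr}\left(L_i H L H\right)$$
where $L_i$ and $L$ are the $n \times n$ Gram matrices computed with the respective kernels $\kappa_i$ and $\kappa$, that is $(L_i)_{kj} = \kappa_i(X_i^{(k)}, X_i^{(j)})$ and $(L)_{kj} = \kappa(Y^{(k)}, Y^{(j)})$, while $H$ is a shift (centering) matrix defined as $H = \left(\delta_{kj} - \frac{1}{n}\right)_{1 \leq k, j \leq n}$.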
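As an illustration, the V-statistic estimator can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the helper names (`gaussian_gram`, `double_center`, `hsic_vstat`) are hypothetical, a Gaussian kernel with bandwidth equal to the sample standard deviation is assumed for both variables, and the toy data are made up for the example.

```python
import math
import random
import statistics


def gaussian_gram(sample):
    """Gram matrix of a Gaussian kernel; the bandwidth is the sample standard deviation (assumed rule)."""
    theta = statistics.stdev(sample)
    return [[math.exp(-((a - b) ** 2) / (2.0 * theta**2)) for b in sample] for a in sample]


def double_center(gram):
    """Return H @ gram @ H for H = I - (1/n) * ones, without building H explicitly."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[k][j] for k in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [gram[k][j] - row_means[k] - col_means[j] + grand_mean for j in range(n)]
        for k in range(n)
    ]


def hsic_vstat(gram_x, gram_y):
    """Biased V-statistic estimator: trace(gram_x @ H @ gram_y @ H) / n**2."""
    n = len(gram_x)
    centered_y = double_center(gram_y)
    # trace(A @ B) is the sum over k, j of A[k][j] * B[j][k]
    trace = sum(gram_x[k][j] * centered_y[j][k] for k in range(n) for j in range(n))
    return trace / n**2


# Toy data (assumption): y depends on x1 only, x2 is an unrelated input.
rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(100)]
y = [a**2 for a in x1]

hsic_x1 = hsic_vstat(gaussian_gram(x1), gaussian_gram(y))
hsic_x2 = hsic_vstat(gaussian_gram(x2), gaussian_gram(y))
print(hsic_x1, hsic_x2)
```

Pure Python is used here only to keep the sketch dependency-free; with a linear algebra library the same computation reduces to a few matrix products.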
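The second estimator is an unbiased estimator based on U-statistics:
$$\widehat{HSIC}_u(X_i, Y) = \frac{1}{n(n-3)} \left[ \mathrm{Tr}\left(\tilde{L}_i \tilde{L}\right) + \frac{\mathbf{1}^T \tilde{L}_i \mathbf{1} \; \mathbf{1}^T \tilde{L} \mathbf{1}}{(n-1)(n-2)} - \frac{2}{n-2} \mathbf{1}^T \tilde{L}_i \tilde{L} \mathbf{1} \right]$$
where $\mathbf{1}$ is the vector of ones of size $n$, and $\tilde{L}_i$ and $\tilde{L}$ are computed as:
$$\tilde{L}_i = L_i - \mathrm{diag}(L_i), \qquad \tilde{L} = L - \mathrm{diag}(L)$$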
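The U-statistic estimator can be sketched in the same spirit. The code below is a hedged illustration rather than a reference implementation: it assumes the usual form of the unbiased estimator (Gram matrices with zeroed diagonals, normalization by n(n-3)), and the function names and toy data are hypothetical.

```python
import math
import random
import statistics


def gaussian_gram(sample):
    """Gram matrix of a Gaussian kernel; the bandwidth is the sample standard deviation (assumed rule)."""
    theta = statistics.stdev(sample)
    return [[math.exp(-((a - b) ** 2) / (2.0 * theta**2)) for b in sample] for a in sample]


def hsic_ustat(gram_x, gram_y):
    """Unbiased U-statistic estimator computed from Gram matrices with zeroed diagonals."""
    n = len(gram_x)
    if n < 4:
        raise ValueError("the U-statistic estimator needs at least 4 points")
    # Remove the diagonal terms of both Gram matrices.
    kx = [[0.0 if k == j else gram_x[k][j] for j in range(n)] for k in range(n)]
    ky = [[0.0 if k == j else gram_y[k][j] for j in range(n)] for k in range(n)]
    trace = sum(kx[k][j] * ky[j][k] for k in range(n) for j in range(n))
    col_sums_x = [sum(kx[k][j] for k in range(n)) for j in range(n)]  # 1^T Kx
    row_sums_y = [sum(row) for row in ky]  # Ky 1
    total_x = sum(col_sums_x)  # 1^T Kx 1
    total_y = sum(row_sums_y)  # 1^T Ky 1
    cross = sum(cx * ry for cx, ry in zip(col_sums_x, row_sums_y))  # 1^T Kx Ky 1
    bracket = trace + total_x * total_y / ((n - 1) * (n - 2)) - 2.0 * cross / (n - 2)
    return bracket / (n * (n - 3))


# Toy data (assumption): y depends on x1 only, x2 is an unrelated input.
rng = random.Random(1)
x1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(100)]
y = [a**2 for a in x1]

hsic_u_x1 = hsic_ustat(gaussian_gram(x1), gaussian_gram(y))
hsic_u_x2 = hsic_ustat(gaussian_gram(x2), gaussian_gram(y))
print(hsic_u_x1, hsic_u_x2)
```

Unlike the V-statistic estimator, this estimate can be slightly negative for independent variables, which is expected for an unbiased estimator of a quantity that is zero.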
In order to compare the HSIC values associated with the various input variables $X_i$, it is common practice to consider a normalized index (bounded between 0 and 1) called R2-HSIC, defined as:
$$\widehat{R}^2_{HSIC, i} = \frac{\widehat{HSIC}(X_i, Y)}{\sqrt{\widehat{HSIC}(X_i, X_i) \, \widehat{HSIC}(Y, Y)}}$$
Please note that, unlike the Sobol indices typically used in sensitivity analysis, the normalized R2-HSIC indices of a given set of input variables do not, in general, sum to 1.
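The normalization can be illustrated with a short sketch. It assumes the V-statistic estimator and a Gaussian kernel with bandwidth equal to the sample standard deviation; the helper names and the toy model are hypothetical and only serve the example.

```python
import math
import random
import statistics


def gaussian_gram(sample):
    """Gram matrix of a Gaussian kernel; the bandwidth is the sample standard deviation (assumed rule)."""
    theta = statistics.stdev(sample)
    return [[math.exp(-((a - b) ** 2) / (2.0 * theta**2)) for b in sample] for a in sample]


def hsic_vstat(gram_x, gram_y):
    """V-statistic HSIC estimate; uses the symmetry of gram_y to center it on the fly."""
    n = len(gram_x)
    means = [sum(row) / n for row in gram_y]  # row means, equal to column means by symmetry
    grand = sum(means) / n
    total = sum(
        gram_x[k][j] * (gram_y[k][j] - means[k] - means[j] + grand)
        for k in range(n)
        for j in range(n)
    )
    return total / n**2


def r2_hsic(x, y):
    """Normalized index: HSIC(x, y) / sqrt(HSIC(x, x) * HSIC(y, y))."""
    gram_x, gram_y = gaussian_gram(x), gaussian_gram(y)
    return hsic_vstat(gram_x, gram_y) / math.sqrt(
        hsic_vstat(gram_x, gram_x) * hsic_vstat(gram_y, gram_y)
    )


# Toy model (assumption): x3 does not enter the output at all.
rng = random.Random(2)
inputs = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(3)]
y = [a + 0.2 * b for a, b in zip(inputs[0], inputs[1])]

indices = [r2_hsic(x, y) for x in inputs]
print(indices, sum(indices))  # the indices are not expected to sum to 1
```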
Most covariance kernels that can be used to compute the Gram matrices of the HSIC estimators are characterized by one or several hyper-parameters. No universal rule exists for determining the optimal value of these hyper-parameters; however, empirical rules have been proposed for some specific kernels. For instance, the squared exponential (or Gaussian) kernel can be parameterized as a function of the sample empirical variance. In this case, we obtain:
$$\kappa_i(x_i, x_i') = \exp\left(-\frac{(x_i - x_i')^2}{2 \theta_i^2}\right)$$
with $\theta_i = \sigma_i$, where $\sigma_i$ is the empirical standard deviation of the sample $X_i$.
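A minimal sketch of this empirical rule is given below. The function name `squared_exponential_gram` is hypothetical, and the use of the unbiased (n-1) standard deviation from the standard library is an assumption; the biased version could be used instead.

```python
import math
import statistics


def squared_exponential_gram(sample, theta=None):
    """Gram matrix of the squared exponential kernel exp(-(a - b)**2 / (2 * theta**2)).

    When theta is not given, it defaults to the sample standard deviation
    (n - 1 denominator, an assumption of this sketch).
    """
    if theta is None:
        theta = statistics.stdev(sample)
    if theta <= 0.0:
        raise ValueError("the bandwidth must be strictly positive")
    return [[math.exp(-((a - b) ** 2) / (2.0 * theta**2)) for b in sample] for a in sample]


# Small sample whose standard deviation is exactly 1.
gram = squared_exponential_gram([0.0, 1.0, 2.0])
for row in gram:
    print([round(value, 4) for value in row])
```

The diagonal entries are always 1, and off-diagonal entries decay with the squared distance between points measured in units of the bandwidth.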
Screening with HSIC-based statistical tests
The HSIC can also be used to perform screening on a set of input variables, that is, to identify the input variables that have a significant influence on the considered output. Within the HSIC framework, this is done by relying on statistical hypothesis tests. In practice, we wish to test the following hypothesis:
$$\mathcal{H}_{0,i} : HSIC(X_i, Y) = 0$$
which, thanks to the HSIC properties, is equivalent to testing the hypothesis of independence between $X_i$ and $Y$.
We define the test statistic as $\widehat{S}_{T_i} := n \times \widehat{HSIC}(X_i, Y)$, and the associated p-value as $p_{val} = P\left(\widehat{S}_{T_i} \geq \widehat{S}_{T_i, obs} \mid \mathcal{H}_{0,i}\right)$, where $\widehat{S}_{T_i, obs}$ is the statistic observed on the given sample. In other words, the p-value is the probability of obtaining a value at least as large as the observed one under the assumption that $X_i$ and $Y$ are independent. Therefore, the lower the p-value, the stronger the evidence that the two considered variables are actually dependent. In order to discriminate influential inputs from non-influential ones, it is common practice to fix an acceptance level $\alpha$ (typically equal to 0.05 or 0.1), to consider all variables with a p-value larger than $\alpha$ as non-influential, and all variables with a p-value lower than $\alpha$ as having a non-negligible influence on the considered output.
Depending on the size of the available data set, the p-value of a given input variable can be computed either with an asymptotic estimator or with a permutation-based estimator. The asymptotic estimator is used when dealing with sufficiently large data sets, and stems from the fact that the distribution of the considered test statistic can be approximated by a Gamma distribution. As a consequence, the p-value can be approximated as follows:
$$p_{val} \approx 1 - F_{Ga}\left(n \times \widehat{HSIC}(X_i, Y)_{obs}\right)$$
where $F_{Ga}$ is the cumulative distribution function of the Gamma distribution. The parameters of this distribution are estimated from the sample values.
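Alternatively, when dealing with small data sets, a permutation-based estimator of the p-value can be considered. The underlying idea is that, under the independence hypothesis $\mathcal{H}_{0,i}$, permuting the considered output sample should have no impact on the estimated HSIC value. We therefore consider an initial sample of size $n$, made of $\left(X_i^{(j)}\right)_{1 \leq j \leq n}$ and $\left(Y^{(j)}\right)_{1 \leq j \leq n}$. From these samples, we can generate a set of $B$ independent permutations $\{\tau_1, \dots, \tau_B\}$ of $Y$ and compute the associated HSIC values: $\widehat{HSIC}^{*b}(X_i, Y) = \widehat{HSIC}(X_i, Y_{\tau_b})$. We can then estimate the p-value (under $\mathcal{H}_{0,i}$) as:
$$p_{val} \approx \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}_{\widehat{HSIC}^{*b}(X_i, Y) > \widehat{HSIC}(X_i, Y)}$$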
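The permutation procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: the V-statistic estimator and a Gaussian kernel with bandwidth equal to the sample standard deviation are used, and the names `permutation_pvalue`, `n_permutations` and `seed` are hypothetical. Permuting the output sample amounts to permuting the rows and columns of its (centered) Gram matrix, so the matrix is computed only once.

```python
import math
import random
import statistics


def gaussian_gram(sample):
    """Gram matrix of a Gaussian kernel; the bandwidth is the sample standard deviation (assumed rule)."""
    theta = statistics.stdev(sample)
    return [[math.exp(-((a - b) ** 2) / (2.0 * theta**2)) for b in sample] for a in sample]


def centered(gram):
    """Double-center a symmetric Gram matrix (rows and columns have zero mean afterwards)."""
    n = len(gram)
    means = [sum(row) / n for row in gram]
    grand = sum(means) / n
    return [[gram[k][j] - means[k] - means[j] + grand for j in range(n)] for k in range(n)]


def permutation_pvalue(x, y, n_permutations=200, seed=0):
    """Fraction of permuted HSIC values strictly larger than the observed one."""
    rng = random.Random(seed)
    n = len(x)
    gram_x = gaussian_gram(x)
    centered_y = centered(gaussian_gram(y))

    def statistic(perm):
        # V-statistic HSIC between x and the output sample reordered by perm.
        return sum(
            gram_x[k][j] * centered_y[perm[k]][perm[j]] for k in range(n) for j in range(n)
        ) / n**2

    identity = list(range(n))
    observed = statistic(identity)
    exceed = 0
    for _ in range(n_permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if statistic(perm) > observed:
            exceed += 1
    return exceed / n_permutations


# Toy data (assumption): y depends on x1 only, x2 is an unrelated input.
rng = random.Random(3)
x1 = [rng.gauss(0.0, 1.0) for _ in range(50)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(50)]
y = [a**2 for a in x1]

p_x1 = permutation_pvalue(x1, y)
p_x2 = permutation_pvalue(x2, y)
print(p_x1, p_x2)
```

With an acceptance level of 0.05, an input would be flagged as influential in this sketch when its estimated p-value falls below 0.05.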
Target sensitivity analysis using HSIC
In addition to the standard screening and global sensitivity analysis described in the previous paragraphs, HSIC also allows one to perform target sensitivity analysis. The underlying concept is to identify the input parameters that most influence whether the considered output enters a user-defined critical domain $\mathcal{C}$. In practice, rather than directly computing the HSIC values on a given set of output values $Y$, we first apply a transformation through the use of a filter function $w$: $\tilde{Y} = w(Y)$. We can then estimate the target HSIC value associated with the input variable $X_i$ as:
$$\widehat{HSIC}_{target}(X_i, Y) = \widehat{HSIC}\left(X_i, w(Y)\right)$$
Please note that both the U-statistics and the V-statistics estimators described in the previous section can be used.
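The filtering step can be sketched as follows. This is only an illustration: the filter used here is a hard indicator of a critical domain of the form [threshold, +inf), which is an assumption made for the example, as are the function names, the threshold value, the Gaussian kernel bandwidth rule and the choice of the V-statistic estimator.

```python
import math
import random
import statistics


def gaussian_gram(sample):
    """Gram matrix of a Gaussian kernel; the bandwidth is the sample standard deviation (assumed rule)."""
    theta = statistics.stdev(sample)
    if theta == 0.0:
        raise ValueError("the sample is constant, so the bandwidth rule cannot be applied")
    return [[math.exp(-((a - b) ** 2) / (2.0 * theta**2)) for b in sample] for a in sample]


def hsic_vstat(gram_x, gram_y):
    """V-statistic HSIC estimate; uses the symmetry of gram_y to center it on the fly."""
    n = len(gram_x)
    means = [sum(row) / n for row in gram_y]
    grand = sum(means) / n
    total = sum(
        gram_x[k][j] * (gram_y[k][j] - means[k] - means[j] + grand)
        for k in range(n)
        for j in range(n)
    )
    return total / n**2


def target_hsic(x, y, filter_function):
    """Apply the filter to the output sample, then estimate HSIC on the filtered values."""
    filtered = [filter_function(value) for value in y]
    return hsic_vstat(gaussian_gram(x), gaussian_gram(filtered))


def indicator_filter(value, threshold=0.5):
    """Hypothetical filter: 1 inside the critical domain [threshold, +inf), 0 outside."""
    return 1.0 if value >= threshold else 0.0


# Toy model (assumption): the output is driven mostly by x1.
rng = random.Random(4)
x1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(100)]
y = [a + 0.1 * b for a, b in zip(x1, x2)]

target_x1 = target_hsic(x1, y, indicator_filter)
target_x2 = target_hsic(x2, y, indicator_filter)
print(target_x1, target_x2)
```

Any other filter function can be passed in place of `indicator_filter`; smoother filters keep some information about how far an output value lies from the critical domain.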
Depending on the application, different filter functions can be considered. A first common example of filter function is the exponential function: