Sensitivity analysis using the Hilbert-Schmidt Independence Criterion (HSIC)


The Hilbert-Schmidt Independence Criterion (HSIC) quantifies the influence that the random vector \vect{X} = \left( X_1,\ldots,X_{d} \right) has on a random variable Y, whose uncertainty is being studied. The influence of each input X_i is evaluated through its dependence with Y. In practice, this dependence is measured as the distance between the joint distribution P_{Y,X_i} and the product of the marginal distributions P_{X_i}P_{Y}.

In the following paragraphs, we consider an independent and identically distributed learning sample of size n, which can, for instance, be obtained through Monte Carlo sampling or real-life observations:

\left(\mathbf{X}^{(j)}, Y^{(j)}\right)_{(1\leq j\leq n)} = \left(X_1^{(j)}, X_2^{(j)}, \dots, X_d^{(j)}, Y^{(j)}\right)_{(1\leq j\leq n)}

where \mathbf{X} and Y respectively follow P_{\mathbf{X}} and P_{Y}. In many cases, only \mathbf{X} is sampled, while Y is obtained as the output of a computer code: Y^{(j)} = \mathcal{M}\left(X_1^{(j)}, X_2^{(j)}, \dots, X_d^{(j)}\right)
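As an illustration, such a learning sample can be generated as sketched below with NumPy. The model function is a hypothetical stand-in (here the well-known Ishigami function) for a real computer code \mathcal{M}:

```python
import numpy as np

def model(x1, x2, x3):
    # Hypothetical computer code M: the Ishigami function, used here
    # only as a stand-in for a real simulator
    return np.sin(x1) + 7.0 * np.sin(x2)**2 + 0.1 * x3**4 * np.sin(x1)

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-np.pi, np.pi, size=(n, 3))  # Monte Carlo sample of X
Y = model(X[:, 0], X[:, 1], X[:, 2])         # one output value per input point
```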

HSIC definition

Suppose that \cX_{i}, for every i \in \{1,\dots,d\}, and \cY are measurable spaces. Let \cF_{i} and \cG be two (universal) Reproducing Kernel Hilbert Spaces (RKHS) of functions mapping \cX_i and \cY to \Rset, respectively. These functional spaces are equipped with their characteristic kernels \kappa_{i}(\cdot,\cdot) and \kappa(\cdot,\cdot), respectively, and the associated scalar products are denoted by \langle \cdot, \cdot \rangle_{\cF_{i}} and \langle \cdot, \cdot \rangle_{\cG}. This allows one to define the evaluation operator: f \in \cF_i \rightarrow  f(\mathbf{x}) = \langle f, \kappa_i (\mathbf{x}, \cdot) \rangle_{\cF_{i}}

Let us now consider \cV_{i}, an RKHS over \cX_{i} \times \cY with kernel v_{i}(\cdot, \cdot). We can define the mean embedding of P_YP_{X_i} and P_{Y,X_i} in \mathcal{V}_{i} as:

\mu [ P_YP_{X_i} ]  = \Eset_{Y} \Eset_{X_i} [v_{i}((Y, X_i),\cdot) ]

\mu [ P_{Y,X_i} ] = \Eset_{Y,X_i} [v_{i}((Y, X_i),\cdot) ]

We can then define a dependence measure between Y and X_i, in the form of the HSIC, as the squared distance between the mean embeddings of P_YP_{X_i} and P_{Y,X_i}:

\mathrm{HSIC}(X_i,Y)_{\cF_{i},\cG} := || \mu [ P_{Y,X_i} ] - \mu [ P_YP_{X_i} ]  ||^2

Assuming v_{i}((Y, X_i),(Y', X_i') ) = \kappa(Y,Y') \kappa_i (X_i, X_i'), it can be shown that:

\mathrm{HSIC}(X_i,Y)_{\cF_{i},\cG} = \Eset[\kappa_i(X_i,X_i^{'})\kappa(Y,Y^{'})]
+ \Eset[\kappa_i(X_i,X_i^{'})]\Eset[\kappa(Y,Y^{'})]
- 2\Eset\left[ \Eset_{X_i^{'}}[\kappa_i(X_i,X_i^{'})] \, \Eset_{Y^{'}}[\kappa(Y,Y^{'})] \right]

where (X_i^{'}, Y^{'}) is an independent and identically distributed copy of (X_i,Y).

HSIC estimators

Two alternative estimators exist for computing the HSIC value. The first one is a biased, but asymptotically unbiased, estimator based on V-statistics:

\widehat{\mathrm{HSIC}}(X_i,Y) = \frac{1}{n^2} \mathrm{Tr}(\mat{L_iHLH})

where \mat{L_i} and \mat{L} are the Gram matrices computed with the respective kernels, L_{i_{j,k}} = \kappa_i\left(X_i^{(j)}, X_i^{(k)}\right) and L_{j,k} = \kappa\left(Y^{(j)}, Y^{(k)}\right), while \mat{H} is the centering matrix defined as: H_{j,k} = \left(\delta_{j,k} - \frac{1}{n}\right)_{1 \leq j, k \leq n}.
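This estimator can be sketched in a few lines of NumPy. The Gaussian kernel, the bandwidth choice, and the function names below are our own assumptions for the sake of illustration, not part of any library:

```python
import numpy as np

def gaussian_gram(v):
    # Gram matrix of a Gaussian kernel, with bandwidth set to the
    # empirical standard deviation of the sample (an assumption)
    d = v[:, None] - v[None, :]
    return np.exp(-d**2 / (2.0 * v.std()**2))

def hsic_v(x, y):
    # Biased V-statistic estimator: Tr(L_i H L H) / n^2, where H is
    # the centering matrix H = I - (1/n) 1 1^T
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gaussian_gram(x) @ H @ gaussian_gram(y) @ H) / n**2
```

On a sample where Y strongly depends on X_i, this estimator returns a clearly larger value than on an independent pair.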

The second estimator is an unbiased estimator based on U-statistics:

\widehat{\mathrm{HSIC}}(X_i,Y) = \frac{1}{n(n-3)} \left[\mathrm{Tr}(\mat{\tilde{L}_i \tilde{L}}) + \frac{\mathbf{1}^{\top} \mat{\tilde{L}_i} \mathbf{1}\mathbf{1}^{\top} \mat{\tilde{L}} \mathbf{1}}{(n-1)(n-2)} - \frac{2}{n-2} \mathbf{1}^{\top} \mat{\tilde{L}_i} \mat{\tilde{L}} \mathbf{1}\right]

where \tilde{L}_{i_{j,k}} and \tilde{L}_{j,k} are computed as:

\tilde{L}_{i_{j,k}} = (1-\delta_{j,k}) \kappa_i\left( X_i^{(j)}, X_i^{(k)}\right)

\tilde{L}_{j,k} = (1-\delta_{j,k}) \kappa\left( Y^{(j)}, Y^{(k)}\right)
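The U-statistic estimator translates almost term by term into code. The sketch below assumes the two Gram matrices have already been computed; the function name is our own:

```python
import numpy as np

def hsic_u(Li, L):
    # Unbiased U-statistic estimator computed from the two Gram
    # matrices; zeroing the diagonals yields the tilde matrices
    n = Li.shape[0]
    Lti = Li - np.diag(np.diag(Li))
    Lt = L - np.diag(np.diag(L))
    one = np.ones(n)
    term1 = np.trace(Lti @ Lt)
    term2 = (one @ Lti @ one) * (one @ Lt @ one) / ((n - 1) * (n - 2))
    term3 = 2.0 * (one @ Lti @ Lt @ one) / (n - 2)
    return (term1 + term2 - term3) / (n * (n - 3))
```

Note that, unlike the V-statistic estimator, this estimator can return slightly negative values on independent samples.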

In order to compare the HSIC values associated with the various input variables X_i, it is common practice to consider a normalized index, bounded between 0 and 1, called the R2-HSIC index and defined as:

\widehat{R_{\mathrm{HSIC},i}^2} = \frac{\widehat{\mathrm{HSIC}}(X_i,Y)}{\sqrt{\widehat{\mathrm{HSIC}}(X_i,X_i)~\widehat{\mathrm{HSIC}}(Y,Y)}}

Please note that, unlike the Sobol' indices typically used in the context of sensitivity analysis, the normalized R2-HSIC indices of a given set of input variables do not, in general, sum to 1.
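The normalized index can be sketched as follows, reusing a V-statistic HSIC estimator; the Gaussian kernel with standard-deviation bandwidth and the function names are illustrative assumptions:

```python
import numpy as np

def _gram(v):
    # Gaussian Gram matrix, bandwidth = empirical standard deviation
    d = v[:, None] - v[None, :]
    return np.exp(-d**2 / (2.0 * v.std()**2))

def hsic_v(x, y):
    # Biased V-statistic HSIC estimator
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(_gram(x) @ H @ _gram(y) @ H) / n**2

def r2_hsic(x, y):
    # Normalized index: HSIC(X_i, Y) divided by the geometric mean of
    # HSIC(X_i, X_i) and HSIC(Y, Y); lies between 0 and 1
    return hsic_v(x, y) / np.sqrt(hsic_v(x, x) * hsic_v(y, y))
```

By construction, r2_hsic(x, x) equals 1, and Cauchy-Schwarz guarantees the index never exceeds 1.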

Most covariance kernels that can be used to compute the Gram matrices of the HSIC estimators are characterized by one or several hyperparameters. No universal rule exists for determining the optimal values of these hyperparameters; however, empirical rules are available for some specific kernels. For instance, the squared exponential (or Gaussian) kernel can be parameterized as a function of the sample empirical variance. In this case, we obtain:

\kappa_i\left(X_i^{(j)}, X_i^{(k)}\right) = \exp\left(- \frac{\left(X_i^{(j)} - X_i^{(k)}\right)^2}{2\theta_i^2}\right)

with \theta_i = \sigma_i, where \sigma_i is the empirical standard deviation of the sample of X_i.
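This parameterization can be sketched as a small helper that, by default, sets the bandwidth to the empirical standard deviation of the sample (the function name is our own):

```python
import numpy as np

def gaussian_gram(sample, theta=None):
    # Squared exponential kernel: by default the bandwidth theta_i is
    # taken equal to the empirical standard deviation of the sample
    if theta is None:
        theta = sample.std()
    d = sample[:, None] - sample[None, :]
    return np.exp(-d**2 / (2.0 * theta**2))
```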

Screening with HSIC-based statistical tests

The HSIC can also be used in order to perform screening on a set of input variables. This can be defined as the process of identifying the input variables which are significantly influential on the considered output. More specifically, within the framework of HSIC this can be done by relying on statistical hypothesis tests. In practice, we wish to test the following hypothesis:

\cT: \textrm{Test}\quad (\cH_{0,i}):\mathrm{HSIC}(X_i,Y) = 0

which, thanks to the properties of the HSIC, is equivalent to assessing the hypothesis of independence between Y and X_i.

We define the test statistic as \widehat{S}_{\cT} := n \times \widehat{\mathrm{HSIC}}(X_i,Y), and the associated p-value as p_{\textrm{val}} = \Pset\left(\widehat{S}_{\cT} \geq \widehat{S}_{\cT,\textrm{obs}}~|~\cH_{0,i}\right), where \widehat{S}_{\cT,\textrm{obs}} is the statistic observed on the given sample. In other words, the p-value represents the probability of obtaining a \mbox{HSIC}(X_i,Y) value as large as the observed one under the assumption that Y and X_i are independent. Therefore, the lower the p-value, the stronger the evidence that the two considered variables are actually dependent. In order to discriminate influential inputs from non-influential ones, it is common practice to fix a significance level \alpha (typically equal to 0.05 or 0.1), to consider all variables associated with a p-value larger than \alpha as non-influential, and to consider all variables associated with a p-value lower than \alpha as having a non-negligible influence on the considered output.

Depending on the size of the available data set, the p-value of a given input variable can be computed either with an asymptotic estimator or with a permutation-based estimator. The asymptotic estimator is used when dealing with sufficiently large data sets, and stems from the fact that the considered test statistic \widehat{S}_{\cT} can be approximated by a Gamma distribution. As a consequence, the p-value can be approximated as follows:

p_{\textrm{val}} \approx 1 - F_{\textrm{Ga}}\left(n\times\widehat{\mathrm{HSIC}}(X_i,Y)_{\textrm{obs}}\right)

where F_{\textrm{Ga}}(\cdot) is the cumulative distribution function of the Gamma distribution. The parameters of this distribution are estimated as a function of the sample values.
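As a minimal sketch, assuming the mean and variance of the test statistic under \cH_{0,i} have already been estimated from the sample (the estimation formulas are omitted here), the Gamma parameters can be obtained by moment matching; the Gamma CDF is evaluated with a crude Riemann sum so that the sketch needs no special-function library:

```python
import math
import numpy as np

def gamma_pvalue(stat_obs, mean_h0, var_h0):
    # Moment-matched Gamma approximation of the test statistic under
    # H0: shape k = m^2 / v, scale s = v / m (m and v are assumed to
    # be estimated from the sample values)
    shape = mean_h0**2 / var_h0
    scale = var_h0 / mean_h0
    # Crude numerical evaluation of the Gamma CDF on [0, stat_obs]
    t = np.linspace(0.0, stat_obs, 20001)[1:]
    dt = t[1] - t[0]
    pdf = t**(shape - 1.0) * np.exp(-t / scale) / (math.gamma(shape) * scale**shape)
    return max(0.0, 1.0 - float(np.sum(pdf) * dt))
```

In practice one would rely on a library routine for the Gamma CDF (e.g. scipy.stats.gamma) rather than this quadrature.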

Alternatively, when dealing with small data sets, a permutation-based estimator of the p-value can be considered. The underlying idea is that, under the independence hypothesis \cH_{0,i}, permuting the considered output sample Y should have no impact on the estimated HSIC value. We therefore consider an initial sample of size n, Z_{n} := \left(X_i^{(j)}, Y^{(j)}\right)_{(1\leq j\leq n)}, and compute \widehat{\mathrm{HSIC}}(Z_{n}). From this sample, we generate a set of B independent permutations \{\tau_1,\dots,\tau_B\} of Y^{(j)}_{(1 \leq j \leq n)} and compute the associated HSIC values: \widehat{H}^{*b} := \widehat{\mathrm{HSIC}} \left(X_i^{(j)}, Y^{(\tau_{b}(j))}\right)_{(1\leq j\leq n)}. We can then estimate the p-value (under \cH_{0,i}) as:

p_{\textrm{val}} \approx \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}_{\widehat{H}^{*b}>\widehat{\mathrm{HSIC}}(Z_{n})}
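The permutation scheme can be sketched as follows; the V-statistic estimator, the Gaussian kernel with standard-deviation bandwidth, and the default number of permutations B are illustrative assumptions:

```python
import numpy as np

def _gram(v):
    # Gaussian Gram matrix, bandwidth = empirical standard deviation
    d = v[:, None] - v[None, :]
    return np.exp(-d**2 / (2.0 * v.std()**2))

def hsic_v(x, y):
    # Biased V-statistic HSIC estimator
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(_gram(x) @ H @ _gram(y) @ H) / n**2

def permutation_pvalue(x, y, B=200, seed=0):
    # p-value estimated as the proportion of HSIC values, computed on
    # permuted output samples, that exceed the observed HSIC value
    rng = np.random.default_rng(seed)
    h_obs = hsic_v(x, y)
    h_perm = np.array([hsic_v(x, rng.permutation(y)) for _ in range(B)])
    return float(np.mean(h_perm > h_obs))
```

For a strongly dependent pair, virtually no permutation yields a larger HSIC value, so the estimated p-value is close to 0.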

Target sensitivity analysis using HSIC

On top of the standard screening and global sensitivity analysis described in the previous paragraphs, HSIC also allows one to perform target sensitivity analysis. The underlying concept is to identify the input parameters that are most influential in causing the considered output Y to fall within a user-defined critical domain \cC. In practice, rather than directly computing the HSIC values on a given set of output values Y^{(j)}_{(1\leq j\leq n)}, we first apply a transformation through the use of a filter function w(\cdot): \tilde{Y}^{(j)} = w(Y^{(j)}). We can then estimate the target HSIC value associated with the input variable X_i as:

\widehat{\mathrm{T\mbox{-}HSIC}}_i = \widehat{\mathrm{HSIC}}(X_i,\tilde{Y})

Please note that both the U-statistic and the V-statistic estimators described in the previous section can be used.
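As a minimal sketch, the example below uses a simple indicator filter w(y) = \mathbf{1}_{y > t}, i.e. a critical domain \cC = (t, +\infty); the filter choice, the V-statistic estimator, and the function names are our own assumptions:

```python
import numpy as np

def _gram(v):
    # Gaussian Gram matrix; guard against a zero standard deviation,
    # which can occur for a binary filtered output
    s = v.std() if v.std() > 0.0 else 1.0
    d = v[:, None] - v[None, :]
    return np.exp(-d**2 / (2.0 * s**2))

def hsic_v(x, y):
    # Biased V-statistic HSIC estimator
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(_gram(x) @ H @ _gram(y) @ H) / n**2

def target_hsic(x, y, threshold):
    # Apply the filter w(y) = 1_{y > threshold} (critical domain
    # C = (threshold, +inf)), then estimate HSIC(X_i, w(Y))
    return hsic_v(x, (y > threshold).astype(float))
```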

Depending on the application, different filter functions can be considered. A first common example of filter function is the exponential function: