Sensitivity analysis using Hilbert-Schmidt Indepencence Criterion (HSIC)

Introduction

The Hilbert-Schmidt Indepencence Criterion deals with analyzing the influence that the random vector \vect{X} = \left( X_1,\ldots,X_{d} \right) has on a random variable Y, which is being studied for uncertainty. Here, we attempt to evaluate the influence through the dependence between the two random variables Y and X^i. In practice, we compute the dependence between Y and X^i as the distance between the joint distribution P_{Y,X^i} and the product of the marginal distributions P_{X^i}P_{Y}.

In the following paragraphs, we consider an independent and identically distributed learning sample of size n, which can for instance be obtained through Monte Carlo sampling or real life observations:

\left(\mathbf{X}^{(j)}, Y^{(j)}\right)_{(1\leq j\leq n)} = \left(X_1^{(j)}, X_2^{(j)}, \dots, X_d^{(j)}, Y^{(j)}\right)_{(1\leq j\leq n)}

where \mathbf{X} and Y respectively follow P_{\mathbf{X}} and P_{Y}. In many cases, only \mathbf{X} is sampled, while P_{Y} is obtained as the output of a computer code: Y^{(j)} = \mathcal{M}\left(X_1^{(j)}, X_2^{(j)}, \dots, X_d^{(j)}\right)

HSIC definition

Suppose \cX_{i} \ \forall i \ \in \{1,\dots,d\} and \cY are measurable spaces. Let \cF_{i} : \cX_i \rightarrow \Rset and \cG : \cY \rightarrow \Rset be two (universal) Reproducing Kernel Hilbert Spaces (RKHS). These functional spaces are equipped with their characteristic kernels: (resp.) \kappa_{i}(\cdot,\cdot) and \kappa(\cdot,\cdot) and the associated scalar products are denoted by \langle \cdot, \cdot \rangle_{\cF_{i}} and \langle \cdot, \cdot \rangle_{\cG}. This allows to define the evaluation operator: f \in \cF_i \rightarrow  f(\mathbf{x}) = \langle f, \kappa_i (\mathbf{x}, \cdot) \rangle_{\cF_{i}}

Let us now consider \cV_{i}, an RKHS over \cX_{i} \times \cY with kernel v_{i}(\cdot, \cdot). We can define the mean embedding of P_YP_{X_i} and P_{Y,X_i} in \mathcal{V}_{i} as:

\mu [ P_YP_{X_i} ]  = \Eset_{Y} \Eset_{X_i} [v_{i}((Y, X_i),\cdot) ]

\mu [ P_{Y,X_i} ] = \Eset_{Y,X_i} [v_{i}((Y, X_i),\cdot) ]

We can then define a dependence measure between Y and X^i, under the form of HSIC, as the squared distance between the mean embeddings of P_YP_{X_i} and P_{Y,X_i}:

\mathrm{HSIC}(X_i,Y)_{\cF_{i},\cG} := || \mu [ P_{Y,X_i} ] - \mu [ P_YP_{X_i} ]  ||^2

Assuming v_{i}((Y, X_i),(Y', X_i') ) = \kappa(Y,Y') \kappa_i (X_i, X_i'), it can be shown that:

\mathrm{HSIC}(X_i,Y)_{\cF_{i},\cG} = \Eset[\kappa_i(X_i,X_i^{'})\kappa(Y,Y^{'})]
    + \Eset[\kappa_i(X_i,X_i^{'})]\Eset[\kappa(Y,Y^{'})] -
2\Eset[\Eset[\kappa_i(X_i,X_i^{'})|X_i]~\Eset[\kappa(Y,Y^{'})|Y]]

where (X_i^{'}, Y^{'}) is an independent and identically distributed copy of (X_i,Y).

HSIC estimators

Two alternative estimators exist in order to compute the HSIC value. The first one is a biased, but asymptotically unbiased estimator based on V-statistics:

\widehat{\mathrm{HSIC}}(X_i,Y) = \frac{1}{n^2} \mathrm{Tr}(\mat{L_iHLH})

where \mat{L_i} and \mat{L} are Gram matrices computed with the respective kernels L_{i_{j,k}} = \kappa_i(x_i^j,x_i^k) and L_{j,k}= \kappa(y^j,y^k), while \mat{H} a shift matrix defined as: H_{j,k} = \left(\delta_{j,k} - \frac{1}{n}\right)_{1 \leq j, k \leq n}.

The second estimator is an unbiased estimator based on U-statistics:

\widehat{\mathrm{HSIC}}(X_i,Y) = \frac{1}{n(n-3)} \left[\mathrm{Tr}(\mat{\tilde{L}_i \tilde{L}}) + \frac{\mathbf{1}^{\top} \mat{\tilde{L}_i} \mathbf{1}\mathbf{1}^{\top} \mat{\tilde{L}} \mathbf{1}}{(n-1)(n-2)} - \frac{2}{n-2} \mathbf{1}^{\top} \mat{\tilde{L}_i} \mat{\tilde{L}} \mathbf{1}\right]

where \tilde{L}_{i_{j,k}} and \tilde{L}_{j,k} are computed as:

\tilde{L}_{i_{j,k}} = (1-\delta_{j,k}) \kappa_i\left( X_i^{(j)}, X_i^{(k)}\right)

    \tilde{L}_{j,k} = (1-\delta_{j,k}) \kappa\left( Y^{(j)}, Y^{(k)}\right)

In order to compare the HSIC values associated to various input variables X^i, it is common practice to consider a normalized index (bounded between 0 and 1) called R2-HSIC, and defined as:

\widehat{R_{\mathrm{HSIC},i}^2} = \frac{\widehat{\mathrm{HSIC}}(X_i,Y)}{\sqrt{\widehat{\mathrm{HSIC}}(X_i,X_i)~\widehat{\mathrm{HSIC}}(Y,Y)}}

Please note that, differently from the Sobol indices typically used in the context of sensitivity analysis, the sum of all normalized R2-HSIC indices for a given set of considered input variables does not sum to 1.

Most covariance kernels that can be used in order to compute the Gram matrices of the HSIC estimators are characterized by one or several hyper-parameters. No universal rule exists allowing to determine the optimal value for these hyper-parameters, however, for some specific kernels empirical rules are proposed. For instance, the squared exponential (or Gaussian) kernel can be parameterized as a function of the sample empirical variance. In this case, we obtain :

\kappa_i(x_i^j,x_i^k) = \exp (- \frac{(x_i^j - x_i^k)^2}{2\theta_i^2})

with \theta_i = \sigma_i, where \sigma_i is the empirical standard deviation of the sample X_i.

Screening with HSIC-based statistical tests

The HSIC can also be used in order to perform screening on a set of input variables. This can be defined as the process of identifying the input variables which are significantly influential on the considered output. More specifically, within the framework of HSIC this can be done by relying on statistical hypothesis tests. In practice, we wish to test the following hypothesis:

\cT: \textrm{Test}\quad (\cH_{0,i}):\mathrm{HSIC}(X_i,Y) = 0

which, thanks to the HSIC properties, is equivalent to assessing the hypothesis of independence between Y and X^i.

We define the test statistic as: \widehat{S}_{\cT} := n \times \widehat{\mathrm{HSIC}}(X_i,Y), and the associated p-value: p_{\textrm{val}} = \Pset\left(\widehat{S}_{\cT} \geq \widehat{S}_{\cT,\textrm{obs}}~|~\cH_{0,i}\right), where \widehat{S}_{\cT,\textrm{obs}} is the stastistic observed on the given sample. In other words, the p-value represents the probability of obtaining a \mbox{HSIC}(X_i,Y) value as large as the observed one under the assumption that Y and X_i are independent. Therefore, the lower the p-value is, the higher are the chances that the two considered variables are actually dependent. In order to discriminate influential inputs from non-influential ones, it is common practice to fix an acceptance level \alpha (typically equal to 0.05, or 0.1), and to consider all variables associated to a p-value larger than \alpha as being non-influential, and all variables associated to p-values lower than \alpha as having a non-negligible influence on the considered output.

Depending on the size of the available data set, the p-value of a given input variable can be either computed with an asymptotic estimator, or with a permutation-based estimator. The asymptotic estimator is used when dealing with sufficiently large data sets, and stems from the fact that the considered test statistic \widehat{S}_{\cT} can be approached by a Gamma distribution. As a consequence, the p-value can be approximated as follows:

p_{\textrm{val}} \approx 1 - F_{\textrm{Ga}}\left(n\times\widehat{\mathrm{HSIC}}(X_i,Y)_{\textrm{obs}}\right)

where F_{\textrm{Ga}}(\cdot) is the cumulative distribution function of the Gamma distribution. The parameters of this distribution are estimated as a function of the sample values.

Alternatively, when dealing with small data sets, a permutation-based estimator of the p-value can be considered. The underlying idea is that under the independence hypothesis \cH_{0,i}, considering a permutation of the considered output sample Y should have no impact on the estimated HSIC value. We therefore consider an initial n-size sample Z_{n} := \left(X_i^{(j)}, Y^{(j)}\right)_{(1\leq j\leq n)} and \widehat{\mathrm{HSIC}}(Z_{n}). From these samples, we can generate a set of B independent permutations \{\tau_1,\dots,\tau_B\} of Y^{(j)}_{(1 \leq j \leq n)} and compute the associated HSIC values: \widehat{H}^{*b} := \widehat{\mathrm{HSIC}} \left(X^{(j)}, Y^{(\tau_{b}(j))}\right)_{(1\leq j\leq n)}. We can then finally estimate the p-value (under \cH_{0,i}) as :

p_{\textrm{val}} \approx \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}_{\widehat{H}^{*b}>\widehat{\mathrm{HSIC}}(X,Y)}

Target sensitivity analysis using HSIC

On top of the standard screening and global sensitivity analysis described in the previous paragraphs, HSIC also allows to perform target sensitivity analysis. The underlying concept is to identify the most influential input parameters which cause the considered output Y to cross into a user-defined critical domain: \cC. In practice, rather than directly computing the HSIC values on a given set of output values Y^{(j)}_{(1\leq j\leq n)}, we first apply a transformation through the use of a filter function w(\cdot) : \tilde{Y^j} = w(Y^j). We can then estimate the target HSIC value associated to the input variable X_i as:

\mathrm{T-}\widehat{\mathrm{HSIC},i} = \widehat{\mathrm{HSIC}}(X_i,\tilde{Y})

Please note that both the U-statistics a the V-statistics estimators described in the previous section can be used.

Depending on the application, different filter functions can be considered. A first common example of filter function is the exponential function:

w(Y) = \exp (-d_{\cC}(Y) /s)

where -d_{\cC}(Y) characterizes the minimum distance between Y and any point contained in the critical domain \cC, while s is a tunable scale parameter.

Alternatively, we can also consider a step filter function defined as:

w(Y) = \begin{cases}  0 \ \ \mathrm{ if } \ \ Y \notin \cC \\ 1 \ \  \mathrm{ if } \ \ Y \in \cC & \end{cases}

This filter function presents the advantage of being simpler and requiring no parameterization. However, it also makes no distinction between points being very close to the critical domain and points which are far from it. This may partially limit the performance of the sensitivity analysis, especially when dealing with small data sets. It is important to note that when considering this step filter function, it is advisable to rely on a covariance kernel adapted to binary variables (for the considered output), such as:

\kappa(\tilde{Y^j},\tilde{Y^k}) = \begin{cases}  1/n_z \ \ & \mathrm{ if } \ \ \tilde{Y^j} = \tilde{Y^k} \\ 0 \ \  & \mathrm{ if } \ \ \tilde{Y^j} \neq \tilde{Y^k} \end{cases}

where n_z is the number of samples in the available data set belonging to the same category as \tilde{Y^j} \ \mathrm{ and } \ \tilde{Y^k}. Please note that this specific kernel can also be used when performing sensitivity analysis on discrete variables.

Conditional sensitivity analysis using HSIC

Similarly to the target sensitivity analysis discussed in the previous paragraph, the HSIC also allows the possibility of performing conditional sensitivity analysis. In this case, the objective is to identify the most influential input variables under the condition that the considered output variable is within a user-defined critical domain. In other words, we are interested in identifying the variables that drive the output variability within the critical domain. This analysis can be achieved by relying on a diagonal weight matrix computed through the use of a weight function w(\cdot) on the considered data set: W_{j,j} = w(Y^j). The underlying purpose of this matrix is to associate to each sample in the data set a weight characterizing its distance from the critical domain. Different definitions of the weight function can be considered. For instance, the exponential and step weight functions defined in the previous paragraph can be used.

Having defined a proper weight function, the conditional HSIC values can be computed by relying on an adapted V-statistics estimator:

\mathrm{C-}\widehat{\mathrm{HSIC}} (X_i,Y) = \frac{1}{n^2} \mathrm{Tr} (\hat{W} L_i \hat{W} H_1 L H_2)

where \hat{W} = \frac{W}{\frac{1}{n}\sum_{j = 1}^{n} W_{j,j} }, H_1 = I_n - \frac{1}{n} U\hat{W},

H_2 = I_n - \frac{1}{n} \hat{W}U, I_n is an n \times n identity matrix and U is an n \times n comprised of ones.

Please note that no U-statistics estimator exists for the conditional HSIC. Furhtermore, differently than in the target analysis case, standard continuous covariance kernels can be used, regardless of the type of weight function that is being considered.

In most applications, it may be worth performing all three types of sensitivity analysis presented in the previous paragaph, i.e., global, target and conditional, in order to gain a more precise understanding of the degree and type of influence of every input variable.