Numerical methods for least squares problems

This section presents numerical methods for solving least squares problems, which arise for instance when constructing a response surface (i.e. a meta-model) or when performing a statistical regression.
Given a matrix \matr{\Psi}~\in~\Rset^{N\times P}, N>P, and a vector \underline{y}~\in~\Rset^{N}, we want to find a vector \underline{a}~\in \Rset^{P} such that \matr{\Psi}\: \underline{a} is the best approximation to \underline{y} in the least squares sense. Mathematically speaking, we want to solve the following minimization problem:

\min_{\underline{a}} \, \, \left\| \; \matr{\Psi} \, \underline{a} \, - \, \underline{y} \; \right\|_2

In the following, it is assumed that the rank of matrix \matr{\Psi} is equal to P.

Several algorithms can be applied to compute the least squares solution, as shown in the sequel.

Method of normal equations

It can be shown that the solution of the least squares problem satisfies the so-called normal equations, which read, in matrix notation:

\matr{\Psi}^{\mbox{\scriptsize \textsf{T}}} \; \matr{\Psi} \; \underline{a} \, \, = \, \, \matr{\Psi}^{\mbox{\scriptsize \textsf{T}}} \; \underline{y}

The matrix \matr{C} \equiv \matr{\Psi}^{\mbox{\scriptsize \textsf{T}}} \; \matr{\Psi} is symmetric and positive definite. The system can be solved using the following Cholesky factorization:

\matr{C} \, \, = \, \, \matr{R}^{\mbox{\scriptsize \textsf{T}}} \; \matr{R}

where \matr{R} is an upper triangular matrix with positive diagonal entries. Solving the normal equations is then equivalent to solving the two following triangular systems, which can be easily solved by forward and backward substitution, respectively:

\begin{aligned}
    \matr{R}^{\mbox{\scriptsize \textsf{T}}} \; \underline{z} \, \, = \, \, \matr{\Psi}^{\mbox{\scriptsize \textsf{T}}} \; \underline{y}
    \qquad , \qquad \matr{R} \; \underline{a} \, \, = \, \, \underline{z}
  \end{aligned}
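
As an illustration, a minimal sketch of this scheme in Python is given below. It relies on NumPy and SciPy, which are not part of the original formulation, and the matrix \matr{\Psi} and vector \underline{y} are arbitrary placeholder data.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    # Arbitrary overdetermined system (N > P), assumed to have full column rank.
    rng = np.random.default_rng(0)
    N, P = 100, 5
    Psi = rng.standard_normal((N, P))
    y = rng.standard_normal(N)

    # Normal equations: C a = Psi^T y with C = Psi^T Psi.
    C = Psi.T @ Psi
    b = Psi.T @ y

    # Cholesky factorization of C, followed by the two triangular solves.
    factor = cho_factor(C)
    a = cho_solve(factor, b)

    # Cross-check against NumPy's reference least squares solver.
    a_ref = np.linalg.lstsq(Psi, y, rcond=None)[0]
    print(np.allclose(a, a_ref))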

It has to be noted that this theoretical approach is seldom used in practice though. Indeed, the resulting least squares solution is quite sensitive to small changes in the data (i.e. in \underline{y} and \matr{\Psi}). More precisely, the normal equations are always worse conditioned than the original overdetermined system, since their condition number is the square of that of the original problem:

\kappa(\matr{\Psi}^{\mbox{\scriptsize \textsf{T}}} \; \matr{\Psi}) \, \, = \, \, \kappa(\matr{\Psi})^2
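
This squaring effect can be checked numerically, e.g. with the following short sketch (assuming NumPy and an arbitrary matrix):

    import numpy as np

    rng = np.random.default_rng(1)
    Psi = rng.standard_normal((100, 5))

    # The 2-norm condition number of the normal equations matrix is the
    # square of that of Psi.
    print(np.linalg.cond(Psi))          # kappa(Psi)
    print(np.linalg.cond(Psi.T @ Psi))  # kappa(Psi)^2, up to rounding errors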

As a consequence more robust numerical methods should be adopted.

Method based on QR factorization
It can be shown that the matrix \matr{\Psi} can be factorized as follows:

\matr{\Psi} \, \, = \, \, \matr{Q} \; \matr{R}

where \matr{Q} is an N-by-P matrix with orthonormal columns and \matr{R} is a P-by-P upper triangular matrix. Such a QR decomposition may be constructed using several schemes, such as Gram-Schmidt orthogonalization, Householder reflections or Givens rotations.
In this setup the least squares problem is equivalent to solving:

\matr{R} \; \underline{a} \, \, = \, \, \matr{Q}^{\mbox{\scriptsize \textsf{T}}} \; \underline{y}

This upper triangular system can be solved using backward substitution.
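
A minimal sketch of this QR-based scheme, again assuming NumPy and SciPy and arbitrary placeholder data, could read:

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(2)
    N, P = 100, 5
    Psi = rng.standard_normal((N, P))
    y = rng.standard_normal(N)

    # Reduced QR factorization: Q is N-by-P with orthonormal columns,
    # R is P-by-P upper triangular.
    Q, R = np.linalg.qr(Psi, mode="reduced")

    # Solve R a = Q^T y by backward substitution.
    a = solve_triangular(R, Q.T @ y, lower=False)

    # Cross-check against NumPy's reference least squares solver.
    print(np.allclose(a, np.linalg.lstsq(Psi, y, rcond=None)[0]))
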
The solving scheme based on Householder QR factorization leads to a relative error that is proportional to:

\kappa(\matr{\Psi}) + \|\underline{r}\|_2 \kappa(\matr{\Psi})^2

where \underline{r} = \matr{\Psi} \, \underline{a} \, - \, \underline{y}. Thus this error is expected to be much smaller than the one associated with the normal equations provided that the residual \underline{r} is “small”.

Method based on singular value decomposition
The so-called singular value decomposition (SVD) of matrix \matr{\Psi} reads:

\matr{\Psi} \quad = \quad \matr{U} \; \matr{S} \; \matr{V}^{\mbox{\scriptsize \textsf{T}}}

where \matr{U}~\in \Rset^{N \times N} and \matr{V}~\in \Rset^{P \times P} are orthogonal matrices, and \matr{S}~\in \Rset^{N \times P} can be cast as:

\begin{aligned}
    \matr{S} \quad = \quad \left(
    \begin{array}{c}
      \matr{S}_1 \\
      \matr{0} \\
    \end{array}
    \right)
  \end{aligned}

In the previous equation, \matr{S}_1~\in \Rset^{P \times P} is a diagonal matrix containing the singular values \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_P > 0 of \matr{\Psi}.
It can be shown that the least squares solution is equal to:

\begin{aligned}
    \widehat{\underline{a}} \quad = \quad \matr{V} \; \left( \begin{array}{cc}
      \matr{S}_1^{-1} & \matr{0} \\
    \end{array}\right)
    \; \matr{U}^{\mbox{\scriptsize \textsf{T}}} \; \underline{y}
  \end{aligned}

In practice it is not common to compute a “full” SVD as shown above. Instead, it is often sufficient and more economical in terms of time and memory to compute a reduced version of the SVD, which reads:

\matr{\Psi} \quad = \quad \matr{U}_1 \; \matr{S}_1 \; \matr{V}^{\mbox{\scriptsize \textsf{T}}}

where \matr{U}_1 is obtained by extracting the first P columns of \matr{U}.
Note that it is also possible to perform an SVD of a rank-deficient matrix \matr{\Psi}. In this case the resulting vector \widehat{\underline{a}} corresponds to the minimum norm least squares solution.
The computational cost of the method is proportional to (NP^2 + P^3) with a factor ranging from 4 to 10, depending on the numerical scheme used to compute the SVD. This cost is higher than those associated with the normal equations and with QR factorization. However, the SVD approach remains relevant insofar as it provides very valuable information, namely the singular values of the matrix \matr{\Psi}.
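
A minimal sketch of the reduced-SVD solution, under the same assumptions as above (NumPy, arbitrary placeholder data), could read as follows; in the rank-deficient case only the non-zero singular values would be inverted:

    import numpy as np

    rng = np.random.default_rng(3)
    N, P = 100, 5
    Psi = rng.standard_normal((N, P))
    y = rng.standard_normal(N)

    # Reduced SVD: U1 is N-by-P, s contains the P singular values,
    # Vt is the transpose of V (P-by-P).
    U1, s, Vt = np.linalg.svd(Psi, full_matrices=False)

    # Least squares solution a = V S1^{-1} U1^T y.
    a = Vt.T @ ((U1.T @ y) / s)

    # Cross-check against NumPy's reference least squares solver.
    print(np.allclose(a, np.linalg.lstsq(Psi, y, rcond=None)[0]))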

Comparison of the methods

Several conclusions may be drawn concerning the various methods considered so far:

  • If N \approx P, the normal equations and Householder QR factorization require about the same computational work. If N \gg P, then the QR approach requires about twice as much work as the normal equations.

  • However, QR appears to be more accurate than the normal equations, so it should almost always be preferred in practice.

  • SVD is also robust, but it proves to be the most computationally expensive scheme. Nonetheless the singular values are obtained as a by-product, which may be particularly useful for analytical and computational purposes.