6. Parameter estimation¶

The objective of this section is to give the necessary detail on the proposed spectral regularisation term, the optimisation framework followed, and combine this into a final parameter estimation framework for LVMs with spectral regularisation.

6.1. Part one: The spectral regularisation term¶

The proposed spectral regularisation term uses the Fourier transform of the source vectors to enforce spectral orthogonality. Given some column vector \(\mathbf{w} \in \mathbb{R}^{D}\), the Fourier representation of the vector can be computed using

\[\mathcal{F}(\mathbf{w}) = \mathbf{D} \mathbf{w}\]

where \(\mathcal{F}\) represents the Fourier operator and \(\mathbf{D} \in \mathbb{C}^{D \times D}\) represents the discrete Fourier transform (DFT) matrix. The DFT matrix \(\mathbf{D}\) is given by

(1)¶\[\begin{split} \begin{align} & \mathbf{D} = \frac{1}{\sqrt{D}} \mathbf{B} \\ &= \frac{1}{\sqrt{D}} \begin{bmatrix} 1 & 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega & \omega^2 & \omega^3 & \cdots & \omega^{D - 1} \\ 1 & \omega^2 & \omega^4 & \omega^6 & \cdots & \omega^{2 \cdot \left(D - 1\right)} \\ 1 & \omega^3 & \omega^6 & \omega^9 & \cdots & \omega^{3 \cdot \left(D - 1\right)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{D - 1} & \omega^{2 \cdot \left(D - 1\right)} & \omega^{3 \cdot \left(D - 1\right)} & \cdots & \omega^{\left(D - 1\right) \cdot \left(D - 1\right)} \end{bmatrix}\\ &= \frac{1}{\sqrt{D}} \left(\mathbf{R} + \sqrt{-1} \cdot \mathbf{I} \right) \end{align}\end{split}\]

where \(\omega = e^{-2 \cdot \pi \cdot \sqrt{-1} / D} = \sin\left( 2 \cdot \pi / D \right) - \sqrt{-1} \cdot \cos\left( 2 \cdot \pi / D \right)\) is the primitive \(D^{th}\) root of unity, \(\sqrt{-1}\) defines an imaginary number, and \(\mathbf{R}\in \mathbb{R}^{D \times D} = [\mathbf{r}_1, \cdots, \mathbf{r}_D]\) and \(\mathbf{I}\in \mathbb{R}^{D \times D} = [\mathbf{i}_1, \cdots, \mathbf{i}_D]\) represent the real and imaginary components of \(\mathbf{B}\) respectively. Note that the factor \(1/\sqrt{D}\) in Equation (1) ensures that the DFT transform is unitary. The square of the complex modulus for the \(i^{th}\) index in \(\mathcal{F}(\mathbf{w})\) may be represented as

\[\begin{split}\begin{align} b_i(\mathbf{w}) &= \vert \mathcal{F}_i(\mathbf{w}) \vert^2 \\ &= \left\lvert \frac{1}{\sqrt{D}} \left(\sum_{j} R_{ij} \cdot w_j + \sqrt{-1} \cdot I_{ij} \cdot w_j\right) \right\rvert^2 \\ &= \frac{1}{D} \sum_{j} \left[(R_{ij} \cdot w_j)^2 + (I_{ij} \cdot w_j)^2\right], \end{align}\end{split}\]

which may be given in vector notation through

\[\mathbf{b}(\mathbf{w}) = \frac{1}{D} \left( \mathbf{R} \mathbf{w} \odot \mathbf{R} \mathbf{w} + \mathbf{I} \mathbf{w} \odot \mathbf{I} \mathbf{w} \right),\]

where \(\odot\) is the Hadamard product. The square of the complex modulus of the spectral representation of a component vector \(\mathbf{w}\) is used to enforce that the component vectors capture non-duplicate spectral information. Given some vector of interest, \(\mathbf{w}_i\) and the vector of a previous solution iteration \(\mathbf{w}_j\), \(j < i\), the shared source information can be represented via

(2)¶\[h(\mathbf{w}_{i}, \mathbf{w}_{j}) = \mathbf{b}(\mathbf{w}_i)^T \mathbf{b}(\mathbf{w}_j),\]

which then defines the spectral regularisation term as

(3)¶\[\mathcal{L}_{sr}(\mathbf{w}_i) = \alpha \sum_{j=1}^{i-1} h(\mathbf{w}_{i}, \mathbf{w}_{j}).\]

To use the regularisation term given in Equation (3) during the parameter estimation process, it is necessary to determine the gradient vector \(\nabla_{\mathbf{w}_i} \mathcal{L}_{sr}(\mathbf{w}_i)\) and the Jacobian \(\mathbf{J}\left(\nabla_{\mathbf{w}_i} \mathcal{L}_{sr}(\mathbf{w}_i) \right)^T\), i.e. the Hessian \(\mathbf{H}_{sr}\), of Equation (3). For notational simplicity, the second derivative operator is denoted by \(\nabla^2\). It is also noted here that \(\mathbf{w}_i\) and \(\mathbf{w}_j\) are independent variables, which is necessary to readily simplify the subsequent derivatives with respect to \(\mathbf{w}_i\). To enable this derivation, the general Hadamard product for vectors \(\mathbf{v}\in\mathbb{R}^{D}\) and \(\mathbf{u}\in\mathbb{R}^{D}\) is defined as

\[\mathbf{v} \odot \mathbf{u} = \text{diag} \left(\mathbf{v}\right) \mathbf{u} = \text{diag} \left(\mathbf{u}\right) \mathbf{v},\]

where \(\text{diag}(\cdot)\) represents the vector-to-matrix diag operator. The differential of this expression is

\[\begin{split}\begin{align} d \mathbf{p} &= \mathbf{v} \odot d\mathbf{u} + \mathbf{u} \odot d \mathbf{v} \\ &= \text{diag}\left( \mathbf{u} \right) d \mathbf{v} + \text{diag}\left( \mathbf{v} \right) d \mathbf{u} \end{align}\end{split}\]

which, under the assumption that both \(\mathbf{u}\) and \(\mathbf{v}\) depend on some variable \(\boldsymbol{\theta}\), gives the derivative

(4)¶\[ \frac{d \mathbf{p}}{d \boldsymbol{\theta}} = \text{diag}\left( \mathbf{u} \right) \frac{d \mathbf{v}}{d \boldsymbol{\theta}} + \text{diag}\left( \mathbf{v} \right) \frac{d \mathbf{u}}{d \boldsymbol{\theta}}.\]

The gradient vector of Equation (3) can be obtained using Equation (4) through

(5)¶\[\begin{split} \begin{align} \nabla_{\mathbf{w}_i} \mathcal{L}_{sr}(\mathbf{w}_i) &= \alpha \sum_{j=1}^{i - 1} \left( \frac{\partial \mathbf{b}}{\partial \mathbf{w}_i} \right)^T \mathbf{b}(\mathbf{w}_j) \\ &= \frac{2 \cdot \alpha}{D} \sum_{j=1}^{i - 1} \left[ \text{diag}(\mathbf{R} \mathbf{w}_i)\mathbf{R} + \text{diag}(\mathbf{I} \mathbf{w}_i)\mathbf{I} \right]^T \mathbf{b}(\mathbf{w}_j). \end{align}\end{split}\]

To compute the Hessian \(\mathbf{H}_{sr}\), it is easier to first consider what the \(k^{th}\) index in the gradient vector \(\nabla_{\mathbf{w}_i} \mathcal{L}_{sr}(\mathbf{w}_i)\) represents for the \(j^{th}\) summation term. This scalar term is given by

\[\left(\nabla_{\mathbf{w}_i} \mathcal{L}_{sr}(\mathbf{w}_i) \right)^{(j)}_k = \left( (\mathbf{R} \mathbf{w}) \odot \mathbf{r}_k + (\mathbf{I} \mathbf{w}) \odot \mathbf{i}_k \right)^T \mathbf{b}(\mathbf{w}_j),\]

where \(\mathbf{r}_k\) and \(\mathbf{i}_k\) represent the \(k^{th}\) column in \(\mathbf{R}\) and \(\mathbf{I}\) respectively. Computing the derivative of the \(k^{th}\) index in the gradient vector with respect to \(\mathbf{w}_i\) yields

\[\nabla_{\mathbf{w}_i}^T \left(\nabla_{\mathbf{w}_i} \mathcal{L}_{sr}(\mathbf{w}_i) \right)^{(j)}_k = \mathbf{b}(\mathbf{w}_j)^T \left( \text{diag}\left(\mathbf{r}_{k}\right) \mathbf{R} + \text{diag}\left(\mathbf{i}_{k}\right) \mathbf{I} \right),\]

which represents the contribution of the \(j^{th}\) term to the \(k^{th}\) row in the Hessian matrix. Thus, the full Hessian matrix \(\mathbf{H}_{sr} \in \mathbb{R}^{D \times D}\) can be represented as

(6)¶\[\begin{split} \mathbf{H}_{sr} = \frac{2\cdot \alpha}{D} \sum_{j=1}^{i - 1} \begin{bmatrix} \mathbf{b}(\mathbf{w}_j)^T \left( \text{diag}\left(\mathbf{r}_{1}\right) \mathbf{R} + \text{diag}\left(\mathbf{i}_{1}\right) \mathbf{I} \right) \\ \vdots \\ \mathbf{b}(\mathbf{w}_j)^T \left( \text{diag}\left(\mathbf{r}_{D}\right) \mathbf{R} + \text{diag}\left(\mathbf{i}_{D}\right) \mathbf{I} \right) \end{bmatrix}.\end{split}\]

In order to validate this derivation, a notebook is provided in the Examples directory in the Github repository which uses a simple implementation of the derivation completed here and validates the implemented spectral_regulariser code against the simple derivation implementation and a symbolic representation of the spectral regularisation term given in Equation (3).

6.2. Part two: Optimisation formulation¶

The next step is to detail the optimisation formulation and the methodology for parameter optimisation. The general LVM objective function can be written as

\[\begin{split}\begin{align} \min_{\mathbf{w}_i} \quad & \mathcal{L}_{model}(\mathbf{w}_i) + \mathcal{L}_{sr}(\mathbf{w}_i) \\ \text{s.t.} \quad & \mathbf{w}_{i}^{T}\mathbf{w}_{i} = 1, \end{align}\end{split}\]

where \(\mathcal{L}_{model}(\mathbf{w}_i)\) represents the objective function to be minimised, \(\mathcal{L}_{sr}(\mathbf{w}_i)\) represents the spectral orthogonality term which is an additive regularisation term, and the equality constraint \(\mathbf{w}_i^T\mathbf{w}_i=1\) is used to ensure that the objective function focuses on the direction of \(\mathbf{w}_i\) and not its magnitude. In this work, Newton’s method is applied to the Lagrangian to obtain a solution to the general objective function. This can be seen as an application of constrained Newton’s method. The Lagrangian expression used for unconstrained function minimisation may be expressed as

(7)¶\[\begin{split} \begin{align} \mathcal{L}(\mathbf{w}_i, \lambda_{eq}) &= \mathcal{L}_{model}(\mathbf{w}_i) + \mathcal{L}_{sr}(\mathbf{w}_i) \\ & + \lambda_{eq} \left( \mathbf{w}_i^T \mathbf{w}_i - 1 \right), \end{align}\end{split}\]

where \(\lambda_{eq}\) represents the Lagrange multiplier. Note that \(\lambda_{eq}\) is an additional parameter that increases the dimensionality of the problem to \(D + 1\). In the derivation that follows, we generalise the objective function, its gradient vector and Hessian to \(\mathcal{L}_{model}\), \(\nabla_{\mathbf{w}_i} \mathcal{L}_{model}\), and \(\nabla_{\mathbf{w}_i}^2 \mathcal{L}_{model}=\mathbf{H}_{model}\) respectively. This is done as the spectrally-regularised-LVM package caters a general set of user-defined cost functions, each with a unique formulation, and can automatically generate the first and second-order derivatives symbolically, if necessary.

The gradient of Equation (7) with respect to \(\mathbf{w}_i\) can be expressed as

(8)¶\[ \nabla_{\mathbf{w}_i} \mathcal{L} = \nabla_{\mathbf{w}_i} \mathcal{L}_{model} + \nabla_{\mathbf{w}_i} \mathcal{L}_{sr}(\mathbf{w}_i) + 2 \cdot \lambda_{eq} \cdot \mathbf{w}_i.\]

The gradient of Equation (7) with respect to \(\lambda_{eq}\) is given as

\[\nabla_{\lambda_{eq}} \mathcal{L} = \mathbf{w}_i^T \mathbf{w}_i - 1.\]

Thus, the final gradient vector can be combined to be

(9)¶\[\begin{split} \nabla_{\boldsymbol\phi_i} \mathcal{L}(\boldsymbol{\phi}_i) = \begin{bmatrix} \nabla_{\mathbf{w}_i} \mathcal{L} \\ \nabla_{\lambda_{eq}} \mathcal{L} \end{bmatrix},\end{split}\]

where \(\boldsymbol\phi_i \in \mathbb{R}^{D + 1} = \left[ \begin{smallmatrix} \mathbf{w}_i \\ \lambda_{eq} \end{smallmatrix}\right]\) represents the combined optimisation parameters. In the model optimisation step performed by the spectrally-regularised-LVMs package, Newton’s method is used to obtain an estimate the model parameters. In this optimisation scheme, the next step is to compute the Hessian matrix, whereby the Hessian matrix is given in block notation as

(10)¶\[\begin{split} \mathbf{H}_{\mathcal{L}}(\boldsymbol{\phi}_i) = \begin{bmatrix} \nabla^2_{\mathbf{w}_i} \mathcal{L} & \frac{\partial}{\partial \lambda_{eq}} \left( \nabla_{\mathbf{w}_i} \mathcal{L} \right) \\ \frac{\partial}{\partial \lambda_{eq}} \left( \nabla_{\mathbf{w}_i}^T \mathcal{L} \right) & \nabla^2_{\lambda_{eq}} \mathcal{L} \end{bmatrix}\end{split}\]

where each term can be computed in turn. The Jacobian of the gradient vector in Equation (8) is expressed as

\[\begin{split}\begin{align} \nabla^2_{\mathbf{w}_i} \mathcal{L} &= \nabla^2_{\mathbf{w}_i} \mathcal{L}_{model} + \nabla_{\mathbf{w}_i}^2 \mathcal{L}_{sr} + 2 \cdot \lambda_{eq} \cdot \mathbf{I} \\ &= \mathbf{H}_{model} + \mathbf{H}_{sr} + 2 \cdot \lambda_{eq} \cdot \mathbf{I}. \end{align}\end{split}\]

The second term in the Hessian is the derivative of \(\partial \mathcal{L} / \partial \mathbf{\lambda_{eq}}\) with respect to \(\mathbf{w}_i\). This can be obtained through

\[\frac{\partial}{\partial \lambda_{eq}} \left( \nabla_{\mathbf{w}_i} \mathcal{L} \right) = 2 \cdot \mathbf{w}_i.\]

The final component of the Hessian matrix is the second derivative of the Lagrangian function with respect to \(\lambda_{eq}\). This is given by

\[\nabla^2_{\lambda_{eq}} \mathcal{L} = 0.\]

These terms complete the Hessian matrix given in Equation (7). The Hessian matrix applied to the constrained optimisation problem is known as the KKT matrix or the bordered Hessian. The general update scheme for some initial parameter state \(\boldsymbol{\phi}_{i}^{(k)}\) becomes

(11)¶\[ \mathbf{H}_{\mathcal{L}}(\boldsymbol{\phi}_i^{(k)}) \Delta \boldsymbol{\phi}_i = - \nabla_{\boldsymbol\phi_i}\mathcal{L}(\boldsymbol{\phi}_i^{(k)}),\]

where \(\Delta \boldsymbol{\phi}_i \in \mathbb{R}^{D + 1} = \left[\begin{smallmatrix} \Delta \mathbf{w}_i \\ \Delta \lambda_{eq} \end{smallmatrix}\right]\) represents the parameter update vector that is solved through the square system of equations. The respective parameters can then be updated through

(12)¶\[ \mathbf{w}_i^{(k + 1)} = \mathbf{w}_i^{(k)} + \gamma_{i}^{(k)} \cdot \Delta \mathbf{w}_i,\]

(13)¶\[ \lambda_{eq}^{(k + 1)} = \lambda_{eq}^{(k)} + \gamma_{i}^{(k)} \cdot\Delta \lambda_{eq},\]

where \(\gamma_{i}^{(k)}\) is a step size parameter obtained from a univariate line search to satisfy the Armijo condition. Equations (12) and Equation (13) are used until a termination condition occurs. The termination condition used is on the change to the \(\mathbf{w}_i\) vector and is given by

\[\vert \mathbf{w}^{(k), T}_{i} \mathbf{w}^{(k - 1)}_{i} - 1 \vert \leq \epsilon_{tol},\]

where \(\vert \cdot \vert\) is the absolute value function and \(\epsilon_{tol}\) is a convergence tolerance parameter.

6.3. Part three: Putting it all together¶

In the case where we wish to solve for multiple projection vectors, it may be necessary to enforce orthonormality between vectors \(\mathbf{w}_{i}\) and \(\mathbf{w}_j\), where \(j \neq i, \, \forall \, i > 1\). This is achieved through the use of the Gram-Schmidt (GS) orthonormalisation process. This process is given as

(14)¶\[ \mathbf{w}_{i, orth} = \mathbf{w}_i - \sum_{j=1}^{i-1} \frac{\mathbf{w}_{i}^T\mathbf{w}_{j}}{\Vert \mathbf{w}_j \Vert_2} \cdot \mathbf{w}_{j},\]

which produces a vector \(\mathbf{w}_{i, orth}\) that is orthogonal to all previously solved projection vectors. The final step in the GS process is to normalise the vector \(\mathbf{w}_{i, orth}\) by dividing by its vector norm \(\Vert \mathbf{w}_{i, orth} \Vert_2\). This process ensures that \(\forall i, j: \, \mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}\), where \(\delta_{ij}\) is the Kroneker delta function.

The pre-processing strategy followed in the package is to first de-mean the random variables \(x_i\) to be zero-mean. This is given by

(15)¶\[ \overline{\mathbf{X}} = \mathbf{X} - \mathbf{1}\boldsymbol{\mu}^T,\]

where \(\boldsymbol\mu \in \mathbb{R}^{L_w}\) is a column vector of the feature-wise means of the hankel matrix \(\mathbf{X} \in \mathbb{R}^{L_H \times L_w}\) and \(\mathbf{1} \in \mathbb{R}^{L_H}\) is a constant vector with elements 1. If users wish to perform pre-whitening, which is a pre-processing strategy that removes any second-order correlations in \(\mathbf{x}\), a linear transformation is used and given by

(16)¶\[ \tilde{\mathbf{x}} = \mathbf{U} \mathbf{L}^{-1/2} \mathbf{U}^T \overline{\mathbf{x}},\]

where \(\tilde{\mathbf{x}}\) is the transformed variable, \(\mathbf{U} \in \mathbb{R}^{L_w \times L_w}\) is a matrix that represents the eigenvectors of the data covariance matrix \(\mathbf{C} = \mathbb{E} \{ \overline{\mathbf{x}} \, \overline{\mathbf{x}}^T \}\), and \(\mathbf{L}^{-1/2} \in \mathbb{R}^{L_w \times L_w} = \text{diag}\left( 1/\sqrt{\lambda_1} , \cdots, 1 / \sqrt{\lambda_{L_w}} \right)\) is a diagonal matrix which contains the reciprocal of the square root of the eigenvalues of \(\mathbf{C}\).

To provide users with the flexibility to choose between the prescribed update strategy, which uses Newton’s method to obtain a stationary point of the Lagrangian expression, a second_order flag is used in the LinearModel class to specify whether the Hessian matrix is computed or replaced with an identity matrix, \(H_{\mathcal{L}} = \mathbf{I} \in \mathbb{R}^{D + 1 \times D + 1}\). This allows users to choose between the prescribed optimisation strategy and standard gradient descent with a fixed learning rate.

Thus, the general optimisation algorithm used by the spectrally-regularised-LVMs package is given as