Reproducing Kernel Hilbert Space

In this section we assume that the function classes \(\mathcal{G}\), \(\mathcal{H}\), \(\mathcal{F}\), \(\mathcal{F}^\prime\) are RKHSs. Let \(\Phi_A:\mathcal{G}\rightarrow\mathbb{R}^n\) be the operator whose \(i\) th row is \(\langle \phi(A_i), \cdot \rangle_{\mathcal{G}}\), with corresponding kernel matrix \(K_A = \Phi_A \Phi_A^*\). Define \(\Phi_B, \ldots\) analogously for the rest of the function classes.

Closed form - Estimator 1

We study the estimator

\[\hat{g} = \arg \min_{g \in \mathcal{G}} \max_{f' \in \mathcal{F}'} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] - \lambda \| f' \|_{\mathcal{F}'}^2 + \mu' \| g \|_{\mathcal{G}}^2\]

Formula of minimizers

The minimizer takes the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) where,

\[\begin{split}\hat{\alpha} &= \left(K_A P_{C'} K_A + \mu' K_A \right)^{\dagger} K_A P_{C'} Y \\ P_{C'} &= \left(K_{C'} + \lambda \right)^{\dagger} K_{C'}\end{split}\]

`rkhsiv.RKHSIV`([kernel, gamma, degree, ...])	RKHS IV estimator.
`rkhsiv.RKHSIVCV`([kernel, gamma, degree, ...])	RKHS IV estimator with cross-validation.
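
The closed form above can be evaluated directly from the two kernel matrices. The following is a minimal NumPy sketch of that formula, not the code behind `rkhsiv.RKHSIV`. The helper names `pinv`, `rbf_kernel` and `estimator1_alpha`, the Gaussian kernel, the truncation level `rcond`, and the toy data are all assumptions made for illustration. The scalar \(\lambda\) is read as \(\lambda I\), and `mu` stands for the penalty weight on \(\| g \|_{\mathcal{G}}^2\).

```python
import numpy as np

def pinv(M, rcond=1e-10):
    # Truncated pseudo-inverse; the cutoff is an illustrative choice.
    return np.linalg.pinv(M, rcond=rcond)

def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) (assumed kernel).
    sq = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def estimator1_alpha(K_A, K_Cp, Y, lam, mu):
    # P_{C'} = (K_{C'} + lam I)^+ K_{C'}
    n = K_A.shape[0]
    P = pinv(K_Cp + lam * np.eye(n)) @ K_Cp
    # alpha_hat = (K_A P K_A + mu K_A)^+ K_A P Y
    return pinv(K_A @ P @ K_A + mu * K_A) @ (K_A @ P @ Y)

# Toy data (hypothetical): C' drives A, and Y depends on A.
rng = np.random.default_rng(0)
n = 50
Cp = rng.normal(size=(n, 1))
A = Cp + 0.5 * rng.normal(size=(n, 1))
Y = np.sin(A[:, 0]) + 0.1 * rng.normal(size=n)

K_A, K_Cp = rbf_kernel(A, A), rbf_kernel(Cp, Cp)
alpha_hat = estimator1_alpha(K_A, K_Cp, Y, lam=1e-2, mu=1e-2)

# g_hat(a) = sum_i alpha_i k(A_i, a), evaluated at a few new points.
A_new = np.linspace(-2.0, 2.0, 5)[:, None]
g_new = rbf_kernel(A_new, A) @ alpha_hat
```

Predictions at new points only need the cross-kernel between the new points and the training sample, since \(\hat{g} = \Phi_A^* \hat{\alpha}\).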

Remark (Nyström approximation) A low-rank approximation using the Nyström method is also implemented.

`rkhsiv.ApproxRKHSIV`([kernel_approx, ...])	Approximate RKHS IV estimator using kernel approximations.
`rkhsiv.ApproxRKHSIVCV`([kernel_approx, ...])	Approximate RKHS IV estimator with cross-validation using kernel approximations.

Closed form - Estimator 2

We study the estimator

\[\hat{g} = \arg \min_{g \in \mathcal{G}} \max_{f' \in \mathcal{F'}} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] + \mu' \mathbb{E}_n \{ g(A)^2 \}\]

Formula of minimizers

The minimizer takes the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) where,

\[\begin{split}\hat{\alpha} &= \left( K_A P_{C'} K_A + \mu' K_A^2 \right)^{\dagger} K_A P_{C'} Y \\ P_{C'} &= K_{C'}^{\dagger} K_{C'}\end{split}\]

`rkhsiv.RKHSIVL2`([kernel, gamma, degree, ...])	RKHS IV estimator with L2 regularization.
`rkhsiv.RKHSIVL2CV`([kernel, gamma, degree, ...])	RKHS IV estimator with L2 regularization and cross-validation.
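
One way to sanity-check this closed form is to compare it with the quadratic that remains after the inner maximization. By the same steps used in the proof for Estimator 3 below, that quadratic is, up to the factor \(1/n\), \((G - Y)^{\top} P_{C'} (G - Y) + \mu' G^{\top} G\) with \(G = K_A \alpha\). The sketch below is illustrative only: the function names, the explicit finite-dimensional features, and the truncation level are assumptions, and it is not the code behind `rkhsiv.RKHSIVL2`.

```python
import numpy as np

def pinv(M, rcond=1e-10):
    # Truncated pseudo-inverse; the cutoff is an illustrative choice.
    return np.linalg.pinv(M, rcond=rcond)

def estimator2_alpha(K_A, K_Cp, Y, mu):
    P = pinv(K_Cp) @ K_Cp                        # P_{C'} = K_{C'}^+ K_{C'}
    alpha = pinv(K_A @ P @ K_A + mu * K_A @ K_A) @ (K_A @ P @ Y)
    return alpha, P

def profiled_objective(alpha, K_A, P, Y, mu):
    G = K_A @ alpha                              # fitted values g(A_i)
    return (G - Y) @ P @ (G - Y) + mu * G @ G

rng = np.random.default_rng(0)
n = 40
# Hypothetical explicit features, so that K = features @ features.T.
A_feat, C_feat = rng.normal(size=(n, 3)), rng.normal(size=(n, 5))
Y = rng.normal(size=n)
K_A, K_Cp = A_feat @ A_feat.T, C_feat @ C_feat.T

alpha_hat, P = estimator2_alpha(K_A, K_Cp, Y, mu=0.1)
best = profiled_objective(alpha_hat, K_A, P, Y, mu=0.1)

# Random perturbations of alpha_hat should not decrease the objective.
worse = [profiled_objective(alpha_hat + 0.1 * rng.normal(size=n), K_A, P, Y, mu=0.1)
         for _ in range(20)]
```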

Closed form - Estimator 3

We study the ridge regularized joint estimator:

\[\begin{split}(\hat{g}, \hat{h}) = \arg \min_{g \in \mathcal{G}, h \in \mathcal{H}} \max_{f' \in \mathcal{F}} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] + \mu' \mathbb{E}_n \{ g(A)^2 \} \\ \quad + \max_{f \in \mathcal{F}} \mathbb{E}_n \left[ 2 \left\{ h(B) - g(A) \right\} f(C) - f(C)^2 \right] + \mu \mathbb{E}_n \{ h(B)^2 \}\end{split}\]

Let \(V_{g,h}' = g(A) - Y\) and \(V_{g,h} = h(B) - g(A)\). Let \(\Phi_C : \mathcal{F} \rightarrow \mathbb{R}^n\) be an operator with \(i\) th row \(\langle \phi(C_i), \cdot \rangle_{\mathcal{F}}\). Define \(\Phi_{C'}\) analogously, replacing \(C_i\) with \(C_i'\). Let \(K_C\) and \(K_{C'}\) be the corresponding kernel matrices.

In remarks below, we also study the following modification, which we call the “subsetted” estimator:

\[\begin{split}(\hat{g}, \hat{h}) = \arg \min_{g \in \mathcal{G}, h \in \mathcal{H}} \max_{f' \in \mathcal{F}} \mathbb{E}_p \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] + \mu' \mathbb{E}_n \{ g(A)^2 \} \\ \quad + \max_{f \in \mathcal{F}} \mathbb{E}_q \left[ 2 \left\{ h(B) - g(A) \right\} f(C) - f(C)^2 \right] + \mu \mathbb{E}_n \{ h(B)^2 \}\end{split}\]

where the index sets \([p]\) and \([q]\) partition \([n] = \{1, \ldots, n\}\), so \(p + q = n\).

For the index set \([p]\), let \(I_{[p]} \in \mathbb{R}^{p \times n}\) be the matrix of ones and zeros such that \(V_{[p]} = I_{[p]} V\) gives the elements of \(V\) whose indices are in \([p]\).
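
As a small illustration of the selection matrix \(I_{[p]}\), the sketch below builds it for a hypothetical index set and applies it to a vector. The function name `selection_matrix` and the example indices are assumptions; note that the code uses 0-based indices whereas the text uses \(1, \ldots, n\).

```python
import numpy as np

def selection_matrix(idx, n):
    # I_[p] in R^{p x n}: row j has a single 1 in column idx[j].
    I_sel = np.zeros((len(idx), n))
    I_sel[np.arange(len(idx)), idx] = 1.0
    return I_sel

n = 6
p_idx = np.array([0, 2, 5])      # hypothetical index set [p]
q_idx = np.array([1, 3, 4])      # complementary index set [q]
I_p, I_q = selection_matrix(p_idx, n), selection_matrix(q_idx, n)

V = np.arange(10.0, 16.0)        # V = (10, 11, 12, 13, 14, 15)
V_p = I_p @ V                    # elements of V with indices in [p]
V_q = I_q @ V                    # elements of V with indices in [q]
```

Because \([p]\) and \([q]\) partition \([n]\), the two selection matrices satisfy \(I_{[p]}^{\top} I_{[p]} + I_{[q]}^{\top} I_{[q]} = I_n\).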

Maximizers

Existence of maximizers

There exist coefficients \(\hat{\gamma}_{g,h}, \hat{\gamma}'_{g,h} \in \mathbb{R}^n\) such that maximizers take the form \(\hat{f}_{g,h} = \Phi_C^* \hat{\gamma}_{g,h}\) and \(\hat{f}'_{g,h} = \Phi_{C'}^* \hat{\gamma}'_{g,h}\).

Remark (Subsetted estimator)

For the subsetted estimator, the same results hold but with \(\hat{\gamma}_{g,h;[q]} \in \mathbb{R}^q\) and \(\hat{\gamma}'_{g,h;[p]} \in \mathbb{R}^p\), acting on appropriately modified feature operators \(\Phi^*_{C;[q]}\) and \(\Phi^*_{C';[p]}\).

Proof

Write the objectives for the maximizers as

\[\begin{split}\mathcal{E}'(f') = \mathbb{E}_n \left\{ 2 V'_{g,h} f'(C') - f'(C')^2 \right\} \\ \mathcal{E}(f) = \mathbb{E}_n \left\{ 2 V_{g,h} f(C) - f(C)^2 \right\}\end{split}\]

We prove the result for \(\hat{f}_{g,h}\); the argument for \(\hat{f}'_{g,h}\) is similar. By the Riesz representation theorem,

\[\mathcal{E}(f) = \mathbb{E}_n \left\{ 2 V_{g,h} \langle f, \phi(C) \rangle_{\mathcal{F}} - \langle f, \phi(C) \rangle_{\mathcal{F}}^2 \right\}\]

For an RKHS, evaluation is a continuous linear functional represented as the inner product with the feature map. Due to the ridge penalty, the stated objective has a maximizer \(\hat{f}_{g,h}\) that attains the maximum.

To lighten notation, we suppress the indexing of \(\hat{f}_{g,h}\) by \((g,h)\) for the rest of this argument. Write \(\hat{f} = \hat{f}_n + \hat{f}^{\perp}_n\) where \(\hat{f}_n \in \text{row}(\Phi_C)\) and \(\hat{f}_n^{\perp} \in \text{null}(\Phi_C)\). Since \(\Phi_C \hat{f}_n^{\perp} = 0\), we have \(\hat{f}(C_i) = \langle \hat{f}, \phi(C_i) \rangle_{\mathcal{F}} = \hat{f}_n(C_i)\) for every \(i\). Substituting this decomposition of \(\hat{f}\) into the objective, we see that

\[\mathcal{E}(\hat{f}) = \mathcal{E}(\hat{f}_n)\]

Hence if \(\hat{f}\) is a maximizer, then there exists \(\hat{f}_n\) that is also a maximizer.

Formula of maximizers

The explicit formula for the coefficients is \(\hat{\gamma}_{g,h} = K_C^{\dagger} \vec{V}_{g,h}\) and \(\hat{\gamma}'_{g,h} = K_{C'}^{\dagger} \vec{V}'_{g,h}\).

Remark (Subsetted estimator)

For the subsetted estimator, the same results hold but with \(\hat{\gamma}_{g,h;[q]} = K_{C;[q,q]}^{\dagger} \vec{V}_{g,h;[q]}\) and \(\hat{\gamma}'_{g,h;[p]} = K_{C';[p,p]}^{\dagger} \vec{V}'_{g,h;[p]}\).

Proof

We prove the former result; the latter is similar. Write the objective as

\[\mathcal{E}(f) = 2 \langle f, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle f, \hat{T}_C f \rangle_{\mathcal{F}}\]

where \(\hat{\mu}_{g,h} = \mathbb{E}_n \{ V_{g,h} \phi(C) \} = \frac{1}{n} \Phi_C^* \vec{V}_{g,h}\) and \(\hat{T}_C = \mathbb{E}_n \{ \phi(C) \otimes \phi(C)^* \} = \frac{1}{n} \Phi_C^* \Phi_C\). Hence by the existence of maximizers,

\[\mathcal{E}(\gamma) = 2 \langle \Phi_C^* \gamma_{g,h}, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle \Phi_C^* \gamma_{g,h}, \hat{T}_C \Phi_C^* \gamma_{g,h} \rangle_{\mathcal{F}} = \frac{2}{n} \gamma_{g,h}^{\top} \Phi_C \Phi_C^* \vec{V}_{g,h} - \frac{1}{n} \gamma_{g,h}^{\top} \Phi_C \Phi_C^* \Phi_C \Phi_C^* \gamma_{g,h}\]

Since \(K_C = \Phi_C \Phi_C^*\), the first order condition yields \(K_C \vec{V}_{g,h} = K_C^2 \hat{\gamma}_{g,h}\), i.e. \(\hat{\gamma}_{g,h} = K_C^{\dagger} \vec{V}_{g,h}\) where \(K_C^{\dagger}\) is the pseudoinverse of \(K_C\).
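
The first order condition and the resulting formula can be checked numerically. The sketch below is a hedged illustration: it uses hypothetical explicit features so that \(K_C\) is rank deficient, and a random vector in place of \(\vec{V}_{g,h}\); the variable names and the truncation level `rcond` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 4
feat_C = rng.normal(size=(n, d))     # hypothetical explicit features phi(C_i)
V = rng.normal(size=n)               # stands in for the vector V_{g,h}
K_C = feat_C @ feat_C.T              # K_C = Phi_C Phi_C^*, rank d < n

def objective(gamma):
    f_vals = K_C @ gamma             # f(C_i) for f = Phi_C^* gamma
    return np.mean(2.0 * V * f_vals - f_vals**2)

K_C_pinv = np.linalg.pinv(K_C, rcond=1e-10)
gamma_hat = K_C_pinv @ V             # gamma_hat = K_C^+ V

# First order condition: K_C V = K_C^2 gamma_hat.
foc_gap = np.max(np.abs(K_C @ V - K_C @ K_C @ gamma_hat))

# Value at the maximizer: (1/n) V^T P_C V with P_C = K_C^+ K_C.
P_C = K_C_pinv @ K_C
max_value = V @ P_C @ V / n
```

The last quantity is the value \(\frac{1}{n} \vec{V}_{g,h}^{\top} K_C^{\dagger} K_C \vec{V}_{g,h}\) that the objective takes at \(\hat{\gamma}_{g,h}\), which is what the minimization over \((g,h)\) later works with.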

Minimizers

Let \(\Phi_A : \mathcal{G} \rightarrow \mathbb{R}^n\) be an operator with \(i\) th row \(\langle \phi(A_i), \cdot \rangle_{\mathcal{G}}\). Define \(\Phi_B : \mathcal{H} \rightarrow \mathbb{R}^n\) analogously, replacing \(A_i\) with \(B_i\). Let \(K_A\) and \(K_B\) be the corresponding kernel matrices.

Existence of minimizers

There exist coefficients \(\hat{\alpha}, \hat{\beta} \in \mathbb{R}^n\) such that minimizers take the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) and \(\hat{h} = \Phi_B^* \hat{\beta}\).

Remark (Subsetted estimator)

The result remains true for the subsetted estimator.

Proof

To begin, write the objective \(\mathcal{E}(g,h)\) as

\[\begin{split}\mathbb{E}_n \left\{ 2 V'_{g,h} \hat{f}'_{g,h}(C') - \hat{f}'_{g,h}(C')^2 \right\} + \mu' \mathbb{E}_n \{ g(A)^2 \} \\ + \mathbb{E}_n \left\{ 2 V_{g,h} \hat{f}_{g,h}(C) - \hat{f}_{g,h}(C)^2 \right\} + \mu \mathbb{E}_n \{ h(B)^2 \}\end{split}\]

By the existence and formula of maximizers,

\[\begin{split}\hat{f}'_{g,h}(C') = \langle \hat{f}'_{g,h}, \phi(C') \rangle_{\mathcal{F}} = \langle \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h}, \phi(C') \rangle_{\mathcal{F}} \\ \hat{f}_{g,h}(C) = \langle \hat{f}_{g,h}, \phi(C) \rangle_{\mathcal{F}} = \langle \Phi_{C}^* K_{C}^{\dagger} \vec{V}_{g,h}, \phi(C) \rangle_{\mathcal{F}}\end{split}\]

Hence \((g,h)\) appear only via \(V'_{g,h} = g(A) - Y\), \(V_{g,h} = h(B) - g(A)\), and directly as \(g(A)\) and \(h(B)\). Each of these can be expressed through \(g(A) = \langle g, \phi(A) \rangle_{\mathcal{G}}\) and \(h(B) = \langle h, \phi(B) \rangle_{\mathcal{H}}\), which are linear functionals of \(g\) and \(h\). The overall objective is quadratic in such terms, so the stated objective has minimizers \((\hat{g}, \hat{h})\) that attain the minimum.

By a similar argument to the existence of maximizers, for any \((\hat{g}, \hat{h})\) attaining the minimum, \(\mathcal{E}(\hat{g}, \hat{h}) = \mathcal{E}(\hat{g}_n, \hat{h}_n)\) where \(\hat{g}_n \in \text{row}(\Phi_A)\) and \(\hat{h}_n \in \text{row}(\Phi_B)\).

Properties of pseudo-inverse

For any symmetric matrix \(K \in \mathbb{R}^{n \times n}\), its eigendecomposition is \(K = U \Sigma U^{\top}\) where \(U \in \mathbb{R}^{n \times r}\) has orthonormal columns, \(\Sigma \in \mathbb{R}^{r \times r}\), and \(r \leq n\). Its pseudo-inverse is \(K^{\dagger} = U \Sigma^{\dagger} U^{\top}\). Moreover, \(K^{\dagger} K = KK^{\dagger} = UU^{\top}\), which is a projection.

To lighten notation, let \(P_C = K_C^{\dagger} K_C\) and \(P_{C'} = K_{C'}^{\dagger} K_{C'}\).
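
These properties are easy to confirm numerically. The sketch below is only an illustration: the rank-deficient matrix is built from hypothetical random features, and the eigenvalue threshold is an assumed numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 3
feat = rng.normal(size=(n, r))       # hypothetical features, so rank(K) = r < n
K = feat @ feat.T                    # symmetric positive semidefinite

evals, evecs = np.linalg.eigh(K)
keep = evals > 1e-10 * evals.max()   # retain the r nonzero eigenvalues
U, Sigma = evecs[:, keep], np.diag(evals[keep])

K_pinv = U @ np.linalg.inv(Sigma) @ U.T      # K^+ = U Sigma^+ U^T
P = K_pinv @ K                               # K^+ K

print(np.allclose(P, U @ U.T))       # checks K^+ K = U U^T
print(np.allclose(P, K @ K_pinv))    # checks K^+ K = K K^+
print(np.allclose(P @ P, P))         # checks idempotence (projection)
```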

Formula of minimizers

The explicit formula for the coefficients is

\[\begin{split}\hat{\beta} = \left[ K_A \left\{ - P_C + \left( P_{C'} + P_C + \mu' \right) K_A \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) \right\} K_B \right]^{\dagger} K_A P_{C'} Y \\ \hat{\alpha} = \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) K_B \hat{\beta}\end{split}\]

`rkhs2iv.RKHS2IVL2`([kernel, gamma, degree, ...])	Nested RKHS IV estimator with L2 regularization.
`rkhs2iv.RKHS2IVL2CV`([kernel, gamma, degree, ...])	Nested RKHS IV estimator with L2 regularization and cross-validation.
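
The two coefficient formulas can be transcribed almost literally into NumPy. The following is a sketch under stated assumptions, not the code behind `rkhs2iv.RKHS2IVL2`: the function name `estimator3_coefficients`, the parameter names `mu` and `mu_p` (for \(\mu\) and \(\mu'\)), the truncation level, and the toy features are hypothetical, and scalars added to matrices are read as multiples of the identity.

```python
import numpy as np

def pinv(M, rcond=1e-10):
    # Truncated pseudo-inverse; the cutoff is an illustrative choice.
    return np.linalg.pinv(M, rcond=rcond)

def estimator3_coefficients(K_A, K_B, K_C, K_Cp, Y, mu, mu_p):
    n = K_A.shape[0]
    I = np.eye(n)
    P_C = pinv(K_C) @ K_C                  # P_C = K_C^+ K_C
    P_Cp = pinv(K_Cp) @ K_Cp               # P_{C'} = K_{C'}^+ K_{C'}
    M_pinv = pinv(K_B @ P_C @ K_A)         # (K_B P_C K_A)^+
    R = K_B @ (P_C + mu * I)               # K_B (P_C + mu)
    inner = -P_C + (P_Cp + P_C + mu_p * I) @ K_A @ M_pinv @ R
    beta = pinv(K_A @ inner @ K_B) @ (K_A @ P_Cp @ Y)
    alpha = M_pinv @ R @ K_B @ beta
    return alpha, beta

# Toy data with explicit (hypothetical) features; kernels are Gram matrices.
rng = np.random.default_rng(0)
n = 60
Z = rng.normal(size=(n, 5))
feat_C = Z
feat_Cp = Z + 0.1 * rng.normal(size=(n, 5))
feat_A = Z[:, :3] + 0.3 * rng.normal(size=(n, 3))
feat_B = Z[:, :3] + 0.3 * rng.normal(size=(n, 3))
Y = feat_A @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.normal(size=n)

K_A, K_B = feat_A @ feat_A.T, feat_B @ feat_B.T
K_C, K_Cp = feat_C @ feat_C.T, feat_Cp @ feat_Cp.T

mu, mu_p = 0.1, 0.1
alpha_hat, beta_hat = estimator3_coefficients(K_A, K_B, K_C, K_Cp, Y, mu, mu_p)
G_hat = K_A @ alpha_hat                    # fitted values g_hat(A_i)
H_hat = K_B @ beta_hat                     # fitted values h_hat(B_i)
```

In this toy design the features of \(A\) and \(B\) have the same dimension and are strongly related to those of \(C\); that choice is made only so that the pseudo-inverses in the example act as ordinary inverses on the relevant subspaces.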

Proof

We proceed in steps.

Write the objective \(\mathcal{E}(g,h)\) as

\[\begin{split}2 \langle \hat{f}'_{g,h}, \hat{\mu}'_{g,h} \rangle_{\mathcal{F}} - \langle \hat{f}'_{g,h}, \hat{T}_{C'} \hat{f}'_{g,h} \rangle_{\mathcal{F}} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + 2 \langle \hat{f}_{g,h}, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle \hat{f}_{g,h}, \hat{T}_C \hat{f}_{g,h} \rangle_{\mathcal{F}} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

where

\[\hat{\mu}'_{g,h} = \frac{1}{n} \Phi_{C'}^* \vec{V}'_{g,h}, \quad \hat{\mu}_{g,h} = \frac{1}{n} \Phi_C^* \vec{V}_{g,h}\]

and the covariance operators are defined analogously to the formula of maximizers. Hence,

\[\begin{split}\mathcal{E}(g,h) = 2 \langle \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h}, \hat{\mu}'_{g,h} \rangle_{\mathcal{F}} - \langle \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h}, \hat{T}_{C'} \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h} \rangle_{\mathcal{F}} \\ + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + 2 \langle \Phi_C^* K_C^{\dagger} \vec{V}_{g,h}, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle \Phi_C^* K_C^{\dagger} \vec{V}_{g,h}, \hat{T}_C \Phi_C^* K_C^{\dagger} \vec{V}_{g,h} \rangle_{\mathcal{F}} \\ + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

\[\begin{split}= \frac{2}{n} (\vec{V}'_{g,h})^{\top} K_{C'}^{\dagger} \Phi_{C'} \Phi_{C'}^* \vec{V}'_{g,h} - \frac{1}{n} (\vec{V}'_{g,h})^{\top} K_{C'}^{\dagger} \Phi_{C'} \Phi_{C'}^* \Phi_{C'} \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h} \\ + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + \frac{2}{n} \vec{V}_{g,h}^{\top} K_C^{\dagger} \Phi_C \Phi_C^* \vec{V}_{g,h} - \frac{1}{n} \vec{V}_{g,h}^{\top} K_C^{\dagger} \Phi_C \Phi_C^* \Phi_C \Phi_C^* K_C^{\dagger} \vec{V}_{g,h} \\ + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

\[\begin{split}= \frac{1}{n} (\vec{V}'_{g,h})^{\top} P_{C'} \vec{V}'_{g,h} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + \frac{1}{n} \vec{V}_{g,h}^{\top} P_C \vec{V}_{g,h} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

Let \(Y, G, H \in \mathbb{R}^n\) be defined with \(G_i = g(A_i)\) and \(H_i = h(B_i)\). In this notation,

\[\begin{split}\frac{1}{n} (\vec{V}'_{g,h})^{\top} P_{C'} \vec{V}'_{g,h} = \frac{1}{n} (Y^{\top} P_{C'} Y - 2 G^{\top} P_{C'} Y + G^{\top} P_{C'} G), \quad \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} = \frac{\mu'}{n} G^{\top} G \\ \frac{1}{n} \vec{V}_{g,h}^{\top} P_C \vec{V}_{g,h} = \frac{1}{n} (H^{\top} P_C H - 2 G^{\top} P_C H + G^{\top} P_C G), \quad \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}} = \frac{\mu}{n} H^{\top} H\end{split}\]

Combining with \(G = \Phi_A g = K_A \alpha\) and \(H = \Phi_B h = K_B \beta\) from the existence of minimizers,

\[\begin{split}n \mathcal{E}(\alpha, \beta) = Y^{\top} P_{C'} Y - 2 G^{\top} (P_{C'} Y + P_C H) + G^{\top} (P_{C'} + P_C + \mu') G + H^{\top} (P_C + \mu) H \\ = Y^{\top} P_{C'} Y - 2 \alpha^{\top} K_A (P_{C'} Y + P_C K_B \beta) + \alpha^{\top} K_A (P_{C'} + P_C + \mu') K_A \alpha \\ \quad + \beta^{\top} K_B (P_C + \mu) K_B \beta\end{split}\]

The first order conditions yield

\[\begin{split}0 = -2 K_A (P_{C'} Y + P_C K_B \hat{\beta}) + 2 K_A (P_{C'} + P_C + \mu') K_A \hat{\alpha} \\ 0 = -2 K_B P_C K_A \hat{\alpha} + 2 K_B (P_C + \mu) K_B \hat{\beta} \Longrightarrow \hat{\alpha} = \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) K_B \hat{\beta}\end{split}\]

Substituting the latter into the former,

\[K_A P_{C'} Y + K_A P_C K_B \hat{\beta} = K_A (P_{C'} + P_C + \mu') K_A \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) K_B \hat{\beta}\]

and solving for \(\hat{\beta}\),

\[\hat{\beta} = \left[ K_A \left\{ - P_C + \left( P_{C'} + P_C + \mu' \right) K_A \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) \right\} K_B \right]^{\dagger} K_A P_{C'} Y\]

Remark (Subsetted estimator)

Formula of minimizers (Subsetted estimator)

The explicit formula for the coefficients is

\[\begin{split}\hat{\beta} = \left[ K_A \left\{ - \tilde{P}_C + \left( \tilde{P}_{C'} + \tilde{P}_C + \mu' \right) K_A \left( K_B \tilde{P}_C K_A \right)^{\dagger} K_B \left( \tilde{P}_C + \mu \right) \right\} K_B \right]^{\dagger} K_A \tilde{P}_{C'} Y \\ \hat{\alpha} = \left( K_B \tilde{P}_C K_A \right)^{\dagger} K_B \left( \tilde{P}_C + \mu \right) K_B \hat{\beta}\end{split}\]

where \(\tilde{P}_{C'} = \frac{n}{p} I_{[p]}^{\top} P_{C';[p,p]} I_{[p]}\) and \(\tilde{P}_C = \frac{n}{q} I_{[q]}^{\top} P_{C;[q,q]} I_{[q]}\). Note that \(P_{C';[p,p]} = (K_{C';[p,p]})^{\dagger} K_{C';[p,p]}\) and \(K_{C';[p,p]} = I_{[p]} K_{C'} I_{[p]}^{\top}\).

Proof

We proceed in steps.

Write the objective \(\mathcal{E}(g,h)\) as

\[\begin{split}2 \langle \hat{f}'_{g,h}, \hat{\mu}'_{g,h;[p]} \rangle_{\mathcal{F}} - \langle \hat{f}'_{g,h}, \hat{T}_{C';[p,p]} \hat{f}'_{g,h} \rangle_{\mathcal{F}} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + 2 \langle \hat{f}_{g,h}, \hat{\mu}_{g,h;[q]} \rangle_{\mathcal{F}} - \langle \hat{f}_{g,h}, \hat{T}_{C;[q,q]} \hat{f}_{g,h} \rangle_{\mathcal{F}} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

where

\[\hat{\mu}'_{g,h;[p]} = \frac{1}{p} \Phi_{C';[p]}^* \vec{V}'_{g,h;[p]}, \quad \hat{\mu}_{g,h;[q]} = \frac{1}{q} \Phi_{C;[q]}^* \vec{V}_{g,h;[q]}\]

and the covariance operators are defined analogously on the index sets \([p]\) and \([q]\). Hence,

\[\begin{split}\mathcal{E}(g,h) = \frac{1}{p} (\vec{V}'_{g,h;[p]})^{\top} P_{C';[p,p]} \vec{V}'_{g,h;[p]} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + \frac{1}{q} \vec{V}_{g,h;[q]}^{\top} P_{C;[q,q]} \vec{V}_{g,h;[q]} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

Let \(Y, G, H \in \mathbb{R}^n\) be defined with \(G_i = g(A_i)\) and \(H_i = h(B_i)\) as before. Now, let \(\tilde{P}_{C'} = \frac{n}{p} I_{[p]}^{\top} P_{C';[p,p]} I_{[p]}\) and \(\tilde{P}_C = \frac{n}{q} I_{[q]}^{\top} P_{C;[q,q]} I_{[q]}\). Then

\[\begin{split}\frac{1}{p} (\vec{V}'_{g,h;[p]})^{\top} P_{C';[p,p]} \vec{V}'_{g,h;[p]} = \frac{1}{n} (Y^{\top} \tilde{P}_{C'} Y - 2 G^{\top} \tilde{P}_{C'} Y + G^{\top} \tilde{P}_{C'} G) \\ \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} = \frac{\mu'}{n} G^{\top} G \\ \frac{1}{q} \vec{V}_{g,h;[q]}^{\top} P_{C;[q,q]} \vec{V}_{g,h;[q]} = \frac{1}{n} (H^{\top} \tilde{P}_C H - 2 G^{\top} \tilde{P}_C H + G^{\top} \tilde{P}_C G) \\ \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}} = \frac{\mu}{n} H^{\top} H\end{split}\]

The remainder of the argument is identical to the proof of the formula of minimizers, with \(\tilde{P}_{C'}\) and \(\tilde{P}_C\) in place of \(P_{C'}\) and \(P_C\).
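
The key step in this proof is that a quadratic form on the subsample can be rewritten as a quadratic form on the full sample through \(\tilde{P}\). The sketch below checks that identity for \(C'\) on hypothetical data; the helper names, the choice of \([p]\) as the first \(p\) observations, and the truncation level are assumptions.

```python
import numpy as np

def selection_matrix(idx, n):
    # I_[p] in R^{p x n}: row j has a single 1 in column idx[j].
    I_sel = np.zeros((len(idx), n))
    I_sel[np.arange(len(idx)), idx] = 1.0
    return I_sel

def subsetted_projection(K, idx):
    n = K.shape[0]
    I_sel = selection_matrix(idx, n)
    K_sub = I_sel @ K @ I_sel.T                           # K_{[p,p]}
    P_sub = np.linalg.pinv(K_sub, rcond=1e-10) @ K_sub    # P_{[p,p]}
    P_tilde = (n / len(idx)) * I_sel.T @ P_sub @ I_sel    # (n/p) I^T P_{[p,p]} I
    return P_tilde, P_sub

rng = np.random.default_rng(0)
n, p = 12, 5
feat = rng.normal(size=(n, 3))         # hypothetical features of C'
K_Cp = feat @ feat.T
p_idx = np.arange(p)                   # [p]: first p observations (0-based)
P_tilde, P_sub = subsetted_projection(K_Cp, p_idx)

V = rng.normal(size=n)                 # stands in for V'_{g,h}
lhs = V[p_idx] @ P_sub @ V[p_idx] / p  # (1/p) V_[p]^T P_[p,p] V_[p]
rhs = V @ P_tilde @ V / n              # (1/n) V^T P_tilde V
```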

Nyström approximation

Computation of kernel methods may be demanding due to the inversions of matrices that scale with \(n\) such as \(K_B \in \mathbb{R}^{n \times n}\). One solution is Nyström approximation. We now provide alternative expressions for the minimizers \((\hat{g}, \hat{h})\) that lend themselves to Nyström approximation, then describe the procedure.

Minimizer sufficient statistics

The minimizers may be expressed as

\[\hat{g} = \left(\Phi_B^* P_C \Phi_A\right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h},\]

\[\hat{h} = \left[ \Phi_A^* \left\{ -P_C + \left( P_{C'} + P_C + \mu' \right) \Phi_A \left( \Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* \left( P_C + \mu \right) \right\} \Phi_B \right]^{\dagger} \Phi_A^* P_{C'} Y.\]

Proof

We proceed in steps.

By the proof of the Formula of minimizers of Estimator 3, with \(G = \Phi_A g\) and \(H = \Phi_B h\),

\[\begin{split}n \mathcal{E}(g, h) &= Y^{\top} P_{C'} Y - 2 G^{\top} (P_{C'} Y + P_C H) \\ & \quad + G^{\top} (P_{C'} + P_C + \mu') G + H^{\top} (P_C + \mu) H \\ &= Y^{\top} P_{C'} Y - 2 g^* \Phi_A^* (P_{C'} Y + P_C \Phi_B h) \\ & \quad + g^* \Phi_A^* (P_{C'} + P_C + \mu') \Phi_A g + h^* \Phi_B^* (P_C + \mu) \Phi_B h.\end{split}\]

Informally, the first order conditions yield

\[\begin{split}0 &= -2 \Phi_A^* (P_{C'} Y + P_C \Phi_B \hat{h}) + 2 \Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \hat{g}, \\ 0 &= -2 \Phi_B^* P_C \Phi_A \hat{g} + 2 \Phi_B^* (P_C + \mu) \Phi_B \hat{h}.\end{split}\]

See De Vito and Caponnetto (2005) (Proof of Proposition 2) for the formal way of deriving the first order condition, which incurs additional notation.

Rearranging and taking pseudo-inverses, we arrive at two equations:

\[\Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \hat{g} = \Phi_A^* (P_{C'} Y + P_C \Phi_B \hat{h}),\]

\[\Phi_B^* P_C \Phi_A \hat{g} = \Phi_B^* (P_C + \mu) \Phi_B \hat{h} \Longrightarrow \hat{g} = \left(\Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h}.\]

Substituting the latter into the former,

\[\Phi_A^* P_{C'} Y + \Phi_A^* P_C \Phi_B \hat{h} = \Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \left(\Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h},\]

and solving for \(\hat{h}\),

\[\hat{h} = \left[ \Phi_A^* \left\{ -P_C + \left( P_{C'} + P_C + \mu' \right) \Phi_A \left( \Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* \left( P_C + \mu \right) \right\} \Phi_B \right]^{\dagger} \Phi_A^* P_{C'} Y.\]

Remark (Nyström subsetted estimator)

Formula of minimizers (Subsetted estimator)

The subsetted minimizers may be expressed as

\[\hat{g} = \left(\Phi_B^* \tilde{P}_C \Phi_A \right)^{\dagger} \Phi_B^* (\tilde{P}_C + \mu) \Phi_B \hat{h},\]

\[\hat{h} = \left[ \Phi_A^* \left\{ -\tilde{P}_C + \left( \tilde{P}_{C'} + \tilde{P}_C + \mu' \right) \Phi_A \left( \Phi_B^* \tilde{P}_C \Phi_A \right)^{\dagger} \Phi_B^* \left( \tilde{P}_C + \mu \right) \right\} \Phi_B \right]^{\dagger} \Phi_A^* \tilde{P}_{C'} Y.\]

Proof

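The argument is analogous to the proof of the formula of minimizers for the subsetted estimator above, with \(\tilde{P}_C\) and \(\tilde{P}_{C'}\) in place of \(P_C\) and \(P_{C'}\).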

Properties of pseudo-inverse

Continuing the notation of (Properties of pseudo-inverse) above, if \(\Phi = U \Sigma^{1/2} V^{\top}\) and \(K = \Phi \Phi^*\), then \(P = UU^{\top} = K^{\dagger} K = \Phi \Phi^{\dagger}\).

Combining (Minimizer sufficient statistics) and (Properties of pseudo-inverse), we conclude that sufficient statistics for \((\hat{g}, \hat{h})\) are feature operators. Within the feature operator \(\Phi\), the \(i\) th row \(\langle \phi(X_i), \cdot \rangle\) may be viewed as an infinite dimensional vector.

Nyström approximation is a way to approximate infinite dimensional vectors with finite dimensional ones. It uses the substitution \(\phi(x) \mapsto \check{\phi}(x) = (K_{\mathcal{S} \mathcal{S}})^{-\frac{1}{2}} K_{\mathcal{S} x}\), where \(\mathcal{S}\) is a subset of \(s = |\mathcal{S}| \ll n\) observations called landmarks. \(K_{\mathcal{S} \mathcal{S}} \in \mathbb{R}^{s \times s}\) is defined such that \((K_{\mathcal{S} \mathcal{S}})_{ij} = k(X_i, X_j)\) for \(i, j \in \mathcal{S}\). Similarly, \(K_{\mathcal{S} x} \in \mathbb{R}^s\) is defined such that \((K_{\mathcal{S} x})_i = k(X_i, x)\) for \(i \in \mathcal{S}\).

In summary, the approximate sufficient statistics are of the form \(\check{\Phi} \in \mathbb{R}^{n \times s}\), i.e. a matrix whose \(i\) th row \(\langle \check{\phi}(X_i), \cdot \rangle\) may be viewed as a vector in \(\mathbb{R}^s\).
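
The substitution \(\phi(x) \mapsto \check{\phi}(x)\) can be sketched in a few lines of NumPy. This is an illustration only and not the code behind `rkhsiv.ApproxRKHSIV`: the Gaussian kernel, the uniform random choice of landmarks, the eigenvalue threshold `eps`, and all function names are assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) (assumed kernel).
    sq = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, landmarks, gamma=1.0, eps=1e-10):
    # Row i is phi_check(X_i)^T = K_{X_i S} (K_SS)^{-1/2}, a vector in R^s.
    K_SS = rbf_kernel(landmarks, landmarks, gamma)
    evals, evecs = np.linalg.eigh(K_SS)
    keep = evals > eps * evals.max()       # drop numerically zero directions
    inv_sqrt = evecs[:, keep] @ np.diag(evals[keep] ** -0.5) @ evecs[:, keep].T
    return rbf_kernel(X, landmarks, gamma) @ inv_sqrt

rng = np.random.default_rng(0)
n, s = 200, 20
X = rng.uniform(-2.0, 2.0, size=(n, 1))
S_idx = rng.choice(n, size=s, replace=False)   # landmarks chosen uniformly (assumption)

Phi_check = nystrom_features(X, X[S_idx])      # approximate feature operator, n x s
K_approx = Phi_check @ Phi_check.T             # low-rank approximation of K
K_exact = rbf_kernel(X, X)
max_abs_err = np.abs(K_exact - K_approx).max()
```

With \(\check{\Phi}\) in hand, the expressions in (Minimizer sufficient statistics) involve only \(s\)-dimensional objects on the feature side, which is where the computational saving comes from.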

Closed form - Estimator 3 (RKHS norm)

We study the RKHS-norm regularized joint estimator:

\[\begin{split}(\hat{g}, \hat{h}) = \arg \min_{g \in \mathcal{G}, h \in \mathcal{H}} \max_{f' \in \mathcal{F}'} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] -\lambda'\|f'\|_{\mathcal{F}'}^2 + \mu' \| g \|_{\mathcal{G}}^2 \\ \quad + \max_{f \in \mathcal{F}} \mathbb{E}_n \left[ 2 \left\{ h(B) - g(A) \right\} f(C) - f(C)^2 \right] -\lambda\|f\|_{\mathcal{F}}^2 + \mu \| h \|_{\mathcal{H}}^2\end{split}\]

Formula of minimizers

The minimizers take the form \(\hat{g} = \Phi_A^*\hat\alpha\), \(\hat{h} = \Phi_B^*\hat\beta\) where,

\[\begin{split}\hat{\beta} &= \left[ K_A \left\{ - P_C + \left(P_{C'} K_A + P_C K_A + \mu'\right) \left( K_B P_C K_A \right)^{\dagger} \left( K_B P_C + \mu \right)\right\} K_B \right]^{\dagger} K_A P_{C'} Y \\ \hat{\alpha} &= \left( K_B P_C K_A \right)^{\dagger} \left( K_B P_C + \mu \right) K_B \hat{\beta}\end{split}\]

and

\[\begin{split}P_C &= \left(K_C+\lambda\right)^{\dagger}K_C \\ P_{C'} &= \left(K_{C'}+\lambda'\right)^{\dagger}K_{C'}\end{split}\]

`rkhs2iv.RKHS2IV`([kernel, gamma, degree, ...])	Nested RKHS IV estimator.
`rkhs2iv.RKHS2IVCV`([kernel, gamma, degree, ...])	Nested RKHS IV estimator with cross-validation.

Remark: Subsetted estimator

Formula of minimizers (Subsetted estimator)

The subsetted estimator satisfies:

\[\begin{split}\hat{\beta} &= \left[ K_A \left\{ - \tilde{P}_C + \left(\tilde{P}_{C'} K_A + \tilde{P}_C K_A + \mu'\right) \left( K_B \tilde{P}_C K_A \right)^{\dagger} \left( K_B \tilde{P}_C + \mu \right)\right\} K_B \right]^{\dagger} K_A \tilde{P}_{C'} Y \\ \hat{\alpha} &= \left( K_B \tilde{P}_C K_A \right)^{\dagger} \left( K_B \tilde{P}_C + \mu \right) K_B \hat{\beta}\end{split}\]

with \(\tilde{P}_{C'}=\frac{n}{p}I_{[p]}^{\top}P_{C';[p,p]}I_{[p]}\) and \(\tilde{P}_{C}=\frac{n}{q}I_{[q]}^{\top}P_{C;[q,q]}I_{[q]}\), where

\[\begin{split}P_{C';[p,p]}&=(K_{C';[p,p]}+\lambda' I_{[p]}I_{[p]}^\top)^{\dagger}K_{C';[p,p]}\;, \qquad K_{C';[p,p]}=I_{[p]}K_{C'}I_{[p]}^{\top} \\ P_{C;[q,q]}&=(K_{C;[q,q]}+\lambda I_{[q]}I_{[q]}^\top)^{\dagger}K_{C;[q,q]}\;, \qquad K_{C;[q,q]}=I_{[q]}K_{C}I_{[q]}^{\top}\end{split}\]