Reproducing Kernel Hilbert Space

In this section we assume that the function classes \(\mathcal{G}\), \(\mathcal{H}\), \(\mathcal{F}\), \(\mathcal{F}^\prime\) are RKHSs. Let \(\Phi_A:\mathcal{G}\rightarrow\mathbb{R}^n\) be the operator whose \(i\)th row is \(\langle \phi(A_i), \cdot \rangle_{\mathcal{G}}\), with corresponding kernel matrix \(K_A\). Define \(\Phi_B, \ldots\) analogously for the remaining function classes.

Closed form - Estimator 1

We study the estimator

\[\hat{g} = \arg \min_{g \in \mathcal{G}} \max_{f' \in \mathcal{F'}} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] - \lambda \| f' \|_{\mathcal{F'}}^2 + \mu' \| g \|_{\mathcal{G}}^2\]

Formula of minimizers

The minimizer takes the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) where,

\[\begin{split}\hat{\alpha} &= \left(K_A P_{C'} K_A + \mu' K_A \right)^{\dagger} K_A P_{C'} Y \\ P_{C'} &= \left(K_{C'} + \lambda \right)^{\dagger} K_{C'}\end{split}\]
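As a rough numerical illustration of the closed form above, the following NumPy sketch computes \(\hat{\alpha}\) from kernel matrices. It is not the `rkhsiv` implementation: the helper names `rbf_kernel` and `estimator1_alpha`, the Gaussian kernel, the toy data, and the pseudo-inverse cutoff `rcond=1e-10` are all assumptions made for illustration, and \(K_{C'} + \lambda\) is read as \(K_{C'} + \lambda I\).

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def estimator1_alpha(K_A, K_Cp, Y, lam, mu):
    """alpha_hat = (K_A P K_A + mu K_A)^+ K_A P Y, with P = (K_C' + lam I)^+ K_C'."""
    n = K_A.shape[0]
    P = np.linalg.pinv(K_Cp + lam * np.eye(n), rcond=1e-10) @ K_Cp
    M = K_A @ P @ K_A + mu * K_A
    return np.linalg.pinv(M, rcond=1e-10) @ (K_A @ P @ Y)

# Hypothetical toy data: C' plays the role of the instrument, A the regressor.
rng = np.random.default_rng(0)
n = 60
C_prime = rng.normal(size=(n, 1))
A = C_prime + 0.5 * rng.normal(size=(n, 1))
Y = np.sin(A[:, 0]) + 0.1 * rng.normal(size=n)

K_A = rbf_kernel(A, A)
K_Cp = rbf_kernel(C_prime, C_prime)
alpha_hat = estimator1_alpha(K_A, K_Cp, Y, lam=0.1, mu=0.1)

# Fitted values g_hat(A_i) = (Phi_A Phi_A^* alpha_hat)_i = (K_A alpha_hat)_i.
g_hat_at_A = K_A @ alpha_hat
```

Predictions at a new point \(a\) would use the kernel vector between \(a\) and the training sample in place of a row of `K_A`.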

rkhsiv.RKHSIV([kernel, gamma, degree, ...])

RKHS IV estimator.

rkhsiv.RKHSIVCV([kernel, gamma, degree, ...])

RKHS IV estimator with cross-validation.

Remark (Nyström approximation) A low-rank approximation using the Nyström method is also implemented.

rkhsiv.ApproxRKHSIV([kernel_approx, ...])

Approximate RKHS IV estimator using kernel approximations.

rkhsiv.ApproxRKHSIVCV([kernel_approx, ...])

Approximate RKHS IV estimator with cross-validation using kernel approximations.

Closed form - Estimator 2

We study the estimator

\[\hat{g} = \arg \min_{g \in \mathcal{G}} \max_{f' \in \mathcal{F'}} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] + \mu' \mathbb{E}_n \{ g(A)^2 \}\]

Formula of minimizers

The minimizer takes the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) where,

\[\begin{split}\hat{\alpha} &= \left( K_A P_{C'} K_A + \mu' K_A^2 \right)^{\dagger} K_A P_{C'} Y \\ P_{C'} &= K_{C'}^{\dagger} K_{C'}\end{split}\]
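One way to sanity-check this formula is to specialize it to linear kernels with the penalty on \(g\) set to zero, in which case it should reproduce the two-stage least squares coefficient. The sketch below does this in NumPy. The function name `estimator2_alpha`, the toy design, and the cutoff `rcond=1e-10` are illustrative assumptions, not part of the package.

```python
import numpy as np

def estimator2_alpha(K_A, K_Cp, Y, mu):
    """alpha_hat = (K_A P K_A + mu K_A^2)^+ K_A P Y, with P = K_C'^+ K_C'."""
    P = np.linalg.pinv(K_Cp, rcond=1e-10) @ K_Cp
    M = K_A @ P @ K_A + mu * (K_A @ K_A)
    return np.linalg.pinv(M, rcond=1e-10) @ (K_A @ P @ Y)

# Hypothetical linear design: 3 instruments (C'), 2 regressors (A).
rng = np.random.default_rng(0)
n = 40
C_prime = rng.normal(size=(n, 3))
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A = C_prime @ W + 0.3 * rng.normal(size=(n, 2))
Y = A @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

# Linear kernels: K = X X', so g_hat(a) = <a, theta_hat> with theta_hat = A' alpha_hat.
K_A, K_Cp = A @ A.T, C_prime @ C_prime.T
alpha_hat = estimator2_alpha(K_A, K_Cp, Y, mu=0.0)
theta_hat = A.T @ alpha_hat

# Two-stage least squares coefficient, computed directly for comparison.
P = C_prime @ np.linalg.pinv(C_prime)
theta_2sls = np.linalg.solve(A.T @ P @ A, A.T @ P @ Y)
```

With a positive penalty the two coefficients no longer coincide, since the \(K_A^2\) term shrinks the fitted values \(K_A \hat{\alpha}\) toward zero.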

rkhsiv.RKHSIVL2([kernel, gamma, degree, ...])

RKHS IV estimator with L2 regularization.

rkhsiv.RKHSIVL2CV([kernel, gamma, degree, ...])

RKHS IV estimator with L2 regularization and cross-validation.

Remark (Nyström/RFF approximation) Low-rank feature approximations for the L2 sequential estimator are also implemented.

rkhsiv.ApproxRKHSIVL2([kernel_approx, ...])

Approximate RKHS IV estimator with L2 regularization.

rkhsiv.ApproxRKHSIVL2CV([kernel_approx, ...])

Approximate RKHS IV L2 estimator with cross-validation.

Closed form - Estimator 3

We study the ridge regularized joint estimator:

\[\begin{split}(\hat{g}, \hat{h}) = \arg \min_{g \in \mathcal{G}, h \in \mathcal{H}} \max_{f' \in \mathcal{F}} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] + \mu' \mathbb{E}_n \{ g(A)^2 \} \\ \quad + \max_{f \in \mathcal{F}} \mathbb{E}_n \left[ 2 \left\{ h(B) - g(A) \right\} f(C) - f(C)^2 \right] + \mu \mathbb{E}_n \{ h(B)^2 \}\end{split}\]

Let \(V_{g,h}' = g(A) - Y\) and \(V_{g,h} = h(B) - g(A)\). Let \(\Phi_C : \mathcal{F} \rightarrow \mathbb{R}^n\) be an operator with \(i\) th row \(\langle \phi(C_i), \cdot \rangle_{\mathcal{F}}\). Define \(\Phi_{C'}\) analogously, replacing \(C_i\) with \(C_i'\). Let \(K_C\) and \(K_{C'}\) be the corresponding kernel matrices.

In remarks below, we also study the following modification, which we call the “subsetted” estimator:

\[\begin{split}(\hat{g}, \hat{h}) = \arg \min_{g \in \mathcal{G}, h \in \mathcal{H}} \max_{f' \in \mathcal{F}} \mathbb{E}_p \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] + \mu' \mathbb{E}_n \{ g(A)^2 \} \\ \quad + \max_{f \in \mathcal{F}} \mathbb{E}_q \left[ 2 \left\{ h(B) - g(A) \right\} f(C) - f(C)^2 \right] + \mu \mathbb{E}_n \{ h(B)^2 \}\end{split}\]

where the index sets \([p]\) and \([q]\) partition \([n] = \{1, \ldots, n\}\) and have sizes \(p\) and \(q\), so \(p + q = n\).

For the index set \([p]\), let \(I_{[p]} \in \mathbb{R}^{p \times n}\) be the matrix of ones and zeros such that \(V_{[p]} = I_{[p]} V\) gives the elements of \(V\) whose indices are in \([p]\).

Maximizers

Existence of maximizers

There exist coefficients \(\hat{\gamma}_{g,h}, \hat{\gamma}'_{g,h} \in \mathbb{R}^n\) such that maximizers take the form \(\hat{f}_{g,h} = \Phi_C^* \hat{\gamma}_{g,h}\) and \(\hat{f}'_{g,h} = \Phi_{C'}^* \hat{\gamma}'_{g,h}\).

Remark (Subsetted estimator)

For the subsetted estimator, the same results hold but with \(\hat{\gamma}_{g,h;[q]} \in \mathbb{R}^q\) and \(\hat{\gamma}'_{g,h;[p]} \in \mathbb{R}^p\), acting on appropriately modified feature operators \(\Phi^*_{C;[q]}\) and \(\Phi^*_{C';[p]}\).

Proof

Write the objectives for the maximizers as

\[\begin{split}\mathcal{E}'(f') = \mathbb{E}_n \left\{ 2 V'_{g,h} f'(C') - f'(C')^2 \right\} \\ \mathcal{E}(f) = \mathbb{E}_n \left\{ 2 V_{g,h} f(C) - f(C)^2 \right\}\end{split}\]

We prove the former result; the latter is similar. By the Riesz representation theorem,

\[\mathcal{E}(f) = \mathbb{E}_n \left\{ 2 V_{g,h} \langle f, \phi(C) \rangle_{\mathcal{F}} - \langle f, \phi(C) \rangle_{\mathcal{F}}^2 \right\}\]

For an RKHS, evaluation is a continuous linear functional represented as the inner product with the feature map. Due to the ridge penalty, the stated objective has a maximizer \(\hat{f}_{g,h}\) that attains the maximum.

To lighten notation, we suppress the indexing of \(\hat{f}_{g,h}\) by \((g,h)\) for the rest of this argument. Write \(\hat{f} = \hat{f}_n + \hat{f}^{\perp}_n\) where \(\hat{f}_n \in \text{row}(\Phi_C)\) and \(\hat{f}_n^{\perp} \in \text{null}(\Phi_C)\). Substituting this decomposition of \(\hat{f}\) into the objective, we see that

\[\mathcal{E}(\hat{f}) = \mathcal{E}(\hat{f}_n)\]

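Hence if \(\hat{f}\) is a maximizer, then \(\hat{f}_n\) is also a maximizer. Since \(\hat{f}_n \in \text{row}(\Phi_C)\), it can be written as \(\Phi_C^* \hat{\gamma}_{g,h}\) for some \(\hat{\gamma}_{g,h} \in \mathbb{R}^n\).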

Formula of maximizers

The explicit formula for the coefficients is \(\hat{\gamma}_{g,h} = K_C^{\dagger} \vec{V}_{g,h}\) and \(\hat{\gamma}'_{g,h} = K_{C'}^{\dagger} \vec{V}'_{g,h}\).

Remark (Subsetted estimator)

For the subsetted estimator, the same results hold but with \(\hat{\gamma}_{g,h;[q]} = K_{C;[q,q]}^{\dagger} \vec{V}_{g,h;[q]}\) and \(\hat{\gamma}'_{g,h;[p]} = K_{C';[p,p]}^{\dagger} \vec{V}'_{g,h;[p]}\).

Proof

We prove the former result; the latter is similar. Write the objective as

\[\mathcal{E}(f) = 2 \langle f, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle f, \hat{T}_C f \rangle_{\mathcal{F}}\]

where \(\hat{\mu}_{g,h} = \mathbb{E}_n \{ V_{g,h} \phi(C) \} = \frac{1}{n} \Phi_C^* \vec{V}_{g,h}\) and \(\hat{T}_C = \mathbb{E}_n \{ \phi(C) \otimes \phi(C)^* \} = \frac{1}{n} \Phi_C^* \Phi_C\). Hence by the existence of maximizers,

\[\mathcal{E}(\gamma) = 2 \langle \Phi_C^* \gamma_{g,h}, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle \Phi_C^* \gamma_{g,h}, \hat{T}_C \Phi_C^* \gamma_{g,h} \rangle_{\mathcal{F}} = \frac{2}{n} \gamma_{g,h}^{\top} \Phi_C \Phi_C^* \vec{V}_{g,h} - \frac{1}{n} \gamma_{g,h}^{\top} \Phi_C \Phi_C^* \Phi_C \Phi_C^* \gamma_{g,h}\]

Since \(K_C = \Phi_C \Phi_C^*\), the first order condition yields \(K_C \vec{V}_{g,h} = K_C^2 \hat{\gamma}_{g,h}\), i.e. \(\hat{\gamma}_{g,h} = K_C^{\dagger} \vec{V}_{g,h}\) where \(K_C^{\dagger}\) is the pseudoinverse of \(K_C\).
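The first order condition and the resulting maximized value can be checked numerically. The sketch below uses a hypothetical finite-dimensional feature matrix in place of \(\Phi_C\) and a random vector in place of \(\vec{V}_{g,h}\). All variable names and the cutoff `rcond=1e-10` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 4
features_C = rng.normal(size=(n, d))   # rows stand in for phi(C_i)
K_C = features_C @ features_C.T        # K_C = Phi_C Phi_C^*
V = rng.normal(size=n)                 # stand-in for the vector V_{g,h}

def objective(gamma):
    """E(gamma) = (2/n) gamma' K_C V - (1/n) gamma' K_C^2 gamma."""
    return (2.0 * gamma @ K_C @ V - gamma @ K_C @ K_C @ gamma) / n

# Candidate maximizer from the formula gamma_hat = K_C^+ V.
gamma_hat = np.linalg.pinv(K_C, rcond=1e-10) @ V

# Largest violation of the first order condition K_C V = K_C^2 gamma_hat.
foc_gap = np.max(np.abs(K_C @ V - K_C @ K_C @ gamma_hat))

# The maximized value equals (1/n) V' P_C V with P_C = K_C^+ K_C.
P_C = np.linalg.pinv(K_C, rcond=1e-10) @ K_C
max_value = V @ P_C @ V / n
```

Because the objective is concave in \(\gamma\), any perturbation of `gamma_hat` can only lower `objective`, up to floating point error.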

Minimizers

Let \(\Phi_A : \mathcal{G} \rightarrow \mathbb{R}^n\) be an operator with \(i\)th row \(\langle \phi(A_i), \cdot \rangle_{\mathcal{G}}\). Define \(\Phi_B : \mathcal{H} \rightarrow \mathbb{R}^n\) analogously, replacing \(A_i\) with \(B_i\) and \(\mathcal{G}\) with \(\mathcal{H}\). Let \(K_A\) and \(K_B\) be the corresponding kernel matrices.

Existence of minimizers

There exist coefficients \(\hat{\alpha}, \hat{\beta} \in \mathbb{R}^n\) such that minimizers take the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) and \(\hat{h} = \Phi_B^* \hat{\beta}\).

Remark (Subsetted estimator)

The result remains true for the subsetted estimator.

Proof

To begin, write the objective \(\mathcal{E}(g,h)\) as

\[\begin{split}\mathbb{E}_n \left\{ 2 V'_{g,h} \hat{f}_{g,h}'(C') - \hat{f}_{g,h}'(C')^2 \right\} + \mu' \mathbb{E}_n \{ g(A)^2 \} \\ + \mathbb{E}_n \left\{ 2 V_{g,h} \hat{f}_{g,h}(C) - \hat{f}_{g,h}(C)^2 \right\} + \mu \mathbb{E}_n \{ h(B)^2 \}\end{split}\]

By the existence and formula of maximizers,

\[\begin{split}\hat{f}_{g,h}'(C') = \langle \hat{f}_{g,h}', \phi(C') \rangle_{\mathcal{F}} = \langle \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h}, \phi(C') \rangle_{\mathcal{F}} \\ \hat{f}_{g,h}(C) = \langle \hat{f}_{g,h}, \phi(C) \rangle_{\mathcal{F}} = \langle \Phi_{C}^* K_{C}^{\dagger} \vec{V}_{g,h}, \phi(C) \rangle_{\mathcal{F}}\end{split}\]

Hence \((g,h)\) appear only via \(V'_{g,h} = g(A) - Y\), \(V_{g,h} = h(B) - g(A)\), and directly as \(g(A)\) and \(h(B)\). In all of these expressions, they can be further expressed as \(g(A) = \langle g, \phi(A) \rangle_{\mathcal{G}}\) and \(h(B) = \langle h, \phi(B) \rangle_{\mathcal{H}}\), which are linear functionals. The overall objective is quadratic in such terms, so the stated objective has minimizers \((\hat{g}, \hat{h})\) that attain the minimum.

By an argument similar to the existence of maximizers, for any \((\hat{g}, \hat{h})\) attaining the minimum, \(\mathcal{E}(\hat{g}, \hat{h}) = \mathcal{E}(\hat{g}_n, \hat{h}_n)\) where \(\hat{g}_n \in \text{row}(\Phi_A)\) and \(\hat{h}_n \in \text{row}(\Phi_B)\).

Properties of pseudo-inverse

For any square symmetric matrix \(K \in \mathbb{R}^{n \times n}\) of rank \(r \leq n\), its compact eigendecomposition is \(K = U \Sigma U^{\top}\) where \(U \in \mathbb{R}^{n \times r}\) has orthonormal columns and \(\Sigma \in \mathbb{R}^{r \times r}\) is diagonal with the nonzero eigenvalues. Its pseudo-inverse is \(K^{\dagger} = U \Sigma^{-1} U^{\top}\). Moreover, \(K^{\dagger} K = KK^{\dagger} = UU^{\top}\), which is a projection.

To lighten notation, let \(P_C = K_C^{\dagger} K_C\) and \(P_{C'} = K_{C'}^{\dagger} K_{C'}\).
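These pseudo-inverse properties are easy to confirm numerically. The following sketch builds a rank-deficient symmetric matrix from hypothetical random features, forms the pseudo-inverse from the retained eigenpairs, and computes \(P = K^{\dagger} K\). The eigenvalue threshold is an assumption needed to separate numerically zero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3
X = rng.normal(size=(n, r))
K = X @ X.T                                  # symmetric PSD with rank r < n

eigvals, eigvecs = np.linalg.eigh(K)
keep = eigvals > 1e-10 * eigvals.max()       # numerically nonzero eigenvalues
U = eigvecs[:, keep]                         # n x r, orthonormal columns
Sigma = np.diag(eigvals[keep])               # r x r, invertible

K_pinv = U @ np.linalg.inv(Sigma) @ U.T      # pseudo-inverse from the eigenpairs
P = K_pinv @ K                               # should equal U U'
```

The matrix `P` is symmetric and idempotent, i.e. the orthogonal projection onto the column space of `K`.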

Formula of minimizers

The explicit formula for the coefficients is

\[\begin{split}\hat{\beta} = \left[ K_A \left\{ - P_C + \left( P_{C'} + P_C + \mu' \right) K_A \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) \right\} K_B \right]^{\dagger} K_A P_{C'} Y \\ \hat{\alpha} = \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) K_B \hat{\beta}\end{split}\]

Implementation note: in the package, Appendix J / Algorithm 2 is implemented by RKHS2IVL2 and RKHS2IVL2CV. The low-rank feature approximations of the same estimator are ApproxRKHS2IVL2 and ApproxRKHS2IVL2CV.

Class mapping summary:

  • RKHS2IVL2 and RKHS2IVL2CV: Appendix J / Algorithm 2 closed-form estimator.

  • ApproxRKHS2IVL2 and ApproxRKHS2IVL2CV: low-rank approximations of the same Algorithm 2 estimator.

  • RKHS2IV and RKHS2IVCV: alternate simultaneous estimator (distinct objective; not Appendix J / Algorithm 2), documented below.

  • ApproxRKHS2IV and ApproxRKHS2IVCV: low-rank approximations of the alternate simultaneous estimator (distinct objective; not Appendix J / Algorithm 2).

Pre-estimation diagnostics are package-level and estimator-agnostic; see Universal Diagnostics API. For finite-sample calibration checks, compare normal and bootstrap CIs and inspect bias/SE decomposition diagnostics.

rkhs2iv.RKHS2IVL2([kernel, gamma, degree, ...])

Nested RKHS IV estimator with L2 regularization.

rkhs2iv.RKHS2IVL2CV([kernel, gamma, degree, ...])

Cross-validated RKHS2IVL2 estimator.

rkhs2iv.ApproxRKHS2IVL2([kernel_approx, ...])

Approximate Appendix J / Algorithm 2 RKHS estimator using finite kernel features.

rkhs2iv.ApproxRKHS2IVL2CV([kernel_approx, ...])

Cross-validated approximate Appendix J / Algorithm 2 RKHS estimator.

Proof

We proceed in steps.

  1. Write the objective \(\mathcal{E}(g,h)\) as

\[\begin{split}2 \langle \hat{f}'_{g,h}, \hat{\mu}'_{g,h} \rangle_{\mathcal{F}} - \langle \hat{f}'_{g,h}, \hat{T}_{C'} \hat{f}'_{g,h} \rangle_{\mathcal{F}} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + 2 \langle \hat{f}_{g,h}, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle \hat{f}_{g,h}, \hat{T}_C \hat{f}_{g,h} \rangle_{\mathcal{F}} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

where

\[\hat{\mu}'_{g,h} = \frac{1}{n} \Phi_{C'}^* \vec{V}'_{g,h}, \quad \hat{\mu}_{g,h} = \frac{1}{n} \Phi_C^* \vec{V}_{g,h}\]

and the covariance operators are defined analogously to the formula of maximizers. Hence,

\[\begin{split}\mathcal{E}(g,h) = 2 \langle \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h}, \hat{\mu}'_{g,h} \rangle_{\mathcal{F}} - \langle \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h}, \hat{T}_{C'} \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h} \rangle_{\mathcal{F}} \\ + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + 2 \langle \Phi_C^* K_C^{\dagger} \vec{V}_{g,h}, \hat{\mu}_{g,h} \rangle_{\mathcal{F}} - \langle \Phi_C^* K_C^{\dagger} \vec{V}_{g,h}, \hat{T}_C \Phi_C^* K_C^{\dagger} \vec{V}_{g,h} \rangle_{\mathcal{F}} \\ + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]
\[\begin{split}= \frac{2}{n} (\vec{V}'_{g,h})^{\top} K_{C'}^{\dagger} \Phi_{C'} \Phi_{C'}^* \vec{V}'_{g,h} - \frac{1}{n} (\vec{V}'_{g,h})^{\top} K_{C'}^{\dagger} \Phi_{C'} \Phi_{C'}^* \Phi_{C'} \Phi_{C'}^* K_{C'}^{\dagger} \vec{V}'_{g,h} \\ + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + \frac{2}{n} \vec{V}_{g,h}^{\top} K_C^{\dagger} \Phi_C \Phi_C^* \vec{V}_{g,h} - \frac{1}{n} \vec{V}_{g,h}^{\top} K_C^{\dagger} \Phi_C \Phi_C^* \Phi_C \Phi_C^* K_C^{\dagger} \vec{V}_{g,h} \\ + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]
\[\begin{split}= \frac{1}{n} (\vec{V}'_{g,h})^{\top} P_{C'} \vec{V}'_{g,h} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + \frac{1}{n} \vec{V}_{g,h}^{\top} P_C \vec{V}_{g,h} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]
  2. Let \(Y, G, H \in \mathbb{R}^n\) be defined with \(G_i = g(A_i)\) and \(H_i = h(B_i)\). In this notation,

\[\begin{split}\frac{1}{n} (\vec{V}'_{g,h})^{\top} P_{C'} \vec{V}'_{g,h} = \frac{1}{n} (Y^{\top} P_{C'} Y - 2 G^{\top} P_{C'} Y + G^{\top} P_{C'} G), \quad \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} = \frac{\mu'}{n} G^{\top} G \\ \frac{1}{n} \vec{V}_{g,h}^{\top} P_C \vec{V}_{g,h} = \frac{1}{n} (H^{\top} P_C H - 2 G^{\top} P_C H + G^{\top} P_C G), \quad \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}} = \frac{\mu}{n} H^{\top} H\end{split}\]

Combining with \(G = \Phi_A g = K_A \alpha\) and \(H = \Phi_B h = K_B \beta\) from the existence of minimizers,

\[\begin{split}n \mathcal{E}(\alpha, \beta) = Y^{\top} P_{C'} Y - 2 G^{\top} (P_{C'} Y + P_C H) + G^{\top} (P_{C'} + P_C + \mu') G + H^{\top} (P_C + \mu) H \\ = Y^{\top} P_{C'} Y - 2 \alpha^{\top} K_A (P_{C'} Y + P_C K_B \beta) + \alpha^{\top} K_A (P_{C'} + P_C + \mu') K_A \alpha \\ \quad + \beta^{\top} K_B (P_C + \mu) K_B \beta\end{split}\]
  3. The first order conditions yield

\[\begin{split}0 = -2 K_A (P_{C'} Y + P_C K_B \hat{\beta}) + 2 K_A (P_{C'} + P_C + \mu') K_A \hat{\alpha} \\ 0 = -2 K_B P_C K_A \hat{\alpha} + 2 K_B (P_C + \mu) K_B \hat{\beta} \Longrightarrow \hat{\alpha} = \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) K_B \hat{\beta}\end{split}\]
  4. Substituting the latter into the former,

\[K_A P_{C'} Y + K_A P_C K_B \hat{\beta} = K_A (P_{C'} + P_C + \mu') K_A \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) K_B \hat{\beta}\]

and solving for \(\hat{\beta}\),

\[\hat{\beta} = \left[ K_A \left\{ - P_C + \left( P_{C'} + P_C + \mu' \right) K_A \left( K_B P_C K_A \right)^{\dagger} K_B \left( P_C + \mu \right) \right\} K_B \right]^{\dagger} K_A P_{C'} Y\]
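The closed form for \((\hat{\alpha}, \hat{\beta})\) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the `RKHS2IVL2` implementation: the function names, the linear-kernel toy design, and the cutoff `rcond=1e-10` are hypothetical, and \(P_C + \mu\) is read as \(P_C + \mu I\). The toy design keeps \(K_A\) and \(K_B\) at the same rank, a setting in which both first order conditions hold exactly.

```python
import numpy as np

def pinv(M):
    # Explicit cutoff so that numerically zero singular values are discarded.
    return np.linalg.pinv(M, rcond=1e-10)

def estimator3_coefficients(K_A, K_B, K_C, K_Cp, Y, mu, mu_prime):
    """Return (alpha_hat, beta_hat) from the closed form for Estimator 3."""
    I = np.eye(K_A.shape[0])
    P_C, P_Cp = pinv(K_C) @ K_C, pinv(K_Cp) @ K_Cp
    # R maps K_B beta to alpha: alpha = (K_B P_C K_A)^+ K_B (P_C + mu) K_B beta.
    R = pinv(K_B @ P_C @ K_A) @ K_B @ (P_C + mu * I)
    inner = -P_C + (P_Cp + P_C + mu_prime * I) @ K_A @ R
    beta_hat = pinv(K_A @ inner @ K_B) @ (K_A @ P_Cp @ Y)
    alpha_hat = R @ K_B @ beta_hat
    return alpha_hat, beta_hat

# Hypothetical toy design with linear kernels K = X X'.
rng = np.random.default_rng(0)
n = 40
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = rng.normal(size=(n, 3))
C_prime = C + 0.5 * rng.normal(size=(n, 3))
A = C @ W + 0.3 * rng.normal(size=(n, 2))
B = C @ W + 0.3 * rng.normal(size=(n, 2))
Y = A[:, 0] - A[:, 1] + 0.1 * rng.normal(size=n)

K_A, K_B = A @ A.T, B @ B.T
K_C, K_Cp = C @ C.T, C_prime @ C_prime.T
mu, mu_prime = 0.1, 0.1
alpha_hat, beta_hat = estimator3_coefficients(K_A, K_B, K_C, K_Cp, Y, mu, mu_prime)
```

Fitted values on the sample are then `K_A @ alpha_hat` for \(\hat{g}\) and `K_B @ beta_hat` for \(\hat{h}\).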

Remark (Subsetted estimator)

Formula of minimizers (Subsetted estimator)

The explicit formula for the coefficients is

\[\begin{split}\hat{\beta} = \left[ K_A \left\{ - \tilde{P}_C + \left( \tilde{P}_{C'} + \tilde{P}_C + \mu' \right) K_A \left( K_B \tilde{P}_C K_A \right)^{\dagger} K_B \left( \tilde{P}_C + \mu \right) \right\} K_B \right]^{\dagger} K_A \tilde{P}_{C'} Y \\ \hat{\alpha} = \left( K_B \tilde{P}_C K_A \right)^{\dagger} K_B \left( \tilde{P}_C + \mu \right) K_B \hat{\beta}\end{split}\]

where \(\tilde{P}_{C'} = \frac{n}{p} I_{[p]}^{\top} P_{C';[p,p]} I_{[p]}\) and \(\tilde{P}_C = \frac{n}{q} I_{[q]}^{\top} P_{C;[q,q]} I_{[q]}\). Note that \(P_{C';[p,p]} = (K_{C';[p,p]})^{\dagger} K_{C';[p,p]}\) and \(K_{C';[p,p]} = I_{[p]} K_{C'} I_{[p]}^{\top}\); \(P_{C;[q,q]}\) and \(K_{C;[q,q]}\) are defined analogously.
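To make the selection matrices concrete, the sketch below builds \(I_{[p]}\) and the rescaled, zero-padded projection \(\tilde{P}_{C'}\) for a small hypothetical sample. The function names, the choice of the first \(p\) observations as \([p]\), and the cutoff `rcond=1e-10` are assumptions for illustration only.

```python
import numpy as np

def selection_matrix(indices, n):
    """Rows of the n x n identity indexed by `indices`, so that I @ V = V[indices]."""
    return np.eye(n)[indices, :]

def subsetted_projection(K, indices):
    """P_tilde = (n / m) I' P_sub I, with P_sub = K_sub^+ K_sub and K_sub = I K I'."""
    n, m = K.shape[0], len(indices)
    I_sub = selection_matrix(indices, n)
    K_sub = I_sub @ K @ I_sub.T
    P_sub = np.linalg.pinv(K_sub, rcond=1e-10) @ K_sub
    return (n / m) * I_sub.T @ P_sub @ I_sub

rng = np.random.default_rng(0)
n, p = 12, 5
features = rng.normal(size=(n, 2))     # hypothetical finite-dimensional features
K_Cp = features @ features.T
idx_p = np.arange(p)                   # [p]: here, the first p observations
idx_q = np.arange(p, n)                # [q]: the remaining q = n - p observations

P_tilde_Cp = subsetted_projection(K_Cp, idx_p)
V = rng.normal(size=n)

# (1/n) V' P_tilde V reproduces the subsample quadratic form (1/p) V_[p]' P_[p,p] V_[p].
full_sample_form = V @ P_tilde_Cp @ V / n
```

The factor \(n/p\) is what lets the subsample term be written with the same \(1/n\) normalization as the penalty terms.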

Proof

We proceed in steps.

  1. Write the objective \(\mathcal{E}(g,h)\) as

\[\begin{split}2 \langle \hat{f}'_{g,h}, \hat{\mu}'_{g,h;[p]} \rangle_{\mathcal{F}} - \langle \hat{f}'_{g,h}, \hat{T}_{C';[p,p]} \hat{f}'_{g,h} \rangle_{\mathcal{F}} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + 2 \langle \hat{f}_{g,h}, \hat{\mu}_{g,h;[q]} \rangle_{\mathcal{F}} - \langle \hat{f}_{g,h}, \hat{T}_{C;[q,q]} \hat{f}_{g,h} \rangle_{\mathcal{F}} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]

where

\[\hat{\mu}'_{g,h;[p]} = \frac{1}{p} \Phi_{C';[p]}^* \vec{V}'_{g,h;[p]}, \quad \hat{\mu}_{g,h;[q]} = \frac{1}{q} \Phi_{C;[q]}^* \vec{V}_{g,h;[q]}\]

and the covariance operators are defined analogously on the subsamples \([p]\) and \([q]\). Hence,

\[\begin{split}\mathcal{E}(g,h) = \frac{1}{p} (\vec{V}'_{g,h;[p]})^{\top} P_{C';[p,p]} \vec{V}'_{g,h;[p]} + \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} \\ + \frac{1}{q} \vec{V}_{g,h;[q]}^{\top} P_{C;[q,q]} \vec{V}_{g,h;[q]} + \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}}\end{split}\]
  2. Let \(Y, G, H \in \mathbb{R}^n\) be defined with \(G_i = g(A_i)\) and \(H_i = h(B_i)\) as before. Now, let \(\tilde{P}_{C'} = \frac{n}{p} I_{[p]}^{\top} P_{C';[p,p]} I_{[p]}\) and \(\tilde{P}_C = \frac{n}{q} I_{[q]}^{\top} P_{C;[q,q]} I_{[q]}\). Then

\[\begin{split}\frac{1}{p} (\vec{V}'_{g,h;[p]})^{\top} P_{C';[p,p]} \vec{V}'_{g,h;[p]} = \frac{1}{n} (Y^{\top} \tilde{P}_{C'} Y - 2 G^{\top} \tilde{P}_{C'} Y + G^{\top} \tilde{P}_{C'} G) \\ \mu' \langle g, \hat{T}_A g \rangle_{\mathcal{G}} = \frac{\mu'}{n} G^{\top} G \\ \frac{1}{q} \vec{V}_{g,h;[q]}^{\top} P_{C;[q,q]} \vec{V}_{g,h;[q]} = \frac{1}{n} (H^{\top} \tilde{P}_C H - 2 G^{\top} \tilde{P}_C H + G^{\top} \tilde{P}_C G) \\ \mu \langle h, \hat{T}_B h \rangle_{\mathcal{H}} = \frac{\mu}{n} H^{\top} H\end{split}\]

Hereafter we use the same argument as in the formula of minimizers.

Nyström approximation

Computation of kernel methods may be demanding due to the inversions of matrices that scale with \(n\) such as \(K_B \in \mathbb{R}^{n \times n}\). One solution is Nyström approximation. We now provide alternative expressions for the minimizers \((\hat{g}, \hat{h})\) that lend themselves to Nyström approximation, then describe the procedure.

Minimizer sufficient statistics

The minimizers may be expressed as

\[\hat{g} = \left(\Phi_B^* P_C \Phi_A\right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h},\]
\[\hat{h} = \left[ \Phi_A^* \left\{ -P_C + \left( P_{C'} + P_C + \mu' \right) \Phi_A \left( \Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* \left( P_C + \mu \right) \right\} \Phi_B \right]^{\dagger} \Phi_A^* P_{C'} Y.\]

Proof

We proceed in steps.

  1. By the proof of the Formula of minimizers of Estimator 3, with \(G = \Phi_A g\) and \(H = \Phi_B h\),

\[\begin{split}\begin{align*} n \mathcal{E}(g, h) &= Y^{\top} P_{C'} Y - 2 G^{\top} (P_{C'} Y + P_C H) \\ & \quad + G^{\top} (P_{C'} + P_C + \mu') G + H^{\top} (P_C + \mu) H \\ &= Y^{\top} P_{C'} Y - 2 g^* \Phi_A^* (P_{C'} Y + P_C \Phi_B h) \\ & \quad + g^* \Phi_A^* (P_{C'} + P_C + \mu') \Phi_A g + h^* \Phi_B^* (P_C + \mu) \Phi_B h. \end{align*}\end{split}\]
  2. Informally, the first order conditions yield

\[\begin{split}\begin{align*} 0 &= -2 \Phi_A^* (P_{C'} Y + P_C \Phi_B \hat{h}) + 2 \Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \hat{g}, \\ 0 &= -2 \Phi_B^* P_C \Phi_A \hat{g} + 2 \Phi_B^* (P_C + \mu) \Phi_B \hat{h}. \end{align*}\end{split}\]

See De Vito and Caponnetto (2005) (Proof of Proposition 2) for the formal way of deriving the first order condition, which incurs additional notation.

Rearranging and taking pseudo-inverses, we arrive at two equations:

\[\Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \hat{g} = \Phi_A^* (P_{C'} Y + P_C \Phi_B \hat{h}),\]
\[\Phi_B^* P_C \Phi_A \hat{g} = \Phi_B^* (P_C + \mu) \Phi_B \hat{h} \Longrightarrow \hat{g} = \left(\Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h}.\]
  3. Substituting the latter into the former,

    \[\Phi_A^* P_{C'} Y + \Phi_A^* P_C \Phi_B \hat{h} = \Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \left(\Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h},\]

    and solving for \(\hat{h}\),

    \[\hat{h} = \left[ \Phi_A^* \left\{ -P_C + \left( P_{C'} + P_C + \mu' \right) \Phi_A \left( \Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* \left( P_C + \mu \right) \right\} \Phi_B \right]^{\dagger} \Phi_A^* P_{C'} Y.\]

Remark (Nyström subsetted estimator)

Formula of minimizers (Subsetted estimator)

The subsetted minimizers may be expressed as

\[\hat{g} = \left(\Phi_B^* \tilde{P}_C \Phi_A \right)^{\dagger} \Phi_B^* (\tilde{P}_C + \mu) \Phi_B \hat{h},\]
\[\hat{h} = \left[ \Phi_A^* \left\{ -\tilde{P}_C + \left( \tilde{P}_{C'} + \tilde{P}_C + \mu' \right) \Phi_A \left( \Phi_B^* \tilde{P}_C \Phi_A \right)^{\dagger} \Phi_B^* \left( \tilde{P}_C + \mu \right) \right\} \Phi_B \right]^{\dagger} \Phi_A^* \tilde{P}_{C'} Y.\]

Proof

The argument is analogous to the proof of the Formula of minimizers (Subsetted estimator) above, with \(\tilde{P}_C\) and \(\tilde{P}_{C'}\) in place of \(P_C\) and \(P_{C'}\).

Properties of pseudo-inverse

Continuing the notation of (Properties of pseudo-inverse), if \(\Phi = U \Sigma^{1/2} V^{\top}\) and \(K = \Phi \Phi^*\), then \(P = UU^{\top} = K^{\dagger} K = \Phi \Phi^{\dagger}\).

Combining (Minimizer sufficient statistics) and (Properties of pseudo-inverse), we conclude that sufficient statistics for \((\hat{g}, \hat{h})\) are feature operators. Within the feature operator \(\Phi\), the \(i\) th row \(\langle \phi(X_i), \cdot \rangle\) may be viewed as an infinite dimensional vector.

Nyström approximation is a way to approximate infinite dimensional vectors with finite dimensional ones. It uses the substitution \(\phi(x) \mapsto \check{\phi}(x) = (K_{\mathcal{S} \mathcal{S}})^{-\frac{1}{2}} K_{\mathcal{S} x}\), where \(\mathcal{S}\) is a subset of \(s = |\mathcal{S}| \ll n\) observations called landmarks. \(K_{\mathcal{S} \mathcal{S}} \in \mathbb{R}^{s \times s}\) is defined such that \((K_{\mathcal{S} \mathcal{S}})_{ij} = k(X_i, X_j)\) for \(i, j \in \mathcal{S}\). Similarly, \(K_{\mathcal{S} x} \in \mathbb{R}^s\) is defined such that \((K_{\mathcal{S} x})_i = k(X_i, x)\) for \(i \in \mathcal{S}\).

In summary, the approximate sufficient statistics are of the form \(\check{\Phi} \in \mathbb{R}^{n \times s}\), i.e. a matrix whose \(i\) th row \(\langle \check{\phi}(X_i), \cdot \rangle\) may be viewed as a vector in \(\mathbb{R}^s\).
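The substitution \(\phi(x) \mapsto \check{\phi}(x)\) can be sketched in NumPy as below. This is a minimal illustration, not the package's `ApproxRKHSIV` code path: the Gaussian kernel, uniform sampling of landmarks, the eigenvalue threshold used when forming \((K_{\mathcal{S}\mathcal{S}})^{-1/2}\), and all function names are assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """Gaussian kernel matrix with entries exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def nystrom_features(X, landmarks, gamma=0.5):
    """Matrix whose i-th row is phi_check(X_i)' = (K_SS^{-1/2} K_{S X_i})'."""
    K_SS = rbf_kernel(landmarks, landmarks, gamma)
    eigvals, eigvecs = np.linalg.eigh(K_SS)
    keep = eigvals > 1e-10 * eigvals.max()     # guard against tiny eigenvalues
    inv_sqrt = (eigvecs[:, keep] / np.sqrt(eigvals[keep])) @ eigvecs[:, keep].T
    return rbf_kernel(X, landmarks, gamma) @ inv_sqrt

rng = np.random.default_rng(0)
n, s = 200, 15
X = rng.normal(size=(n, 2))
landmark_idx = rng.choice(n, size=s, replace=False)   # uniform sampling: an assumption
Phi_check = nystrom_features(X, X[landmark_idx])      # shape (n, s)

# Low-rank surrogate for the n x n kernel matrix, exact on the landmark block.
K_approx = Phi_check @ Phi_check.T

# The projection P = Phi Phi^+ only needs the n x s feature matrix.
P_approx = Phi_check @ np.linalg.pinv(Phi_check, rcond=1e-10)
```

Working with `Phi_check` replaces \(n \times n\) decompositions by operations whose cost is driven by \(s\), which is the point of the approximation.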

Closed form - Estimator 3 (RKHS norm)

We study the RKHS-norm regularized joint estimator:

\[\begin{split}(\hat{g}, \hat{h}) = \arg \min_{g \in \mathcal{G}, h \in \mathcal{H}} \max_{f' \in \mathcal{F}} \mathbb{E}_n \left[ 2 \left\{ g(A) - Y \right\} f'(C') - f'(C')^2 \right] -\lambda'\|f'\|_{\mathcal{F'}}^2 + \mu' \| g \|_{\mathcal{G}}^2 \\ \quad + \max_{f \in \mathcal{F}} \mathbb{E}_n \left[ 2 \left\{ h(B) - g(A) \right\} f(C) - f(C)^2 \right] -\lambda\|f\|_{\mathcal{F}}^2 + \mu \| h \|_{\mathcal{H}}^2\end{split}\]

Formula of minimizers

The minimizer takes the form \(\hat{g} = \Phi_A^*\hat\alpha\), \(\hat{h} = \Phi_B^*\hat\beta\) where,

\[\begin{split}\hat{\beta} &= \left[ K_A \left\{ - P_C + \left(P_{C'} K_A + P_C K_A + \mu'\right) \left( K_B P_C K_A \right)^{\dagger} \left( K_B P_C + \mu \right)\right\} K_B \right]^{\dagger} K_A P_{C'} Y \\ \hat{\alpha} &= \left( K_B P_C K_A \right)^{\dagger} \left( K_B P_C + \mu \right) K_B \hat{\beta}\end{split}\]

and

\[\begin{split}P_C &= \left(K_C+\lambda\right)^{\dagger}K_C \\ P_{C'} &= \left(K_{C'}+\lambda'\right)^{\dagger}K_{C'}\end{split}\]
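The matrices \(P_C\) and \(P_{C'}\) defined here are ridge smoothers rather than projections: their eigenvalues are \(\sigma_i / (\sigma_i + \lambda)\), and they approach the projection \(K_C^{\dagger} K_C\) as \(\lambda \to 0\). The sketch below illustrates this with hypothetical finite-dimensional features. The names are illustrative, and \(K_C + \lambda\) is read as \(K_C + \lambda I\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 40, 3
features_C = rng.normal(size=(n, r))     # hypothetical finite-dimensional features
K_C = features_C @ features_C.T

def ridge_smoother(K, lam):
    """P = (K + lam I)^{-1} K; for lam > 0 the pseudo-inverse is an ordinary inverse."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), K)

P_ridge = ridge_smoother(K_C, lam=1.0)

# Projection used when there is no RKHS-norm penalty on the test function.
P_proj = np.linalg.pinv(K_C, rcond=1e-10) @ K_C

# Eigenvalues of P_ridge are sigma_i / (sigma_i + lam), hence they lie in [0, 1).
shrinkage = np.linalg.eigvalsh((P_ridge + P_ridge.T) / 2)

# With a very small penalty the smoother is close to the projection.
P_small_lam = ridge_smoother(K_C, lam=1e-6)
```

Larger values of \(\lambda\) shrink all directions more strongly, which is how the penalty on \(f\) enters the closed form for \((\hat{\alpha}, \hat{\beta})\).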

Implementation note: RKHS2IV and RKHS2IVCV implement this alternate simultaneous estimator, and ApproxRKHS2IV / ApproxRKHS2IVCV are the corresponding low-rank feature approximations. This branch is distinct from Appendix J / Algorithm 2 above.

rkhs2iv.RKHS2IV([kernel, gamma, degree, ...])

Nested RKHS IV estimator.

rkhs2iv.RKHS2IVCV([kernel, gamma, degree, ...])

Cross-validated RKHS2IV estimator.

rkhs2iv.ApproxRKHS2IV([kernel_approx, ...])

Approximate alternate simultaneous RKHS estimator using finite kernel features.

rkhs2iv.ApproxRKHS2IVCV([kernel_approx, ...])

Cross-validated approximate alternate simultaneous RKHS estimator.

Remark (Subsetted estimator)

Formula of minimizers (Subsetted estimator)

The subsetted estimator satisfies:

\[\begin{split}\hat{\beta} &= \left[ K_A \left\{ - \tilde{P}_C + \left(\tilde{P}_{C'} K_A + \tilde{P}_C K_A + \mu'\right) \left( K_B \tilde{P}_C K_A \right)^{\dagger} \left( K_B \tilde{P}_C + \mu \right)\right\} K_B \right]^{\dagger} K_A \tilde{P}_{C'} Y \\ \hat{\alpha} &= \left( K_B \tilde{P}_C K_A \right)^{\dagger} \left( K_B \tilde{P}_C + \mu \right) K_B \hat{\beta}\end{split}\]

with \(\tilde{P}_{C'}=\frac{n}{p}I_{[p]}^{\top}P_{C';[p,p]}I_{[p]}\) and \(\tilde{P}_{C}=\frac{n}{q}I_{[q]}^{\top}P_{C;[q,q]}I_{[q]}\), where

\[\begin{split}P_{C';[p,p]}&=(K_{C';[p,p]}+\lambda' I_{[p]}I_{[p]}^\top)^{\dagger}K_{C';[p,p]}\;, \qquad K_{C';[p,p]}=I_{[p]}K_{C'}I_{[p]}^{\top} \\ P_{C;[q,q]}&=(K_{C;[q,q]}+\lambda I_{[q]}I_{[q]}^\top)^{\dagger}K_{C;[q,q]}\;, \qquad K_{C;[q,q]}=I_{[q]}K_{C}I_{[q]}^{\top}\end{split}\]