Regularized Kernel Hilbert Space
In this section we assume that the function classes whenever \(\mathcal{G}\), \(\mathcal{H}\), \(\mathcal{F}\), \(\mathcal{F}^\prime\) are RKHS. Let \(\Phi_A:\mathcal{G}\rightarrow\mathbb{R}^n\) be an operator with \(i\) th row \(\langle \phi(A_i), \cdot \rangle_{\mathcal{G}}\) with corresponding kernel matrix \(K_A\). Define analogously \(\Phi_B, \ldots\) for the rest of the function classes.
Closed form - Estimator 1
We study the estimator
Formula of minimizers
The minimizer takes the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) where,
|
RKHS IV estimator. |
|
RKHS IV estimator with cross-validation. |
Remark (Nystrom approximation) A low-rank approximation using Nystrom method is also implemented.
|
Approximate RKHS IV estimator using kernel approximations. |
|
Approximate RKHS IV estimator with cross-validation using kernel approximations. |
Closed form - Estimator 2
We study the estimator
Formula of minimizers
The minimizer takes the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) where,
|
RKHS IV estimator with L2 regularization. |
|
RKHS IV estimator with L2 regularization and cross-validation. |
Closed form - Estimator 3
We study the ridge regularized joint estimator:
Let \(V_{g,h}' = g(A) - Y\) and \(V_{g,h} = h(B) - g(A)\). Let \(\Phi_C : \mathcal{F} \rightarrow \mathbb{R}^n\) be an operator with \(i\) th row \(\langle \phi(C_i), \cdot \rangle_{\mathcal{F}}\). Define \(\Phi_{C'}\) analogously, replacing \(C_i\) with \(C_i'\). Let \(K_C\) and \(K_{C'}\) be the corresponding kernel matrices.
In remarks below, we also study the following modification, which we call the “subsetted” estimator:
where \([p]\) and \([q]\) partition \([n] = (1, \ldots, n)\), so \(p + q = n\).
For the index set \([p]\), let \(I_{[p]} \in \mathbb{R}^{p \times n}\) be the matrix of ones and zeros such that \(V_{[p]} = I_{[p]} V\) gives the elements of \(V\) whose indices are in \([p]\).
Maximizers
Existence of maximizers
There exist coefficients \(\hat{\gamma}_{g,h}, \hat{\gamma}'_{g,h} \in \mathbb{R}^n\) such that maximizers take the form \(\hat{f}_{g,h} = \Phi_C^* \hat{\gamma}_{g,h}\) and \(\hat{f}'_{g,h} = \Phi_{C'}^* \hat{\gamma}'_{g,h}\).
Remark (Subsetted estimator)
For the subsetted estimator, the same results hold but with \(\hat{\gamma}_{g,h;[q]} \in \mathbb{R}^q\) and \(\hat{\gamma}'_{g,h;[p]} \in \mathbb{R}^p\), acting on appropriately modified feature operators \(\Phi^*_{C;[q]}\) and \(\Phi^*_{C';[p]}\).
Proof
Write the objectives for the maximizers as
We prove the former result; the latter is similar. By the Riesz representation theorem,
For an RKHS, evaluation is a continuous functional represented as the inner product with the feature map. Due to the ridge penalty, the stated objective has a maximizer \(\hat{f}_{g,h}\) that obtains the maximum.
To lighten notation, we suppress the indexing of \(\hat{f}_{g,h}\) by \((g,h)\) for the rest of this argument. Write \(\hat{f} = \hat{f}_n + \hat{f}^{\perp}_n\) where \(\hat{f}_n \in \text{row}(\Phi_C)\) and \(\hat{f}_n^{\perp} \in \text{null}(\Phi_C)\). Substituting this decomposition of \(\hat{f}\) into the objective, we see that
Hence if \(\hat{f}\) is a maximizer, then there exists \(\hat{f}_n\) that is also a maximizer.
Formula of maximizers
The explicit formula for the coefficients is \(\hat{\gamma}_{g,h} = K_C^{\dagger} \vec{V}_{g,h}\) and \(\hat{\gamma}'_{g,h} = K_{C'}^{\dagger} \vec{V}'_{g,h}\).
Remark (Subsetted estimator)
For the subsetted estimator, the same results hold but with \(\hat{\gamma}_{g,h;[q]} = K_{C;[q,q]}^{\dagger} \vec{V}_{g,h;[q]}\) and \(\hat{\gamma}'_{g,h;[p]} = K_{C';[p,p]}^{\dagger} \vec{V}'_{g,h;[p]}\).
Proof
We prove the former result; the latter is similar. Write the objective as
where \(\hat{\mu}_{g,h} = \mathbb{E}_n \{ V_{g,h} \phi(C) \} = \frac{1}{n} \Phi_C^* \vec{V}_{g,h}\) and \(\hat{T}_C = \mathbb{E}_n \{ \phi(C) \otimes \phi(C)^* \} = \frac{1}{n} \Phi_C^* \Phi_C\). Hence by the existence of maximizers,
Since \(K_C = \Phi_C \Phi_C^*\), the first order condition yields \(K_C \vec{V}_{g,h} = K_C^2 \hat{\gamma}_{g,h}\), i.e. \(\hat{\gamma}_{g,h} = K_C^{\dagger} \vec{V}_{g,h}\) where \(K_C^{\dagger}\) is the pseudoinverse of \(K_C\).
Minimizers
Let \(\Phi_A : \mathcal{H} \rightarrow \mathbb{R}^n\) be an operator with \(i\) th row \(\langle \phi(A_i), \cdot \rangle_{\mathcal{H}}\). Define \(\Phi_B\) analogously, replacing \(A_i\) with \(B_i\). Let \(K_A\) and \(K_B\) be the corresponding kernel matrices.
Existence of minimizers
There exist coefficients \(\alpha, \beta \in \mathbb{R}^n\) such that minimizers take the form \(\hat{g} = \Phi_A^* \hat{\alpha}\) and \(\hat{h} = \Phi_B^* \hat{\beta}\).
Remark (Subsetted estimator)
The result remains true for the subsetted estimator.
Proof
To begin, write the objective \(\mathcal{E}(g,h)\) as
By the existence and formula of maximizers,
Hence \((g,h)\) only appear via \(V'_{g,h} = g(A) - Y\), \(V_{g,h} = h(B) - g(A)\), and directly as \(g(A)\) and \(h(B)\). In all of these expressions, they can be further expressed as \(g(A) = \langle g, \phi(A) \rangle_{\mathcal{G}}\) and \(h(B) = \langle h, \phi(B) \rangle_{\mathcal{H}}\), which is a linear functional. The overall objective is quadratic in such terms, so the stated objective has maximizers \((\hat{g}, \hat{h})\) that obtain the maximum.
By a similar argument to the existence of maximizers, for any \((\hat{g}, \hat{h})\) attaining the maximum, \(\mathcal{E}(\hat{g}, \hat{h}) = \mathcal{E}(\hat{g}_n, \hat{h}_n)\) where \(\hat{g}_n \in \text{row}(\Phi_A)\) and \(\hat{h}_n \in \text{row}(\Phi_B)\).
Properties of pseudo-inverse
For any square symmetric matrix \(K \in \mathbb{R}^{n \times n}\), its eigendecomposition is \(K = U \Sigma U^{\top}\) where \(\Sigma \in \mathbb{R}^{r \times r}\) and \(r \leq n\). Its pseudo-inverse is \(K^- = U \Sigma^{\dagger} U^{\top}\). Moreover, \(K^{\dagger} K = KK^{\dagger} = UU^{\top}\), which is a projection.
To lighten notation, let \(K_C^{\dagger} K_C = P_C\).
Formula of minimizers
The explicit formula for the coefficients is
|
Nested RKHS IV estimator with L2 regularization. |
|
Nested RKHS IV estimator with L2 regularization and cross-validation. |
Proof
We proceed in steps.
Write the objective \(\mathcal{E}(g,h)\) as
where
and the covariance operators are defined analogously to the formula of maximizers. Hence,
Let \(Y, G, H \in \mathbb{R}^n\) be defined with \(G_i = g(A_i)\) and \(H_i = h(B_i)\). In this notation,
Combining with \(G = \Phi_A g = K_A \alpha\) and \(H = \Phi_B h = K_B \beta\) from the existence of minimizers,
The first order conditions yield
Substituting the latter into the former,
and solving for \(\hat{\beta}\),
Remark (Subsetted estimator)
Formula of minimizers (Subsetted estimator)
The explicit formula for the coefficients is
where \(\tilde{P}_{C'} = \frac{n}{p} I_{[p]}^{\top} P_{C';[p,p]} I_{[p]}\) and \(\tilde{P}_C = \frac{n}{q} I_{[q]}^{\top} P_{C;[q,q]} I_{[q]}\). Note that \(P_{C';[p,p]} = (K_{C';[p,p]})^- K_{C';[p,p]}\) and \(K_{C';[p,p]} = I_{[p]} K_{C'} I_{[p]}^{\top}\).
Proof
We proceed in steps.
Write the objective \(\mathcal{E}(g,h)\) as
where
and the covariance operators are defined analogously to the subsetted estimator. Hence,
Let \(Y, G, H \in \mathbb{R}^n\) be defined with \(G_i = g(A_i)\) and \(H_i = h(B_i)\) as before. Now, let \(\tilde{P}_{C'} = \frac{n}{p} I_{[p]}^{\top} P_{C';[p,p]} I_{[p]}\) and \(\tilde{P}_C = \frac{n}{q} I_{[q]}^{\top} P_{C;[q,q]} I_{[q]}\). Then
Hereafter we use the same argument as in the formula of minimizers.
Nyström approximation
Computation of kernel methods may be demanding due to the inversions of matrices that scale with \(n\) such as \(K_B \in \mathbb{R}^{n \times n}\). One solution is Nyström approximation. We now provide alternative expressions for the minimizers \((\hat{g}, \hat{h})\) that lend themselves to Nyström approximation, then describe the procedure.
Minimizer sufficient statistics
The minimizers may be expressed as
Proof
We proceed in steps.
By the proof of the Formula of minimizers of Estimator 3, with \(G = \Phi_A g\) and \(H = \Phi_B h\),
Informally, the first order conditions yield
See De Vito and Caponnetto (2005) (Proof of Proposition 2) for the formal way of deriving the first order condition, which incurs additional notation.
Rearranging and taking pseudo-inverses, we arrive at two equations:
\[\Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \hat{g} = \Phi_A^* (P_{C'} Y + P_C \Phi_B \hat{h}),\]\[\Phi_B^* P_C \Phi_A \hat{g} = \Phi_B^* (P_C + \mu) \Phi_B \hat{h} \Longrightarrow \hat{g} = \left(\Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h}.\]
Substituting the latter into the former,
\[\Phi_A^* P_{C'} Y + \Phi_A^* P_C \Phi_B \hat{h} = \Phi_A^* (P_{C'} + P_C + \mu') \Phi_A \left(\Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* (P_C + \mu) \Phi_B \hat{h},\]and solving for \(\hat{h}\),
\[\hat{h} = \left[ \Phi_A^* \left\{ -P_C + \left( P_{C'} + P_C + \mu' \right) \Phi_A \left( \Phi_B^* P_C \Phi_A \right)^{\dagger} \Phi_B^* \left( P_C + \mu \right) \right\} \Phi_B \right]^{\dagger} \Phi_A^* P_{C'} Y.\]
Remark (Nyström subsetted estimator)
Formula of minimizers (Subsetted estimator)
The subsetted minimizers may be expressed as
Proof
The argument is analogous to the Remark of the properties of pseudo-inverse above.
Properties of pseudo-inverse
Continuing the notation of the (Properties of pseudo-inverse), if \(\Phi = U \Sigma^{1/2} V^{\top}\) and \(K = \Phi \Phi^*\), then \(P = UU^{\top} = K^{\dagger} K = \Phi \Phi^{\dagger}\).
Combining (Minimizer sufficient statistics) and (Properties of pseudo-inverse), we conclude that sufficient statistics for \((\hat{g}, \hat{h})\) are feature operators. Within the feature operator \(\Phi\), the \(i\) th row \(\langle \phi(X_i), \cdot \rangle\) may be viewed as an infinite dimensional vector.
Nyström approximation is a way to approximate infinite dimensional vectors with finite dimensional ones. It uses the substitution \(\phi(x) \mapsto \check{\phi}(x) = (K_{\mathcal{S} \mathcal{S}})^{-\frac{1}{2}} K_{\mathcal{S} x}\), where \(\mathcal{S}\) is a subset of \(s = |\mathcal{S}| \ll n\) observations called landmarks. \(K_{\mathcal{S} \mathcal{S}} \in \mathbb{R}^{s \times s}\) is defined such that \((K_{\mathcal{S} \mathcal{S}})_{ij} = k(X_i, X_j)\) for \(i, j \in \mathcal{S}\). Similarly, \(K_{\mathcal{S} x} \in \mathbb{R}^s\) is defined such that \((K_{\mathcal{S} x})_i = k(X_i, x)\) for \(i \in \mathcal{S}\).
In summary, the approximate sufficient statistics are of the form \(\check{\Phi} \in \mathbb{R}^{n \times s}\), i.e. a matrix whose \(i\) th row \(\langle \check{\phi}(X_i), \cdot \rangle\) may be viewed as a vector in \(\mathbb{R}^s\).
Closed form - Estimator 3 (RKHS norm)
We study the RKHS-norm regularized joint estimator:
Formula of minimizers
The minimizer takes the form \(\hat{g} = \Phi_A^*\hat\alpha\), \(\hat{h} = \Phi_B^*\hat\beta\) where,
and
|
Nested RKHS IV estimator. |
|
Nested RKHS IV estimator with cross-validation. |
Remark: Subsetted estimator
Formula of minimizers (Subsetted estimator)
The subsetted estimator satisfies:
with \(\tilde{P}_{C'}=\frac{n}{p}I_{[p]}^{\top}P_{C';[p,p]}I_{[p]}\) and \(\tilde{P}_{C}=\frac{n}{q}I_{[q]}^{\top}P_{C;[q,q]}I_{[q]}\). And