Universal Diagnostics API

Assumptions

Arrays for \(A, C, C'\) are row-aligned and represent the same sample.
Diagnostics are finite-dimensional proxies over selected feature maps/sieves.
Stability conclusions should be interpreted across \((J, \\eta)\) paths, not from a single point estimate.

Notation

\(\\Sigma_S, \\Sigma_T\): empirical Gram operators induced by conditional means of featureized \(A\).
\(\\eta\): stabilization magnitude used in generalized-eigenvalue computations.
\(\\kappa\): restricted relative condition number for pre-estimation conditioning.
\(\\kappa_{\\mathrm{eff}}\): post-estimation condition number on the realized first-stage error direction.

Diagnostic A: Relative Well-posedness

Even if full \(L_2\) verification is hard, one can check relative well-posedness on the finite function class used in simulation or estimation. Choose a feature map

\[b(A)=\{b_1(A),\ldots,b_J(A)\}^{\top}\]

and restrict to functions

\[u(A)=\theta^{\top}b(A), \qquad \theta \in \mathbb{R}^J.\]

Define

\[m_S(C') := E\{b(A)\mid C'\}, \qquad m_T(C) := E\{b(A)\mid C\}.\]

Then the induced quadratic forms are

\[\|Su\|_2^2=\theta^{\top}\Sigma_S\theta, \qquad \Sigma_S := E\!\left[m_S(C')m_S(C')^{\top}\right],\]

and

\[\|T_gu\|_2^2=\theta^{\top}\Sigma_T\theta, \qquad \Sigma_T := E\!\left[m_T(C)m_T(C)^{\top}\right].\]

The finite-dimensional bounded relative well-posedness condition is

\[\Sigma_T \preceq \kappa_J^2 \Sigma_S.\]

Equivalently, the sharp constant is the generalized eigenvalue bound

\[\kappa_J^2 = \lambda_{\max}\!\left( \Sigma_S^{\dagger/2}\Sigma_T\Sigma_S^{\dagger/2} \right) = \sup_{\theta:\,\theta^{\top}\Sigma_S\theta>0} \frac{\theta^{\top}\Sigma_T\theta}{\theta^{\top}\Sigma_S\theta}.\]

If there exists \(\theta\) with

\[\theta^{\top}\Sigma_S\theta=0 \quad\text{but}\quad \theta^{\top}\Sigma_T\theta>0,\]

then relative well-posedness fails on that feature span.

In practice, use a small ridge \(\eta>0\) for numerical stability. The default implementation uses the feature Gram stabilization \(\widehat{\Sigma}_I\):

\[\kappa_{J,\eta}^2 = \lambda_{\max}\!\left[(\widehat{\Sigma}_S+\eta \widehat{\Sigma}_I)^{-1/2} \widehat{\Sigma}_T(\widehat{\Sigma}_S+\eta \widehat{\Sigma}_I)^{-1/2}\right].\]

where

\[\widehat{\Sigma}_I = \frac{1}{N}\sum_{i=1}^{N} b(A_i)b(A_i)^\top.\]

Set eta_mode='identity' to recover the older \(\eta I\) stabilization.

Why \(\Sigma_I\) vs \(I\) and how to choose \(\eta\)

The matrix \(\Sigma_I\) is the feature-space second moment:

\[\Sigma_I = E[b(A)b(A)^\top], \qquad \widehat{\Sigma}_I = \frac{1}{N}\sum_{i=1}^N b(A_i)b(A_i)^\top.\]

It differs from identity-ridge in an important way:

eta_mode='identity' regularizes by \(\eta\|\theta\|_2^2\);
eta_mode='sigma_i' regularizes by \(\eta\,\theta^\top\Sigma_I\theta = \eta\,E[(\theta^\top b(A))^2]\), i.e., by function magnitude under \(P_A\).

So sigma_i is often more geometry-aware and less sensitive to arbitrary feature scaling/rotation. In a basis close to orthonormal under \(P_A\), the two choices are similar.

Finite-sample caveat:

if \(\widehat{\Sigma}_I\) has very small eigenvalues, then \(\eta\widehat{\Sigma}_I\) may provide weak regularization in some directions relative to \(\eta I\);
this can inflate \(\widehat{\kappa}_{J,\eta}\) at very small eta.

Practical eta guidance:

to reduce extreme finite-sample \(\kappa\), increase eta;
to approximate the unregularized generalized-eigenvalue target, decrease eta;
always inspect an eta-path (and preferably a J-by-eta grid) instead of a single value.

Unregularized limit:

yes, conceptually eta=0 corresponds to no ridge regularization (pseudo-inverse formulation);
in finite samples this can be numerically unstable or effectively singular, so the implementation requires eta > 0 for stability.

This is the primary pre-estimation simulation diagnostic.

Key output fields for failure-mode detection:

nullspace_violation_flag: True means the diagnostic found directions where \(\widehat{\Sigma}_S\) is (near) null but \(\widehat{\Sigma}_T\) still has signal. This is the empirical failure pattern for relative well-posedness on the tested feature span.
nullspace_leakage_sigma_t_on_null_sigma_s (nullspace_leakage in notebook tables): how much \(\widehat{\Sigma}_T\) leaks into the near-null space of \(\widehat{\Sigma}_S\) (larger is worse). Formally, it is the largest eigenvalue of \(U_0^\top \widehat{\Sigma}_T U_0\), where \(U_0\) spans the near-null eigenspace of \(\widehat{\Sigma}_S\).
stabilization_dominance_ratio: rough size of ridge relative to \(\widehat{\Sigma}_S\), \(\|\eta R\|_{op}/\|\widehat{\Sigma}_S\|_{op}\) with \(R=\widehat{\Sigma}_I\) or \(I\). Large values mean regularization may be driving behavior more than the raw operator pair.
max_diag_ratio_sigma_t_over_sigma_s (max_diag_ratio in notebook tables): maximum coordinatewise ratio \(\mathrm{diag}(\widehat{\Sigma}_T) / \mathrm{diag}(\widehat{\Sigma}_S)\). This is a quick worst-direction proxy; large values suggest weak conditioning.

How to read \(\kappa\)

The number \(\kappa_{J,\eta}\) is a restricted relative condition number. Ignoring the ridge for a moment,

\[\kappa_J = \sup_{u \in \operatorname{span}(b):\,\|Su\|_2>0} \frac{\|T_g u\|_2}{\|Su\|_2}.\]

Thus \(\kappa_J^2\) is the worst-case ratio of squared signal seen by \(T_g\) relative to squared signal seen by \(S\) on the chosen feature span.

Interpretation:

\(\kappa_J \approx 1\): the two conditional-mean operators have similar strength on the feature span;
\(\kappa_J < 1\): \(S\) sees these directions at least as strongly as \(T_g\);
large \(\kappa_J\): there are directions in the feature span that are much more visible to \(T_g\) than to \(S\), so simultaneous estimation can have weak finite-sample curvature;
divergent \(\kappa_J\) as the feature span grows: evidence against bounded relative well-posedness for the full function class.

Gaussian linear-basis example

Suppose \((A,C,C')\) are mean-zero jointly Gaussian and use the linear basis \(b(A)=A\). Then

\[E[A\mid C]=\Sigma_{AC}\Sigma_{CC}^{-1}C, \qquad E[A\mid C']=\Sigma_{AC'}\Sigma_{C'C'}^{-1}C'.\]

Therefore

\[\Sigma_T = \Sigma_{AC}\Sigma_{CC}^{-1}\Sigma_{CA}, \qquad \Sigma_S = \Sigma_{AC'}\Sigma_{C'C'}^{-1}\Sigma_{C'A}.\]

The diagnostic is the largest generalized eigenvalue of this pair. In the scalar case with correlations \(\rho_T=\operatorname{Corr}(A,C)\) and \(\rho_S=\operatorname{Corr}(A,C')\),

\[\Sigma_T=\operatorname{Var}(A)\rho_T^2, \qquad \Sigma_S=\operatorname{Var}(A)\rho_S^2, \qquad \kappa_J=\left|\frac{\rho_T}{\rho_S}\right|.\]

So for Gaussian data with a linear basis, \(\kappa\) is literally the ratio of how predictive \(C\) is for \(A\) relative to how predictive \(C'\) is for \(A\). If \(\rho_S=0\) but \(\rho_T\ne0\), the relative condition fails on this span.

For Gaussian scalar variables and an orthogonal Hermite basis, the degree \(k\) component scales like \(\rho^k\). Heuristically, if the sieve contains degrees up to \(J\), then high-order components can behave like

\[\max_{1\leq k\leq J} \left|\frac{\rho_T}{\rho_S}\right|^k.\]

This explains why a DGP can look benign for linear features but deteriorate quickly once nonlinear directions are added.

How \(\Sigma_S,\Sigma_T\) are estimated

When the DGP is known, use a large auxiliary Monte Carlo sample.

Choose features \(b(A)\).

Use estimator-aligned features whenever possible: polynomial/Hermite bases (transparent), random Fourier features for RKHS, or learned representations/last-layer features for neural nets. A simple first pass is polynomial features in \(A\), including directions relevant for \(g_0(A)\) (for example, cubic directions when \(g_0(A)=A_1^3\)).
Estimate conditional means.

Estimate \(m_S(C')=E[b(A)\mid C']\) and \(m_T(C)=E[b(A)\mid C]\) with flexible regressions on the auxiliary sample.

If desired, nested Monte Carlo can be used:

\[\widehat{m}_S(c') = \frac{1}{M}\sum_{m=1}^{M} b\!\left(A_{c'}^{(m)}\right),\]

with draws \(A_{c'}^{(m)}\sim \mathcal{L}(A\mid C'=c')\), and similarly for \(\widehat{m}_T(c)\).
Build Gram matrices.

\[\widehat{\Sigma}_S = \frac{1}{N}\sum_{i=1}^{N} \widehat{m}_S(C_i')\widehat{m}_S(C_i')^{\top}, \qquad \widehat{\Sigma}_T = \frac{1}{N}\sum_{i=1}^{N} \widehat{m}_T(C_i)\widehat{m}_T(C_i)^{\top}.\]
Compute \(\kappa_{J,\eta}\).

\[\kappa_{J,\eta}^2 = \lambda_{\max}\!\left[(\widehat{\Sigma}_S+\eta\widehat{\Sigma}_I)^{-1/2} \widehat{\Sigma}_T(\widehat{\Sigma}_S+\eta\widehat{\Sigma}_I)^{-1/2}\right].\]

Interpretation of scale and stability

\[\text{diagnostic target: } \kappa_{J,\eta} \text{ modest and stable as } J \uparrow,\, \eta \downarrow.\]

Implementation defaults:

Feature map: 'rff' (alternatives: 'polynomial', callable, or precomputed matrix).
Conditional-mean learner: ridge regression.
Stabilization: eta=1e-6.

Practical interpretation:

modest and stable \(\kappa_{J,\eta}\) across feature sizes supports a DGP favorable to joint estimation;
large values suggest weak finite-sample curvature for simultaneous estimation;
instability as \(J\) grows or \(\eta \downarrow 0\) suggests possible span-level failure of relative well-posedness.

`relative_wellposedness_diagnostic`(A, C, ...)	Compute the finite-dimensional relative well-posedness diagnostic `kappa`.
`relative_wellposedness_from_data`(data, *, A)	Run the pre-estimation diagnostic from dataset-level selectors.
`relative_wellposedness_from_nested_npiv`(A, ...)	Run the pre-estimation diagnostic for canonical nested NPIV blocks.

Growing-sieve diagnostic (J and eta paths)

Because \(\kappa_{J,\eta}\) depends on the chosen feature span, a single fixed basis is only a local check. A more informative diagnostic computes the same quantity over a growing sieve:

\[J_1 < J_2 < \cdots < J_K, \qquad \kappa_{J_1,\eta},\kappa_{J_2,\eta},\ldots,\kappa_{J_K,\eta}.\]

Stable, modest values as \(J\) grows support relative well-posedness on the explored sieve. Rapid growth means the increasingly rich feature class contains directions that \(S\) struggles to see relative to \(T_g\).

For random Fourier features, the sieve grid controls n_features. For polynomial features, it controls poly_degree.

from nnpiv.diagnostics import relative_wellposedness_sieve_from_nested_npiv

sieve = relative_wellposedness_sieve_from_nested_npiv(
    A=A,
    D=D,
    B=B,
    C=C,
    feature_map="polynomial",
    sieve_grid=[1, 2, 3, 4],
    eta_grid=[1e-4, 1e-6, 1e-8],
)

rows = sieve["rows"]
summary = sieve["summary"]

Each row contains sieve_value (J), eta, and kappa. This supports:

\(J \mapsto \widehat{\kappa}_{J,\eta}\) for fixed \(\eta\);
\(\eta \mapsto \widehat{\kappa}_{J,\eta}\) for fixed \(J\).

The returned rows also include kappa_cummax so monotonic-envelope paths are easy to plot. Because finite-sample conditional-mean regressions are re-estimated at each J, the raw empirical path may show small local non-monotonicity; the cummax path is the finite-sample monotone envelope.

`relative_wellposedness_sieve_diagnostic`(A, ...)	Run the pre-estimation diagnostic over a sieve and eta path.
`relative_wellposedness_sieve_from_data`(data, ...)	Run the sieve-path pre-estimation diagnostic from dataset selectors.
`relative_wellposedness_sieve_from_nested_npiv`(A, ...)	Run the pre-estimation sieve-path diagnostic for canonical nested NPIV.

Post-estimation error-direction diagnostic (kappa_eff)

After estimating the first stage, let \(e_g=\widehat g-g_0\). The restricted empirical diagnostic is

\[\kappa_{\mathrm{eff}} = \frac{\|T_g e_g\|_2}{\|S e_g\|_2}.\]

In finite dimensions, relative_wellposedness_effective_diagnostic projects \(e_g\) onto the same feature span and computes this ratio with the same estimated \(\widehat\Sigma_S,\widehat\Sigma_T\).

`relative_wellposedness_effective_diagnostic`(A, ...)	Compute the post-estimation effective-direction diagnostic `kappa_eff`.
`relative_wellposedness_effective_from_data`(...)	Run the post-estimation `kappa_eff` diagnostic from dataset selectors.
`relative_wellposedness_effective_from_nested_npiv`(A, ...)	Run post-estimation `kappa_eff` for canonical nested NPIV blocks.

Post-estimation sieve path (kappa_eff by J and eta)

from nnpiv.diagnostics import (
    relative_wellposedness_sieve_from_nested_npiv,
    relative_wellposedness_effective_sieve_from_nested_npiv,
)

pre = relative_wellposedness_sieve_from_nested_npiv(
    A=A, D=D, B=B, C=C,
    feature_map="rff",
    sieve_grid=[50, 100, 200, 400],
    eta_grid=[1e-4, 1e-6],
    random_state=123,
)

# e_g must be provided post-estimation (same row count as A).
post = relative_wellposedness_effective_sieve_from_nested_npiv(
    A=A, D=D, B=B, C=C,
    e_g=e_g,
    feature_map="rff",
    sieve_grid=[50, 100, 200, 400],
    eta_grid=[1e-4, 1e-6],
    random_state=123,
)

pre_rows = pre["rows"]   # includes kappa, kappa_cummax
post_rows = post["rows"] # includes kappa_eff, kappa_eff_cummax

Plotting the pre/post paths is then straightforward:

import pandas as pd
import matplotlib.pyplot as plt

pre_df = pd.DataFrame(pre_rows)
post_df = pd.DataFrame(post_rows)

fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
for eta, g in pre_df.groupby("eta"):
    g = g.sort_values("sieve_value")
    ax[0].plot(g["sieve_value"], g["kappa"], marker="o", label=f"eta={eta:g}")
ax[0].set_title("Pre: kappa_J,eta")
ax[0].set_xlabel("J")
ax[0].set_ylabel("kappa")
ax[0].legend()

for eta, g in post_df.groupby("eta"):
    g = g.sort_values("sieve_value")
    ax[1].plot(g["sieve_value"], g["kappa_eff"], marker="o", label=f"eta={eta:g}")
ax[1].set_title("Post: kappa_eff,J,eta")
ax[1].set_xlabel("J")
ax[1].set_ylabel("kappa_eff")
ax[1].legend()
plt.tight_layout()

`relative_wellposedness_effective_sieve_diagnostic`(A, ...)	Run `kappa_eff` over a sieve and eta path.
`relative_wellposedness_effective_sieve_from_data`(...)	Run the sieve-path post-estimation `kappa_eff` diagnostic from selectors.
`relative_wellposedness_effective_sieve_from_nested_npiv`(A, ...)	Run the post-estimation sieve-path `kappa_eff` diagnostic for nested NPIV.

Availability and workflow

Diagnostic A is the default pre-estimation check whenever nested inputs \((A, C, C')\) are available.
Simulation runners can enable A via diagnostics_opts.enabled=True and write CSV/JSON artifacts alongside Monte Carlo outputs.

Plug-and-play usage with dataset blocks

For minimal notebook usage, call the wrapper with dataset-level block selectors:

from nnpiv.diagnostics import relative_wellposedness_from_data

# Example for nested NPIV data tuple -> dict
data = {"A": A, "B": B, "C": C, "C_prime": D}
out = relative_wellposedness_from_data(
    data,
    A="A",
    B="B",  # optional, accepted for (A,B,C,C') interface consistency
    C="C",
    C_prime="C_prime",
    feature_map="rff",
    n_features=300,
    eta=1e-6,
    random_state=123,
)

print(out["kappa"], out["kappa2"])

The same API supports:

data as pandas DataFrame with A/C/C_prime as column names or lists;
data as a mapping of arrays;
data as a 2D array with integer/slice column selectors.
Optional B selector can be passed but is not used by Diagnostic A.

Canonical nested NPIV shortcut

For the common simulation/notebook layout A, D, B, C, Y, tau_fn = dgps.get_data(...), use:

from nnpiv.diagnostics import relative_wellposedness_from_nested_npiv

out = relative_wellposedness_from_nested_npiv(
    A=A,
    D=D,   # treated as C'
    B=B,   # accepted for interface consistency
    C=C,
    feature_map="rff",
    n_features=300,
    eta=1e-6,
    random_state=123,
)

print(out["kappa"], out["kappa2"])