Universal Diagnostics API
Assumptions
Arrays for \(A, C, C'\) are row-aligned and represent the same sample.
Diagnostics are finite-dimensional proxies over selected feature maps/sieves.
Stability conclusions should be interpreted across \((J, \\eta)\) paths, not from a single point estimate.
Notation
\(\\Sigma_S, \\Sigma_T\): empirical Gram operators induced by conditional means of featureized \(A\).
\(\\eta\): stabilization magnitude used in generalized-eigenvalue computations.
\(\\kappa\): restricted relative condition number for pre-estimation conditioning.
\(\\kappa_{\\mathrm{eff}}\): post-estimation condition number on the realized first-stage error direction.
Diagnostic A: Relative Well-posedness
Even if full \(L_2\) verification is hard, one can check relative well-posedness on the finite function class used in simulation or estimation. Choose a feature map
and restrict to functions
Define
Then the induced quadratic forms are
and
The finite-dimensional bounded relative well-posedness condition is
Equivalently, the sharp constant is the generalized eigenvalue bound
If there exists \(\theta\) with
then relative well-posedness fails on that feature span.
In practice, use a small ridge \(\eta>0\) for numerical stability. The default implementation uses the feature Gram stabilization \(\widehat{\Sigma}_I\):
where
Set eta_mode='identity' to recover the older \(\eta I\) stabilization.
Why \(\Sigma_I\) vs \(I\) and how to choose \(\eta\)
The matrix \(\Sigma_I\) is the feature-space second moment:
It differs from identity-ridge in an important way:
eta_mode='identity'regularizes by \(\eta\|\theta\|_2^2\);eta_mode='sigma_i'regularizes by \(\eta\,\theta^\top\Sigma_I\theta = \eta\,E[(\theta^\top b(A))^2]\), i.e., by function magnitude under \(P_A\).
So sigma_i is often more geometry-aware and less sensitive to arbitrary
feature scaling/rotation. In a basis close to orthonormal under \(P_A\),
the two choices are similar.
Finite-sample caveat:
if \(\widehat{\Sigma}_I\) has very small eigenvalues, then \(\eta\widehat{\Sigma}_I\) may provide weak regularization in some directions relative to \(\eta I\);
this can inflate \(\widehat{\kappa}_{J,\eta}\) at very small
eta.
Practical eta guidance:
to reduce extreme finite-sample \(\kappa\), increase
eta;to approximate the unregularized generalized-eigenvalue target, decrease
eta;always inspect an
eta-path (and preferably aJ-by-etagrid) instead of a single value.
Unregularized limit:
yes, conceptually
eta=0corresponds to no ridge regularization (pseudo-inverse formulation);in finite samples this can be numerically unstable or effectively singular, so the implementation requires
eta > 0for stability.
This is the primary pre-estimation simulation diagnostic.
Key output fields for failure-mode detection:
nullspace_violation_flag:Truemeans the diagnostic found directions where \(\widehat{\Sigma}_S\) is (near) null but \(\widehat{\Sigma}_T\) still has signal. This is the empirical failure pattern for relative well-posedness on the tested feature span.nullspace_leakage_sigma_t_on_null_sigma_s(nullspace_leakagein notebook tables): how much \(\widehat{\Sigma}_T\)leaksinto the near-null space of \(\widehat{\Sigma}_S\) (larger is worse). Formally, it is the largest eigenvalue of \(U_0^\top \widehat{\Sigma}_T U_0\), where \(U_0\) spans the near-null eigenspace of \(\widehat{\Sigma}_S\).stabilization_dominance_ratio: rough size of ridge relative to \(\widehat{\Sigma}_S\), \(\|\eta R\|_{op}/\|\widehat{\Sigma}_S\|_{op}\) with \(R=\widehat{\Sigma}_I\) or \(I\). Large values mean regularization may be driving behavior more than the raw operator pair.max_diag_ratio_sigma_t_over_sigma_s(max_diag_ratioin notebook tables): maximum coordinatewise ratio \(\mathrm{diag}(\widehat{\Sigma}_T) / \mathrm{diag}(\widehat{\Sigma}_S)\). This is a quick worst-direction proxy; large values suggest weak conditioning.
How to read \(\kappa\)
The number \(\kappa_{J,\eta}\) is a restricted relative condition number. Ignoring the ridge for a moment,
Thus \(\kappa_J^2\) is the worst-case ratio of squared signal seen by \(T_g\) relative to squared signal seen by \(S\) on the chosen feature span.
Interpretation:
\(\kappa_J \approx 1\): the two conditional-mean operators have similar strength on the feature span;
\(\kappa_J < 1\): \(S\) sees these directions at least as strongly as \(T_g\);
large \(\kappa_J\): there are directions in the feature span that are much more visible to \(T_g\) than to \(S\), so simultaneous estimation can have weak finite-sample curvature;
divergent \(\kappa_J\) as the feature span grows: evidence against bounded relative well-posedness for the full function class.
Gaussian linear-basis example
Suppose \((A,C,C')\) are mean-zero jointly Gaussian and use the linear basis \(b(A)=A\). Then
Therefore
The diagnostic is the largest generalized eigenvalue of this pair. In the scalar case with correlations \(\rho_T=\operatorname{Corr}(A,C)\) and \(\rho_S=\operatorname{Corr}(A,C')\),
So for Gaussian data with a linear basis, \(\kappa\) is literally the ratio of how predictive \(C\) is for \(A\) relative to how predictive \(C'\) is for \(A\). If \(\rho_S=0\) but \(\rho_T\ne0\), the relative condition fails on this span.
For Gaussian scalar variables and an orthogonal Hermite basis, the degree \(k\) component scales like \(\rho^k\). Heuristically, if the sieve contains degrees up to \(J\), then high-order components can behave like
This explains why a DGP can look benign for linear features but deteriorate quickly once nonlinear directions are added.
How \(\Sigma_S,\Sigma_T\) are estimated
When the DGP is known, use a large auxiliary Monte Carlo sample.
Choose features \(b(A)\).
Use estimator-aligned features whenever possible: polynomial/Hermite bases (transparent), random Fourier features for RKHS, or learned representations/last-layer features for neural nets. A simple first pass is polynomial features in \(A\), including directions relevant for \(g_0(A)\) (for example, cubic directions when \(g_0(A)=A_1^3\)).
Estimate conditional means.
Estimate \(m_S(C')=E[b(A)\mid C']\) and \(m_T(C)=E[b(A)\mid C]\) with flexible regressions on the auxiliary sample.
If desired, nested Monte Carlo can be used:
\[\widehat{m}_S(c') = \frac{1}{M}\sum_{m=1}^{M} b\!\left(A_{c'}^{(m)}\right),\]with draws \(A_{c'}^{(m)}\sim \mathcal{L}(A\mid C'=c')\), and similarly for \(\widehat{m}_T(c)\).
Build Gram matrices.
\[\widehat{\Sigma}_S = \frac{1}{N}\sum_{i=1}^{N} \widehat{m}_S(C_i')\widehat{m}_S(C_i')^{\top}, \qquad \widehat{\Sigma}_T = \frac{1}{N}\sum_{i=1}^{N} \widehat{m}_T(C_i)\widehat{m}_T(C_i)^{\top}.\]Compute \(\kappa_{J,\eta}\).
\[\kappa_{J,\eta}^2 = \lambda_{\max}\!\left[(\widehat{\Sigma}_S+\eta\widehat{\Sigma}_I)^{-1/2} \widehat{\Sigma}_T(\widehat{\Sigma}_S+\eta\widehat{\Sigma}_I)^{-1/2}\right].\]
Interpretation of scale and stability
Implementation defaults:
Feature map:
'rff'(alternatives:'polynomial', callable, or precomputed matrix).Conditional-mean learner: ridge regression.
Stabilization:
eta=1e-6.
Practical interpretation:
modest and stable \(\kappa_{J,\eta}\) across feature sizes supports a DGP favorable to joint estimation;
large values suggest weak finite-sample curvature for simultaneous estimation;
instability as \(J\) grows or \(\eta \downarrow 0\) suggests possible span-level failure of relative well-posedness.
|
Compute the finite-dimensional relative well-posedness diagnostic |
|
Run the pre-estimation diagnostic from dataset-level selectors. |
Run the pre-estimation diagnostic for canonical nested NPIV blocks. |
Growing-sieve diagnostic (J and eta paths)
Because \(\kappa_{J,\eta}\) depends on the chosen feature span, a single fixed basis is only a local check. A more informative diagnostic computes the same quantity over a growing sieve:
Stable, modest values as \(J\) grows support relative well-posedness on the explored sieve. Rapid growth means the increasingly rich feature class contains directions that \(S\) struggles to see relative to \(T_g\).
For random Fourier features, the sieve grid controls n_features. For
polynomial features, it controls poly_degree.
from nnpiv.diagnostics import relative_wellposedness_sieve_from_nested_npiv
sieve = relative_wellposedness_sieve_from_nested_npiv(
A=A,
D=D,
B=B,
C=C,
feature_map="polynomial",
sieve_grid=[1, 2, 3, 4],
eta_grid=[1e-4, 1e-6, 1e-8],
)
rows = sieve["rows"]
summary = sieve["summary"]
Each row contains sieve_value (J), eta, and kappa. This supports:
\(J \mapsto \widehat{\kappa}_{J,\eta}\) for fixed \(\eta\);
\(\eta \mapsto \widehat{\kappa}_{J,\eta}\) for fixed \(J\).
The returned rows also include kappa_cummax so monotonic-envelope paths are
easy to plot.
Because finite-sample conditional-mean regressions are re-estimated at each
J, the raw empirical path may show small local non-monotonicity; the
cummax path is the finite-sample monotone envelope.
Run the pre-estimation diagnostic over a sieve and eta path. |
|
|
Run the sieve-path pre-estimation diagnostic from dataset selectors. |
Run the pre-estimation sieve-path diagnostic for canonical nested NPIV. |
Post-estimation error-direction diagnostic (kappa_eff)
After estimating the first stage, let \(e_g=\widehat g-g_0\). The restricted empirical diagnostic is
In finite dimensions, relative_wellposedness_effective_diagnostic projects
\(e_g\) onto the same feature span and computes this ratio with the same
estimated \(\widehat\Sigma_S,\widehat\Sigma_T\).
Compute the post-estimation effective-direction diagnostic |
|
Run the post-estimation |
|
Run post-estimation |
Post-estimation sieve path (kappa_eff by J and eta)
from nnpiv.diagnostics import (
relative_wellposedness_sieve_from_nested_npiv,
relative_wellposedness_effective_sieve_from_nested_npiv,
)
pre = relative_wellposedness_sieve_from_nested_npiv(
A=A, D=D, B=B, C=C,
feature_map="rff",
sieve_grid=[50, 100, 200, 400],
eta_grid=[1e-4, 1e-6],
random_state=123,
)
# e_g must be provided post-estimation (same row count as A).
post = relative_wellposedness_effective_sieve_from_nested_npiv(
A=A, D=D, B=B, C=C,
e_g=e_g,
feature_map="rff",
sieve_grid=[50, 100, 200, 400],
eta_grid=[1e-4, 1e-6],
random_state=123,
)
pre_rows = pre["rows"] # includes kappa, kappa_cummax
post_rows = post["rows"] # includes kappa_eff, kappa_eff_cummax
Plotting the pre/post paths is then straightforward:
import pandas as pd
import matplotlib.pyplot as plt
pre_df = pd.DataFrame(pre_rows)
post_df = pd.DataFrame(post_rows)
fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
for eta, g in pre_df.groupby("eta"):
g = g.sort_values("sieve_value")
ax[0].plot(g["sieve_value"], g["kappa"], marker="o", label=f"eta={eta:g}")
ax[0].set_title("Pre: kappa_J,eta")
ax[0].set_xlabel("J")
ax[0].set_ylabel("kappa")
ax[0].legend()
for eta, g in post_df.groupby("eta"):
g = g.sort_values("sieve_value")
ax[1].plot(g["sieve_value"], g["kappa_eff"], marker="o", label=f"eta={eta:g}")
ax[1].set_title("Post: kappa_eff,J,eta")
ax[1].set_xlabel("J")
ax[1].set_ylabel("kappa_eff")
ax[1].legend()
plt.tight_layout()
Run |
|
Run the sieve-path post-estimation |
|
|
Run the post-estimation sieve-path |
Availability and workflow
Diagnostic A is the default pre-estimation check whenever nested inputs \((A, C, C')\) are available.
Simulation runners can enable A via
diagnostics_opts.enabled=Trueand write CSV/JSON artifacts alongside Monte Carlo outputs.
Plug-and-play usage with dataset blocks
For minimal notebook usage, call the wrapper with dataset-level block selectors:
from nnpiv.diagnostics import relative_wellposedness_from_data
# Example for nested NPIV data tuple -> dict
data = {"A": A, "B": B, "C": C, "C_prime": D}
out = relative_wellposedness_from_data(
data,
A="A",
B="B", # optional, accepted for (A,B,C,C') interface consistency
C="C",
C_prime="C_prime",
feature_map="rff",
n_features=300,
eta=1e-6,
random_state=123,
)
print(out["kappa"], out["kappa2"])
The same API supports:
dataas pandasDataFramewithA/C/C_primeas column names or lists;dataas a mapping of arrays;dataas a 2D array with integer/slice column selectors.Optional
Bselector can be passed but is not used by Diagnostic A.
Canonical nested NPIV shortcut
For the common simulation/notebook layout
A, D, B, C, Y, tau_fn = dgps.get_data(...), use:
from nnpiv.diagnostics import relative_wellposedness_from_nested_npiv
out = relative_wellposedness_from_nested_npiv(
A=A,
D=D, # treated as C'
B=B, # accepted for interface consistency
C=C,
feature_map="rff",
n_features=300,
eta=1e-6,
random_state=123,
)
print(out["kappa"], out["kappa2"])