Universal Diagnostics API
=========================

.. currentmodule:: nnpiv.diagnostics

Assumptions
-----------

- Arrays for :math:`A, C, C'` are row-aligned and represent the same sample.
- Diagnostics are finite-dimensional proxies over selected feature maps/sieves.
- Stability conclusions should be interpreted across :math:`(J, \\eta)` paths,
  not from a single point estimate.

Notation
--------

- :math:`\\Sigma_S, \\Sigma_T`: empirical Gram operators induced by
  conditional means of featureized :math:`A`.
- :math:`\\eta`: stabilization magnitude used in generalized-eigenvalue
  computations.
- :math:`\\kappa`: restricted relative condition number for pre-estimation
  conditioning.
- :math:`\\kappa_{\\mathrm{eff}}`: post-estimation condition number on the
  realized first-stage error direction.

Diagnostic A: Relative Well-posedness
-------------------------------------

Even if full :math:`L_2` verification is hard, one can check relative
well-posedness on the finite function class used in simulation or estimation.
Choose a feature map

.. math::

   b(A)=\{b_1(A),\ldots,b_J(A)\}^{\top}

and restrict to functions

.. math::

   u(A)=\theta^{\top}b(A), \qquad \theta \in \mathbb{R}^J.

Define

.. math::

   m_S(C') := E\{b(A)\mid C'\}, \qquad
   m_T(C) := E\{b(A)\mid C\}.

Then the induced quadratic forms are

.. math::

   \|Su\|_2^2=\theta^{\top}\Sigma_S\theta,
   \qquad
   \Sigma_S := E\!\left[m_S(C')m_S(C')^{\top}\right],

and

.. math::

   \|T_gu\|_2^2=\theta^{\top}\Sigma_T\theta,
   \qquad
   \Sigma_T := E\!\left[m_T(C)m_T(C)^{\top}\right].

The finite-dimensional bounded relative well-posedness condition is

.. math::

   \Sigma_T \preceq \kappa_J^2 \Sigma_S.

Equivalently, the sharp constant is the generalized eigenvalue bound

.. math::

   \kappa_J^2
   =
   \lambda_{\max}\!\left(
   \Sigma_S^{\dagger/2}\Sigma_T\Sigma_S^{\dagger/2}
   \right)
   =
   \sup_{\theta:\,\theta^{\top}\Sigma_S\theta>0}
   \frac{\theta^{\top}\Sigma_T\theta}{\theta^{\top}\Sigma_S\theta}.

If there exists :math:`\theta` with

.. math::

   \theta^{\top}\Sigma_S\theta=0
   \quad\text{but}\quad
   \theta^{\top}\Sigma_T\theta>0,

then relative well-posedness fails on that feature span.

In practice, use a small ridge :math:`\eta>0` for numerical stability. The
default implementation uses the feature Gram stabilization
:math:`\widehat{\Sigma}_I`:

.. math::

   \kappa_{J,\eta}^2
   = \lambda_{\max}\!\left[(\widehat{\Sigma}_S+\eta \widehat{\Sigma}_I)^{-1/2}
   \widehat{\Sigma}_T(\widehat{\Sigma}_S+\eta \widehat{\Sigma}_I)^{-1/2}\right].

where

.. math::

   \widehat{\Sigma}_I
   =
   \frac{1}{N}\sum_{i=1}^{N} b(A_i)b(A_i)^\top.

Set ``eta_mode='identity'`` to recover the older :math:`\eta I` stabilization.

Why :math:`\Sigma_I` vs :math:`I` and how to choose :math:`\eta`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The matrix :math:`\Sigma_I` is the feature-space second moment:

.. math::

   \Sigma_I = E[b(A)b(A)^\top],
   \qquad
   \widehat{\Sigma}_I = \frac{1}{N}\sum_{i=1}^N b(A_i)b(A_i)^\top.

It differs from identity-ridge in an important way:

* ``eta_mode='identity'`` regularizes by :math:`\eta\|\theta\|_2^2`;
* ``eta_mode='sigma_i'`` regularizes by
  :math:`\eta\,\theta^\top\Sigma_I\theta = \eta\,E[(\theta^\top b(A))^2]`,
  i.e., by function magnitude under :math:`P_A`.

So ``sigma_i`` is often more geometry-aware and less sensitive to arbitrary
feature scaling/rotation. In a basis close to orthonormal under :math:`P_A`,
the two choices are similar.

Finite-sample caveat:

* if :math:`\widehat{\Sigma}_I` has very small eigenvalues, then
  :math:`\eta\widehat{\Sigma}_I` may provide weak regularization in some
  directions relative to :math:`\eta I`;
* this can inflate :math:`\widehat{\kappa}_{J,\eta}` at very small ``eta``.

Practical ``eta`` guidance:

* to reduce extreme finite-sample :math:`\kappa`, increase ``eta``;
* to approximate the unregularized generalized-eigenvalue target, decrease
  ``eta``;
* always inspect an ``eta``-path (and preferably a ``J``-by-``eta`` grid)
  instead of a single value.

Unregularized limit:

* yes, conceptually ``eta=0`` corresponds to no ridge regularization
  (pseudo-inverse formulation);
* in finite samples this can be numerically unstable or effectively singular,
  so the implementation requires ``eta > 0`` for stability.

This is the primary pre-estimation simulation diagnostic.

Key output fields for failure-mode detection:

* ``nullspace_violation_flag``: ``True`` means the diagnostic found directions
  where :math:`\widehat{\Sigma}_S` is (near) null but
  :math:`\widehat{\Sigma}_T` still has signal. This is the empirical failure
  pattern for relative well-posedness on the tested feature span.
* ``nullspace_leakage_sigma_t_on_null_sigma_s`` (``nullspace_leakage`` in
  notebook tables): how much :math:`\widehat{\Sigma}_T` ``leaks`` into the
  near-null space of :math:`\widehat{\Sigma}_S` (larger is worse). Formally,
  it is the largest eigenvalue of
  :math:`U_0^\top \widehat{\Sigma}_T U_0`, where :math:`U_0` spans the
  near-null eigenspace of :math:`\widehat{\Sigma}_S`.
* ``stabilization_dominance_ratio``: rough size of ridge relative to
  :math:`\widehat{\Sigma}_S`,
  :math:`\|\eta R\|_{op}/\|\widehat{\Sigma}_S\|_{op}` with
  :math:`R=\widehat{\Sigma}_I` or :math:`I`. Large values mean regularization
  may be driving behavior more than the raw operator pair.
* ``max_diag_ratio_sigma_t_over_sigma_s`` (``max_diag_ratio`` in notebook
  tables): maximum coordinatewise ratio
  :math:`\mathrm{diag}(\widehat{\Sigma}_T) / \mathrm{diag}(\widehat{\Sigma}_S)`.
  This is a quick worst-direction proxy; large values suggest weak
  conditioning.

How to read :math:`\kappa`
^^^^^^^^^^^^^^^^^^^^^^^^^^

The number :math:`\kappa_{J,\eta}` is a restricted relative condition number.
Ignoring the ridge for a moment,

.. math::

   \kappa_J
   =
   \sup_{u \in \operatorname{span}(b):\,\|Su\|_2>0}
   \frac{\|T_g u\|_2}{\|Su\|_2}.

Thus :math:`\kappa_J^2` is the worst-case ratio of squared signal seen by
:math:`T_g` relative to squared signal seen by :math:`S` on the chosen feature
span.

Interpretation:

* :math:`\kappa_J \approx 1`: the two conditional-mean operators have similar
  strength on the feature span;
* :math:`\kappa_J < 1`: :math:`S` sees these directions at least as strongly as
  :math:`T_g`;
* large :math:`\kappa_J`: there are directions in the feature span that are much
  more visible to :math:`T_g` than to :math:`S`, so simultaneous estimation can
  have weak finite-sample curvature;
* divergent :math:`\kappa_J` as the feature span grows: evidence against
  bounded relative well-posedness for the full function class.

Gaussian linear-basis example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suppose :math:`(A,C,C')` are mean-zero jointly Gaussian and use the linear basis
:math:`b(A)=A`. Then

.. math::

   E[A\mid C]=\Sigma_{AC}\Sigma_{CC}^{-1}C,
   \qquad
   E[A\mid C']=\Sigma_{AC'}\Sigma_{C'C'}^{-1}C'.

Therefore

.. math::

   \Sigma_T
   =
   \Sigma_{AC}\Sigma_{CC}^{-1}\Sigma_{CA},
   \qquad
   \Sigma_S
   =
   \Sigma_{AC'}\Sigma_{C'C'}^{-1}\Sigma_{C'A}.

The diagnostic is the largest generalized eigenvalue of this pair. In the
scalar case with correlations
:math:`\rho_T=\operatorname{Corr}(A,C)` and
:math:`\rho_S=\operatorname{Corr}(A,C')`,

.. math::

   \Sigma_T=\operatorname{Var}(A)\rho_T^2,
   \qquad
   \Sigma_S=\operatorname{Var}(A)\rho_S^2,
   \qquad
   \kappa_J=\left|\frac{\rho_T}{\rho_S}\right|.

So for Gaussian data with a linear basis, :math:`\kappa` is literally the ratio
of how predictive :math:`C` is for :math:`A` relative to how predictive
:math:`C'` is for :math:`A`. If :math:`\rho_S=0` but :math:`\rho_T\ne0`, the
relative condition fails on this span.

For Gaussian scalar variables and an orthogonal Hermite basis, the degree
:math:`k` component scales like :math:`\rho^k`. Heuristically, if the sieve
contains degrees up to :math:`J`, then high-order components can behave like

.. math::

   \max_{1\leq k\leq J}
   \left|\frac{\rho_T}{\rho_S}\right|^k.

This explains why a DGP can look benign for linear features but deteriorate
quickly once nonlinear directions are added.

How :math:`\Sigma_S,\Sigma_T` are estimated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the DGP is known, use a large auxiliary Monte Carlo sample.

1. Choose features :math:`b(A)`.

   Use estimator-aligned features whenever possible: polynomial/Hermite bases
   (transparent), random Fourier features for RKHS, or learned
   representations/last-layer features for neural nets. A simple first pass is
   polynomial features in :math:`A`, including directions relevant for
   :math:`g_0(A)` (for example, cubic directions when :math:`g_0(A)=A_1^3`).

2. Estimate conditional means.

   Estimate :math:`m_S(C')=E[b(A)\mid C']` and
   :math:`m_T(C)=E[b(A)\mid C]` with flexible regressions on the auxiliary
   sample.

   If desired, nested Monte Carlo can be used:

   .. math::

      \widehat{m}_S(c')
      =
      \frac{1}{M}\sum_{m=1}^{M} b\!\left(A_{c'}^{(m)}\right),

   with draws :math:`A_{c'}^{(m)}\sim \mathcal{L}(A\mid C'=c')`, and similarly
   for :math:`\widehat{m}_T(c)`.

3. Build Gram matrices.

   .. math::

      \widehat{\Sigma}_S
      =
      \frac{1}{N}\sum_{i=1}^{N}
      \widehat{m}_S(C_i')\widehat{m}_S(C_i')^{\top},
      \qquad
      \widehat{\Sigma}_T
      =
      \frac{1}{N}\sum_{i=1}^{N}
      \widehat{m}_T(C_i)\widehat{m}_T(C_i)^{\top}.

4. Compute :math:`\kappa_{J,\eta}`.

   .. math::

      \kappa_{J,\eta}^2
      = \lambda_{\max}\!\left[(\widehat{\Sigma}_S+\eta\widehat{\Sigma}_I)^{-1/2}
      \widehat{\Sigma}_T(\widehat{\Sigma}_S+\eta\widehat{\Sigma}_I)^{-1/2}\right].

Interpretation of scale and stability
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. math::

   \text{diagnostic target: } \kappa_{J,\eta} \text{ modest and stable as } J \uparrow,\, \eta \downarrow.

Implementation defaults:

* Feature map: ``'rff'`` (alternatives: ``'polynomial'``, callable, or precomputed matrix).
* Conditional-mean learner: ridge regression.
* Stabilization: ``eta=1e-6``.

Practical interpretation:

* modest and stable :math:`\kappa_{J,\eta}` across feature sizes supports a DGP favorable to joint estimation;
* large values suggest weak finite-sample curvature for simultaneous estimation;
* instability as :math:`J` grows or :math:`\eta \downarrow 0` suggests possible span-level failure of relative well-posedness.

.. autosummary::
   :toctree: _autosummary
   :template: function.rst

   relative_wellposedness_diagnostic
   relative_wellposedness_from_data
   relative_wellposedness_from_nested_npiv

Growing-sieve diagnostic (J and eta paths)
------------------------------------------

Because :math:`\kappa_{J,\eta}` depends on the chosen feature span, a single
fixed basis is only a local check. A more informative diagnostic computes the
same quantity over a growing sieve:

.. math::

   J_1 < J_2 < \cdots < J_K,
   \qquad
   \kappa_{J_1,\eta},\kappa_{J_2,\eta},\ldots,\kappa_{J_K,\eta}.

Stable, modest values as :math:`J` grows support relative well-posedness on the
explored sieve. Rapid growth means the increasingly rich feature class contains
directions that :math:`S` struggles to see relative to :math:`T_g`.

For random Fourier features, the sieve grid controls ``n_features``. For
polynomial features, it controls ``poly_degree``.

.. code-block:: python

   from nnpiv.diagnostics import relative_wellposedness_sieve_from_nested_npiv

   sieve = relative_wellposedness_sieve_from_nested_npiv(
       A=A,
       D=D,
       B=B,
       C=C,
       feature_map="polynomial",
       sieve_grid=[1, 2, 3, 4],
       eta_grid=[1e-4, 1e-6, 1e-8],
   )

   rows = sieve["rows"]
   summary = sieve["summary"]

Each row contains ``sieve_value`` (J), ``eta``, and ``kappa``. This supports:

* :math:`J \mapsto \widehat{\kappa}_{J,\eta}` for fixed :math:`\eta`;
* :math:`\eta \mapsto \widehat{\kappa}_{J,\eta}` for fixed :math:`J`.

The returned rows also include ``kappa_cummax`` so monotonic-envelope paths are
easy to plot.
Because finite-sample conditional-mean regressions are re-estimated at each
``J``, the raw empirical path may show small local non-monotonicity; the
cummax path is the finite-sample monotone envelope.

.. autosummary::
   :toctree: _autosummary
   :template: function.rst

   relative_wellposedness_sieve_diagnostic
   relative_wellposedness_sieve_from_data
   relative_wellposedness_sieve_from_nested_npiv

Post-estimation error-direction diagnostic (kappa_eff)
------------------------------------------------------

After estimating the first stage, let :math:`e_g=\widehat g-g_0`. The
restricted empirical diagnostic is

.. math::

   \kappa_{\mathrm{eff}}
   =
   \frac{\|T_g e_g\|_2}{\|S e_g\|_2}.

In finite dimensions, ``relative_wellposedness_effective_diagnostic`` projects
:math:`e_g` onto the same feature span and computes this ratio with the same
estimated :math:`\widehat\Sigma_S,\widehat\Sigma_T`.

.. autosummary::
   :toctree: _autosummary
   :template: function.rst

   relative_wellposedness_effective_diagnostic
   relative_wellposedness_effective_from_data
   relative_wellposedness_effective_from_nested_npiv

Post-estimation sieve path (kappa_eff by J and eta)
---------------------------------------------------

.. code-block:: python

   from nnpiv.diagnostics import (
       relative_wellposedness_sieve_from_nested_npiv,
       relative_wellposedness_effective_sieve_from_nested_npiv,
   )

   pre = relative_wellposedness_sieve_from_nested_npiv(
       A=A, D=D, B=B, C=C,
       feature_map="rff",
       sieve_grid=[50, 100, 200, 400],
       eta_grid=[1e-4, 1e-6],
       random_state=123,
   )

   # e_g must be provided post-estimation (same row count as A).
   post = relative_wellposedness_effective_sieve_from_nested_npiv(
       A=A, D=D, B=B, C=C,
       e_g=e_g,
       feature_map="rff",
       sieve_grid=[50, 100, 200, 400],
       eta_grid=[1e-4, 1e-6],
       random_state=123,
   )

   pre_rows = pre["rows"]   # includes kappa, kappa_cummax
   post_rows = post["rows"] # includes kappa_eff, kappa_eff_cummax

Plotting the pre/post paths is then straightforward:

.. code-block:: python

   import pandas as pd
   import matplotlib.pyplot as plt

   pre_df = pd.DataFrame(pre_rows)
   post_df = pd.DataFrame(post_rows)

   fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
   for eta, g in pre_df.groupby("eta"):
       g = g.sort_values("sieve_value")
       ax[0].plot(g["sieve_value"], g["kappa"], marker="o", label=f"eta={eta:g}")
   ax[0].set_title("Pre: kappa_J,eta")
   ax[0].set_xlabel("J")
   ax[0].set_ylabel("kappa")
   ax[0].legend()

   for eta, g in post_df.groupby("eta"):
       g = g.sort_values("sieve_value")
       ax[1].plot(g["sieve_value"], g["kappa_eff"], marker="o", label=f"eta={eta:g}")
   ax[1].set_title("Post: kappa_eff,J,eta")
   ax[1].set_xlabel("J")
   ax[1].set_ylabel("kappa_eff")
   ax[1].legend()
   plt.tight_layout()

.. autosummary::
   :toctree: _autosummary
   :template: function.rst

   relative_wellposedness_effective_sieve_diagnostic
   relative_wellposedness_effective_sieve_from_data
   relative_wellposedness_effective_sieve_from_nested_npiv

Availability and workflow
-------------------------

* Diagnostic A is the default pre-estimation check whenever nested inputs
  :math:`(A, C, C')` are available.
* Simulation runners can enable A via ``diagnostics_opts.enabled=True`` and
  write CSV/JSON artifacts alongside Monte Carlo outputs.

Plug-and-play usage with dataset blocks
---------------------------------------

For minimal notebook usage, call the wrapper with dataset-level block selectors:

.. code-block:: python

   from nnpiv.diagnostics import relative_wellposedness_from_data

   # Example for nested NPIV data tuple -> dict
   data = {"A": A, "B": B, "C": C, "C_prime": D}
   out = relative_wellposedness_from_data(
       data,
       A="A",
       B="B",  # optional, accepted for (A,B,C,C') interface consistency
       C="C",
       C_prime="C_prime",
       feature_map="rff",
       n_features=300,
       eta=1e-6,
       random_state=123,
   )

   print(out["kappa"], out["kappa2"])

The same API supports:

* ``data`` as pandas ``DataFrame`` with ``A/C/C_prime`` as column names or lists;
* ``data`` as a mapping of arrays;
* ``data`` as a 2D array with integer/slice column selectors.
* Optional ``B`` selector can be passed but is not used by Diagnostic A.

Canonical nested NPIV shortcut
------------------------------

For the common simulation/notebook layout
``A, D, B, C, Y, tau_fn = dgps.get_data(...)``, use:

.. code-block:: python

   from nnpiv.diagnostics import relative_wellposedness_from_nested_npiv

   out = relative_wellposedness_from_nested_npiv(
       A=A,
       D=D,   # treated as C'
       B=B,   # accepted for interface consistency
       C=C,
       feature_map="rff",
       n_features=300,
       eta=1e-6,
       random_state=123,
   )

   print(out["kappa"], out["kappa2"])