Installation & Replication
A Python package for nested nonparametric instrumental variable (NPIV) estimation with RKHS methods, neural networks (AGMM/AGMM2), linear/ensemble baselines, and DML-based semiparametric procedures. The repository also contains scripts to reproduce all simulation tables and empirical figures.
- Package source: nnpiv/
- Simulation drivers: simulations/
- Notebooks (usage & empirical replications): local_notebooks/
1. Installation
The project is PEP 517/518 compliant (pyproject.toml).
1.1. Create and activate an environment
# From repository root
python3.14 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
For cluster jobs, the Slurm runner continues to use the following environment:
module load python/3.13
mamba activate nnpiv_venv
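Before installing anything, it can help to confirm which environment the interpreter is actually running in. The following is a minimal sketch using only the standard library; the function name `active_environment` is illustrative and not part of the repository.

```python
import os
import sys


def active_environment() -> str:
    """Return a short label for the kind of environment Python is running in."""
    # venv/virtualenv: the interpreter prefix differs from the base installation.
    if sys.prefix != sys.base_prefix:
        return "venv"
    # conda/mamba export CONDA_PREFIX when an environment is activated.
    if os.environ.get("CONDA_PREFIX"):
        return "conda"
    return "system"


version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {version} ({active_environment()})")
```

A result of "system" suggests that neither the venv nor the mamba environment is active in the current shell.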
1.2. Install dependencies
# Base requirements (CPU-only friendly, Python 3.14)
pip install -r requirements.txt
# (Optional) If you are on a cluster, you can also use the cluster pin file
# pip install -r requirements_cluster.txt
If you want GPU acceleration for PyTorch, install the wheel that matches your CUDA runtime from PyTorch’s index.
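To check whether the installed PyTorch build can see a GPU, a short probe such as the one below may be useful. It is a sketch, not part of the package: `torch_device` is a hypothetical helper, and it reports "missing" instead of failing when PyTorch is not installed.

```python
import importlib.util


def torch_device() -> str:
    """Return 'cuda' if PyTorch sees a GPU, 'cpu' if not, 'missing' without PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch  # imported lazily so the check also works without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(torch_device())
```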
1.3. Install the package
# From the repository root
pip install -e .
This installs nnpiv in editable mode for development and replication.
Alternatively, you can use the following command (deprecated):
python setup.py develop
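After either install command, one way to confirm that the package is visible to the active interpreter is to look it up on the import path without importing it. This is a small sketch; `is_importable` is an illustrative helper name.

```python
import importlib.util


def is_importable(package: str) -> bool:
    """Return True if `package` can be located on the current import path."""
    return importlib.util.find_spec(package) is not None


print("nnpiv importable:", is_importable("nnpiv"))
```

If this prints False, the install most likely went into a different environment than the one currently active.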
2. What’s in the box?
- Core estimators (nnpiv): RKHS (exact & Nyström-approximate), AGMM/AGMM2, linear & ensemble baselines, and semiparametric DML engines (long-term + mediated variants).
- Simulations (simulations/): nonparametric experiments (Table 1) and semiparametric coverage experiments (Table 2), with config files to switch DGP/estimators and Slurm/local runners.
- Notebooks (local_notebooks/): usage examples and replication of empirical figures (Project STAR; Job Corps).
3. Quick start (library)
Example: long-term effects via DML + RKHS.
import numpy as np
from sklearn.linear_model import LogisticRegression
from nnpiv.rkhs import ApproxRKHSIVCV
from nnpiv.semiparametrics import DML_longterm
# Toy shapes: Y,D,S,G are (n,1)
Y, D, S, G = [np.random.randn(1000,1) for _ in range(4)]
m1 = ApproxRKHSIVCV(kernel_approx='nystrom', n_components=400,
                    kernel='rbf', gamma=.001, delta_scale='auto',
                    delta_exp=.4, alpha_scales=np.geomspace(1, 10000, 10), cv=10)
m2 = ApproxRKHSIVCV(kernel_approx='nystrom', n_components=400,
                    kernel='rbf', gamma=.001, delta_scale='auto',
                    delta_exp=.4, alpha_scales=np.geomspace(1, 10000, 10), cv=10)
dml = DML_longterm(Y, D, S, G,
                   longterm_model='latent_unconfounded',
                   model1=[m1, m2],
                   n_folds=5, n_rep=1, CHIM=False,
                   prop_score=LogisticRegression(max_iter=2000))
theta, var, ci = dml.dml()
print(theta, var, ci)
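The toy draws in the example make all four arrays continuous. If D is a treatment indicator and G a sample-membership indicator (an assumption here, suggested by the use of a classifier for `prop_score` but not stated in this README), binary draws may be a more representative input. The sketch below builds such arrays with NumPy only; the data-generating process and coefficients are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the toy data are reproducible
n = 1000

# Assumption: D (treatment) and G (sample membership) are 0/1 indicators.
D = rng.integers(0, 2, size=(n, 1))
G = rng.integers(0, 2, size=(n, 1))

# The short-term outcome S and long-term outcome Y stay continuous.
S = 0.5 * D + rng.normal(size=(n, 1))
Y = S + rng.normal(size=(n, 1))

print(Y.shape, D.shape, S.shape, G.shape)
```

These arrays keep the (n, 1) shapes noted in the example, so they could be passed in place of the `np.random.randn` draws.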
4. Reproducing the simulations
4.1. Folder layout
simulations/
- config_*.py: configuration files (DGP, estimators, seeds, output paths)
- run_simulations_local.sh: canonical local execution script
- run_simulations.sbatch: canonical Slurm execution script
- submit_simulations.sh: submission helper with resource profiles
- sweep_np.py / sweep_sp.py: experiment drivers
- ./nonparametric_fit/: results
- ./semiparametric_cov/: results
4.2. Nonparametric simulations (Table 1)
Run locally:
cd simulations
./run_simulations_local.sh --config config_np_benchmark
Smoke test locally:
cd simulations
./run_simulations_local.sh --config config_np_benchmark --smoke-test
Run on Slurm:
cd simulations
./submit_simulations.sh --profile sapphire --config config_np_benchmark
(Config input can be config_x, config_x.py, or a path like
simulations/config_x.py.)
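The three accepted spellings all refer to the same config. As a rough illustration of how such inputs can be reduced to one name, here is a sketch; `config_module_name` is hypothetical, and the runner scripts may resolve configs differently.

```python
from pathlib import Path


def config_module_name(config: str) -> str:
    """Map 'config_x', 'config_x.py', or 'simulations/config_x.py' to 'config_x'."""
    name = Path(config).name  # drop any leading directories
    if name.endswith(".py"):
        name = name[: -len(".py")]  # drop the file extension
    return name


for spelling in ("config_x", "config_x.py", "simulations/config_x.py"):
    print(spelling, "->", config_module_name(spelling))
```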
4.3. Semiparametric coverage simulations (Table 2)
Run locally:
cd simulations
./run_simulations_local.sh --config config_sp_benchmark
Smoke test locally:
cd simulations
./run_simulations_local.sh --config config_sp_benchmark --smoke-test
Run on Slurm:
cd simulations
./submit_simulations.sh --profile sapphire --config config_sp_benchmark
4.4. Unified Slurm options
Run all configs as a Slurm array:
cd simulations
./submit_simulations.sh --profile test --all-configs --smoke-test
Run all configs locally:
cd simulations
./run_simulations_local.sh --all-configs --smoke-test
Track per-config runtime locally (printed summary + CSV log):
cd simulations
./run_simulations_local.sh --all-configs --smoke-test --timing-log ./timings_smoke.csv
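The CSV timing log can be inspected with the standard library. The sketch below assumes a header row with columns named `config` and `seconds`; these names are hypothetical, since the log format is not documented here, so adjust them to the actual header.

```python
import csv
import io


def read_timing_log(handle) -> list[dict[str, str]]:
    """Read a CSV timing log into a list of row dictionaries keyed by header."""
    return list(csv.DictReader(handle))


# Hypothetical log contents; the real header names and values may differ.
sample = io.StringIO(
    "config,seconds\n"
    "config_np_benchmark,12.5\n"
    "config_sp_benchmark,30.0\n"
)
rows = read_timing_log(sample)
slowest = max(rows, key=lambda row: float(row["seconds"]))
print(slowest["config"])
```

For a real file, pass an open handle instead of the in-memory sample, for example `open("timings_smoke.csv", newline="")`.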
Override replication count and seed:
cd simulations
./submit_simulations.sh --profile shared --config config_np_nn --n-experiments 50 --seed 999
4.5. Notes on parallelism & threads
To avoid oversubscription with joblib/NumPy/BLAS/OpenMP, we cap native threads to 1. The Slurm scripts already export:
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
Internally, the Python drivers also limit native thread pools to one thread via threadpoolctl when appropriate.
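When running the drivers or notebooks outside the provided scripts, the same caps can be applied from Python. The following is a sketch under the assumption that it runs before NumPy or SciPy are imported; it is not the drivers' actual code. threadpoolctl is treated as optional.

```python
import os

THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)

# Set the caps before importing NumPy/SciPy: most BLAS/OpenMP runtimes read
# these variables once at load time. setdefault keeps any value that the
# shell has already exported.
for var in THREAD_VARS:
    os.environ.setdefault(var, "1")

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # threadpoolctl is optional in this sketch
    threadpool_limits = None

if threadpool_limits is not None:
    with threadpool_limits(limits=1):
        pass  # run BLAS-heavy code here; native pools use one thread
```

Capping native threads matters most when joblib already runs one worker process per core, since each worker would otherwise start its own BLAS thread pool.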
5. Empirical replications (notebooks)
Replication notebooks are located in local_notebooks/:
STAR long-term outcomes — reproduces paper figures (RKHS + NN).
Job Corps mediation — DML(mediated) with neural nets and RKHS.
Note
The repository includes paths that expect CSV files under data/.
Some data may not be included in the repository for license reasons.
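Before opening a notebook, a quick check of what is actually present under data/ can save a failed run. This is a minimal sketch; `available_csvs` is an illustrative helper, and the file names the notebooks expect are not listed here.

```python
from pathlib import Path


def available_csvs(data_dir: str = "data") -> list[str]:
    """Return the sorted names of CSV files found directly under `data_dir`."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.csv"))


print(available_csvs() or "no CSV files found under data/")
```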
6. Repository structure
NNPIV/
├─ nnpiv/ # package
├─ simulations/ # simulation configs + runners
│ ├─ run_simulations_local.sh # Local execution runner
│ ├─ run_simulations.sbatch # Slurm execution runner
│ ├─ submit_simulations.sh # Slurm submission helper
│ ├─ sweep_np.py # driver (NP)
│ ├─ sweep_sp.py # driver (SP)
│ └─ config_*.py # experiment configs
├─ local_notebooks/ # usage + empirical replications
├─ data/ # (data; not always distributed)
├─ output/ # results (created on run)
├─ pyproject.toml
├─ requirements.txt
└─ README.rst
7. Citing
If you use this package, please cite the associated paper and code artifact:
Meza, I., & Singh, R. (2025). Nested Nonparametric Instrumental Variable Regression.
https://doi.org/10.48550/arXiv.2112.14249
8. License
MIT License (see LICENSE.txt).