ivcheck

Lifecycle: stable. License: MIT.

Introduction

ivcheck is an R package that tests the identifying assumptions behind instrumental variable (IV) estimation. It provides three published falsification tests as named R functions, with S3 methods for fitted fixest and ivreg models plus a one-shot wrapper that runs every applicable test in a single call.

Every applied IV paper rests on two assumptions about the instrument Z: the exclusion restriction (Z affects the outcome Y only through the endogenous treatment D) and monotonicity (no defiers). Under these assumptions plus independence, the IV estimand identifies the local average treatment effect (LATE) for compliers (Imbens and Angrist 1994). Neither assumption can be verified directly, but the methodological literature has derived testable implications for the joint distribution of (Y, D, Z): Kitagawa (2015), Mourifie and Wan (2017), and Frandsen, Lefgren, and Leslie (2023). A rejection by one of these tests is evidence that at least one of exclusion or monotonicity has failed. A non-rejection means only that no violation was detected at the chosen level; it does not establish that the instrument is valid.
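For binary D and binary Z, with Z = 1 weakly encouraging treatment, the testable implication behind Kitagawa (2015) can be written as a pair of inequalities that must hold for every Borel set B of outcome values. The notation below is a generic restatement, not a quotation from the paper:

```latex
\begin{aligned}
P(Y \in B,\ D = 1 \mid Z = 1) &\ge P(Y \in B,\ D = 1 \mid Z = 0), \\
P(Y \in B,\ D = 0 \mid Z = 0) &\ge P(Y \in B,\ D = 0 \mid Z = 1).
\end{aligned}
```

Under validity, the gap between the two sides of the first line is the mass of compliers whose treated outcome falls in B, and the gap in the second line is the corresponding mass for untreated complier outcomes. Neither can be negative, so a negative gap in the data points to a violation.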

Applied IV research has not adopted these tests widely. Most empirical IV papers still argue identification by narrative (“my instrument is random-looking because X”), and referees are increasingly frustrated with this. The limiting factor has been tooling rather than conviction: Kitagawa’s test ships as supplementary Matlab code, Mourifie-Wan relies on the Stata clrtest module, and Frandsen-Lefgren-Leslie ships a Stata SSC module called testjfe. None is in R. ivcheck closes that gap: two added lines to a fixest::feols call and you have a published falsification test ready for your paper’s appendix.

The current landscape

The R ecosystem for IV estimation is mature. fixest is the dominant package for fast fixed-effects IV estimation via feols(y ~ x | d ~ z). ivreg provides classical 2SLS with Wu-Hausman, Sargan, and weak-IV F tests. ivmodel covers k-class estimators and weak-IV-robust confidence intervals. ivDiag (Lal, Lockhart, Xu, and Zu 2024, Political Analysis) implements effective-F and Anderson-Rubin diagnostics, valid-t and local-to-zero tests, plus sensitivity analysis.

None of these packages implements the LATE-validity family of falsification tests. Applied researchers who want their IV design formally tested have had to choose between writing a one-off replication script from the original paper’s methodology section, switching to Stata for the test and back to R for the rest of the analysis, or not running the test at all. The third option has dominated.

ivcheck is the first R-native implementation of the LATE-validity family. The implementations are faithful to the published statistics: Kitagawa’s variance-weighted interval-sup Kolmogorov-Smirnov form (equation 2.1 of the paper), the full Chernozhukov-Lee-Rosen intersection-bounds inference with Andrews-Soares adaptive moment selection for Mourifie-Wan with covariates, and the asymptotic chi-squared form of Frandsen-Lefgren-Leslie with multivalued-treatment support via section 4 of the paper. All three are designed to slot into existing fixest and ivreg workflows without friction.

Installation

# Once accepted by CRAN
install.packages("ivcheck")

# Development version from GitHub
# install.packages("devtools")
devtools::install_github("charlescoverdale/ivcheck")

Quick start

library(fixest)
library(ivcheck)

# `controls` stands in for your own covariate terms
m <- feols(lwage ~ controls | educ ~ near_college, data = card1995)
iv_check(m)
#> IV validity diagnostic
#>   Kitagawa (2015):     stat = 0.01, p = 1.00, pass
#>   Mourifie-Wan (2017): stat = 0.65, p = 0.99, pass
#> Overall: cannot reject IV validity at 0.05.

Two added lines yield a falsification test that a referee may well ask about, with citation-ready output.

Walkthrough

Output lines prefixed with #> show what the console prints.

A single test on raw vectors

library(ivcheck)

set.seed(1)
n <- 500
z <- sample(0:1, n, replace = TRUE)
d <- rbinom(n, 1, 0.3 + 0.4 * z)
y <- rnorm(n, mean = d)

k <- iv_kitagawa(y, d, z, n_boot = 500)
print(k)
#>
#> -- Kitagawa (2015) -----------------------------------------------------------
#> Sample size: 500
#> Statistic: 0.04, p-value: 0.91
#> Verdict: cannot reject IV validity at 0.05

The bootstrap p-value comes from the multiplier resampling procedure of Kitagawa (2015), section 3.2. With parallel = TRUE (the default), replications run across cores on POSIX systems.
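To build intuition for what the statistic measures, here is a minimal base-R sketch of the raw sample analogue. For treated units, the share of outcomes in an interval should be at least as large under Z = 1 as under Z = 0, and the reverse holds for untreated units. The sketch is not the package's implementation: it skips the variance weighting and the bootstrap critical value, and the helper `cell_prob()` and the decile grid are illustrative choices.

```r
# Illustrative only: unweighted in-sample violations on a grid of intervals.
set.seed(1)
n <- 500
z <- sample(0:1, n, replace = TRUE)
d <- rbinom(n, 1, 0.3 + 0.4 * z)
y <- rnorm(n, mean = d)

# Sample analogue of P(Y in [lo, hi], D = dval | Z = zval)
cell_prob <- function(lo, hi, dval, zval) {
  keep <- z == zval
  mean(y[keep] >= lo & y[keep] <= hi & d[keep] == dval)
}

# Candidate interval endpoints: sample deciles of y (an arbitrary choice)
cuts <- quantile(y, probs = seq(0, 1, by = 0.1))
grid <- expand.grid(lo = cuts, hi = cuts)
grid <- grid[grid$lo < grid$hi, ]

# Positive entries mark intervals where an inequality fails in-sample
viol_d1 <- mapply(function(lo, hi) cell_prob(lo, hi, 1, 0) - cell_prob(lo, hi, 1, 1),
                  grid$lo, grid$hi)
viol_d0 <- mapply(function(lo, hi) cell_prob(lo, hi, 0, 1) - cell_prob(lo, hi, 0, 0),
                  grid$lo, grid$hi)

max_violation <- max(c(viol_d1, viol_d0))
max_violation
```

A small positive `max_violation` can arise from sampling noise alone, which is why the test compares a studentised version of this quantity against a bootstrap critical value.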

With covariates (Mourifie-Wan)

x <- rnorm(n)
mw <- iv_mw(y, d, z, x = x, n_boot = 500)
print(mw)

iv_mw() with covariates estimates F(y, d | X = x, Z = z) by cubic-polynomial series regression, computes heteroscedasticity-robust standard errors, and takes the sup of the studentised positive-part violation over a grid of (y, x) points. Critical values use adaptive moment selection with Andrews-Soares kappa_n = sqrt(log(log(n))). Without covariates it reduces exactly to the variance-weighted Kitagawa test.
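The series-regression step can be sketched in base R. The block below regresses the indicator 1{Y <= y0, D = 1} on a cubic polynomial in x separately within each instrument arm, then compares the fitted curves on a small grid. It is a rough illustration under simplifying assumptions: it covers only the D = 1 inequality, uses half-lines rather than general intervals, and omits the robust standard errors, studentisation, and critical values. The helper `series_fit()` and both grids are hypothetical.

```r
# Illustrative only: cubic series estimate of P(Y <= y0, D = 1 | X = x, Z = z).
set.seed(1)
n <- 500
z <- sample(0:1, n, replace = TRUE)
d <- rbinom(n, 1, 0.3 + 0.4 * z)
y <- rnorm(n, mean = d)
x <- rnorm(n)

series_fit <- function(y0, zval, x_grid) {
  dat <- data.frame(ind = as.numeric(y <= y0 & d == 1), x = x)[z == zval, ]
  fit <- lm(ind ~ poly(x, 3, raw = TRUE), data = dat)
  unname(predict(fit, newdata = data.frame(x = x_grid)))
}

x_grid <- seq(-1.5, 1.5, by = 0.5)
y_grid <- quantile(y, probs = c(0.25, 0.5, 0.75))

# Rows index x_grid, columns index y_grid.
# Positive entries: the Z = 0 curve exceeds the Z = 1 curve at that (y, x) point.
gap <- sapply(y_grid, function(y0) {
  series_fit(y0, 0, x_grid) - series_fit(y0, 1, x_grid)
})
max(gap)
```

Dividing each entry of `gap` by an estimate of its standard error and taking the largest positive part would give a quantity closer in spirit to the statistic described above.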

Judge designs (Frandsen-Lefgren-Leslie)

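set.seed(1)
n_jfe <- 2000
judge <- sample.int(20, n_jfe, replace = TRUE)
# Separate names keep the binary-instrument vectors (d, y) from earlier intact
d_jfe <- rbinom(n_jfe, 1, 0.3 + 0.02 * judge)
y_jfe <- rnorm(n_jfe, mean = d_jfe)

jfe <- iv_testjfe(y_jfe, d_jfe, judge, n_boot = 500)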

Designs where the instrument is a set of mutually exclusive dummies (judge, caseworker, examiner) need a purpose-built test. iv_testjfe() fits a weighted-LS regression of per-judge mu_j on per-judge p_j and tests the implied linearity via chi-squared with K - 2 degrees of freedom (default) or multiplier bootstrap (method = "bootstrap"). Multivalued treatment is supported via Frandsen-Lefgren-Leslie (2023) section 4.
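The regression of per-judge means on per-judge propensities can be sketched in base R. This is an illustration of the idea only, not the Frandsen-Lefgren-Leslie statistic: it uses caseloads as weights (an assumption made for this sketch), fits a single line, and stops short of forming the chi-squared test. All variable names are hypothetical.

```r
# Illustrative only: per-judge means and a caseload-weighted linear fit.
set.seed(1)
n_cases  <- 2000
judge_id <- sample.int(20, n_cases, replace = TRUE)
d_case   <- rbinom(n_cases, 1, 0.3 + 0.02 * judge_id)
y_case   <- rnorm(n_cases, mean = d_case)

# Per-judge treatment propensity, mean outcome, and caseload
p_j      <- as.numeric(tapply(d_case, judge_id, mean))
mu_j     <- as.numeric(tapply(y_case, judge_id, mean))
caseload <- as.numeric(table(judge_id))

# Weighted least squares of mu_j on p_j
fit <- lm(mu_j ~ p_j, weights = caseload)
coef(fit)

# Residuals from the fitted line, one per judge
round(resid(fit), 3)
```

In this simulated design the outcome depends on the judge only through treatment, so the per-judge points should scatter around a straight line. Residuals that are large relative to their sampling variability are what the test is designed to detect.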

One-shot diagnostic on a fitted model

library(fixest)

df <- data.frame(z = z, d = d, y = y, x = x)
m  <- feols(y ~ x | d ~ z, data = df)

iv_check(m, n_boot = 500)

iv_check() detects which tests are applicable from the model structure (binary versus multivalued D, discrete versus judge-style Z, presence of covariates) and runs all of them. It works identically on ivreg::ivreg() objects.

Power planning

pw <- iv_power(y, d, z, method = "kitagawa", n_sims = 200)

iv_power() simulates data under a parametric exclusion violation and reports the rejection probability over a grid of deviation sizes. It is useful when choosing between candidate tests on the same design, or when planning a minimum sample size for a study.
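The logic of a Monte Carlo power curve can be sketched in base R. Everything in this block is a stand-in: the violation is modelled as a direct effect `delta * z` on the outcome, and the "test" is a one-sided Fisher exact test of a single inequality on a single half-line, not any of the package's tests. The data-generating process and deviation parameterisation that iv_power() actually uses are not described here, so `delta` and `simulate_once()` are hypothetical.

```r
# Illustrative only: rejection rate of a simple stand-in test over a grid of delta.
simulate_once <- function(n, delta) {
  z <- sample(0:1, n, replace = TRUE)
  d <- rbinom(n, 1, 0.3 + 0.4 * z)
  y <- rnorm(n, mean = d + delta * z)  # delta * z is the direct (excluded) effect

  # Counts of {Y <= 0, D = 1} versus everything else, by instrument arm
  in_cell <- y <= 0 & d == 1
  tab <- rbind(
    z0 = c(sum(in_cell[z == 0]), sum(!in_cell[z == 0])),
    z1 = c(sum(in_cell[z == 1]), sum(!in_cell[z == 1]))
  )

  # Reject when the Z = 0 share is significantly larger than the Z = 1 share
  fisher.test(tab, alternative = "greater")$p.value < 0.05
}

set.seed(1)
deltas <- c(0, 1, 2)
power_curve <- sapply(deltas, function(delta) {
  mean(replicate(200, simulate_once(n = 500, delta = delta)))
})
data.frame(delta = deltas, rejection_rate = power_curve)
```

The same loop structure applies to any test that returns a reject or do-not-reject decision: fix a deviation size, simulate many datasets, and record the share of rejections.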

Example: end-to-end with Card (1995)

library(ivcheck)
library(fixest)

data(card1995)   # bundled
m <- feols(
  lwage ~ age + married + black + south | college ~ near_college,
  data = card1995
)

iv_check(m, n_boot = 1000)
#> IV validity diagnostic
#>   Kitagawa (2015):     stat = 7.98, p = 0.00, reject
#>   Mourifie-Wan (2017): stat = 7.98, p = 0.00, reject
#> Overall: at least one test rejects IV validity at 0.05.

The interval-sup Kitagawa test rejects on this binary-discretised college treatment. The binding violation sits in the upper lwage interval [6.25, 7.78], where college graduates living away from a college have more mass than the test’s implied bound admits. Monte Carlo on a Card-shaped DGP with a Gaussian outcome produces an empirical size of 1.25% at a nominal 5%, so the rejection is not a size artifact: it reflects a genuine feature of Card’s empirical outcome distribution conditional on college and proximity.

This is not a rejection of Card’s original IV, which targets continuous years of schooling. The binary educ >= 16 threshold creates mixed complier subpopulations whose testable implications bite differently than in the continuous-treatment case. Users running the test on their own binary-IV designs should inspect result$binding to see which outcome interval carries the violation, and consider whether the discretisation itself is driving the finding.

Functions

| Function | Purpose |
| --- | --- |
| iv_kitagawa() | Kitagawa (2015) variance-weighted KS test. Extends to multivalued D via Sun (2023). |
| iv_mw() | Mourifie-Wan (2017) conditional-inequality test. Full CLR intersection-bounds with adaptive moment selection under covariates. |
| iv_testjfe() | Frandsen-Lefgren-Leslie (2023) test for judge / group IV designs. Supports multivalued treatment. |
| iv_check() | Wrapper that auto-detects applicable tests and runs them on a fitted IV model. |
| iv_power() | Monte Carlo power curve for sample-size planning. |

Limitations

Read before using in published work.

Scope (v0.1.0 does not cover)

Notes on fidelity to the published tests

Interpretation

Why trust this implementation

Planned for future versions

| Package | What it covers |
| --- | --- |
| fixest | Fast IV estimation via feols(y ~ x \| d ~ z) (upstream from ivcheck) |
| ivreg | 2SLS with Wu-Hausman, Sargan, weak-IV F (upstream from ivcheck) |
| ivmodel | k-class estimators, weak-IV robust CIs, sensitivity analysis |
| ivDiag | Effective F, Anderson-Rubin, valid-t, local-to-zero tests |

ivcheck complements rather than competes with these. fixest or ivreg does the estimation, ivDiag does weak-IV post-estimation diagnostics, and ivcheck does LATE-assumption falsification.

Issues and requests

Report bugs or request additional tests at GitHub Issues. Pull requests implementing additional IV-validity tests from the literature are welcome; please include a reference to the original paper and a reproduction test against its empirical example.

References

Cite both the package and the underlying paper(s) for the test you use. Package citation:

citation("ivcheck")

Test-specific references (DOIs verified via crossref.org)

| Function | Reference | DOI |
| --- | --- | --- |
| iv_kitagawa() | Kitagawa, T. (2015). A Test for Instrument Validity. Econometrica 83(5): 2043-2063. | 10.3982/ECTA11974 |
| iv_kitagawa() (multivalued D) | Sun, Z. (2023). Instrument Validity for Heterogeneous Causal Effects. Journal of Econometrics. | 10.1016/j.jeconom.2023.105628 |
| iv_mw() | Mourifie, I. and Wan, Y. (2017). Testing Local Average Treatment Effect Assumptions. Review of Economics and Statistics 99(2): 305-313. | 10.1162/REST_a_00622 |
| iv_testjfe() | Frandsen, B. R., Lefgren, L. J., Leslie, E. C. (2023). Judging Judge Fixed Effects. American Economic Review 113(1): 253-277. | 10.1257/aer.20201860 |

Foundational and methodological references

Package comparison

Keywords

instrumental variables, LATE, causal inference, exclusion restriction, monotonicity, specification testing, falsification, judge IV, Kitagawa test, Mourifie-Wan test, FLL test, econometrics.