--- title: "Augmented (Doubly-Robust) Estimation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Augmented (Doubly-Robust) Estimation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Overview The default `forest_balance()` estimator uses kernel energy balancing weights alone. When the kernel captures all confounding structure, this produces an unbiased ATE estimate. In practice, however, residual confounding may remain --- particularly when the confounding operates through complex nonlinear functions of the covariates. The **augmented** (doubly-robust) estimator combines the balancing weights with **group-specific** outcome regression models $\hat\mu_1(X) = E[Y \mid X, A{=}1]$ and $\hat\mu_0(X) = E[Y \mid X, A{=}0]$. The ATE is estimated as: $$ \hat\tau_{\mathrm{aug}} = \frac{1}{n}\sum_{i=1}^n \bigl[\hat\mu_1(X_i) - \hat\mu_0(X_i)\bigr] + \frac{\sum_i w_i A_i \bigl(Y_i - \hat\mu_1(X_i)\bigr)} {\sum_i w_i A_i} - \frac{\sum_i w_i (1 - A_i) \bigl(Y_i - \hat\mu_0(X_i)\bigr)} {\sum_i w_i (1 - A_i)}. $$ The first term is the outcome-model-based regression estimate of the ATE. The second and third terms are weighted bias corrections that use the balancing weights to account for errors in the outcome models. This has a key robustness property: the estimator is consistent if **either** the kernel (balancing weights) or the outcome models are correctly specified. When both are reasonable, the augmented estimator typically has lower bias and RMSE than either approach alone. ## Basic usage Enable augmentation with `augmented = TRUE`: ``` r set.seed(123) dat <- simulate_data(n = 2000, p = 20, ate = 0, dgp = 2) # Standard (weighting only) fit <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500) # Augmented (doubly-robust) fit_aug <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500, augmented = TRUE) c("Standard" = fit$ate, "Augmented" = fit_aug$ate, "Truth" = 0) #> Standard Augmented Truth #> 0.01664814 0.06350369 0.00000000 ``` The `print` method indicates when augmentation is active: ``` r fit_aug #> Forest Kernel Energy Balancing (cross-fitted) #> -------------------------------------------------- #> n = 2,000 (n_treated = 1042, n_control = 958) #> Trees: 500 #> Cross-fitting: 2 folds #> Augmented: yes (doubly-robust) #> Solver: direct #> ATE estimate: 0.0635 #> Fold ATEs: 0.0898, 0.0372 #> ESS: treated = 795/1042 (76%) control = 712/958 (74%) #> -------------------------------------------------- #> Use summary() for covariate balance details. ``` ## How it works with cross-fitting When both `cross.fitting = TRUE` (the default) and `augmented = TRUE`, the outcome model is cross-fitted **in lockstep** with the kernel: For each fold $k$: 1. Train the joint forest (for the kernel) on folds $-k$. 2. Train two outcome regression forests on folds $-k$: one on treated observations ($\hat\mu_1$) and one on control observations ($\hat\mu_0$). 3. Predict leaf nodes, $\hat\mu_1$, and $\hat\mu_0$ for fold $k$ out-of-sample. 4. Compute the fold-level doubly-robust ATE using the cross-fitted weights and outcome predictions. This ensures that neither the kernel nor the outcome models overfit to the estimation sample. The user does not need to manage the cross-fitting of $\hat\mu_1$ and $\hat\mu_0$ separately --- it is handled automatically. ## Simulation comparison We compare four configurations using DGP 2, which has a rich outcome surface with moderate confounding ($n = 5{,}000$, $p = 20$): ``` r set.seed(123) nreps <- 5 n <- 5000; p <- 20 methods <- c("Standard", "CF only", "Aug only", "CF + Aug") res <- matrix(NA, nreps, length(methods), dimnames = list(NULL, methods)) for (r in seq_len(nreps)) { dat <- simulate_data(n = n, p = p, ate = 0, dgp = 2) res[r, "Standard"] <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500, cross.fitting = FALSE, min.node.size = 10)$ate res[r, "CF only"] <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500)$ate res[r, "Aug only"] <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500, cross.fitting = FALSE, augmented = TRUE, min.node.size = 10)$ate res[r, "CF + Aug"] <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500, augmented = TRUE)$ate } ``` Table: DGP 2: n = 5,000, p = 20, true ATE = 0, 5 reps. |Method | Bias| SD| RMSE| |:--------|------:|------:|------:| |Standard | 0.1204| 0.0437| 0.1281| |CF only | 0.0151| 0.0673| 0.0690| |Aug only | 0.0515| 0.0535| 0.0743| |CF + Aug | 0.0055| 0.0609| 0.0611| The combination of cross-fitting and augmentation (CF + Aug) achieves the lowest bias. Cross-fitting alone handles the kernel overfitting, while augmentation accounts for outcome variation that the kernel does not fully capture. DGP 2 has a rich outcome surface depending on 8 covariates, of which only 2 are confounders, making the outcome model particularly valuable. ## User-supplied outcome predictions You can supply your own group-specific outcome predictions via `mu.hat` as a list with components `mu1` (predictions of $E[Y|X, A{=}1]$) and `mu0` (predictions of $E[Y|X, A{=}0]$): ``` r set.seed(123) dat <- simulate_data(n = 1000, p = 20, ate = 0, dgp = 2) # Fit your own group-specific outcome models df_all <- data.frame(Y = dat$Y, dat$X) df1 <- df_all[dat$A == 1, ] df0 <- df_all[dat$A == 0, ] lm1 <- lm(Y ~ ., data = df1) lm0 <- lm(Y ~ ., data = df0) mu_custom <- list( mu1 = predict(lm1, newdata = df_all), mu0 = predict(lm0, newdata = df_all) ) fit_custom <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500, mu.hat = mu_custom) fit_custom #> Forest Kernel Energy Balancing (cross-fitted) #> -------------------------------------------------- #> n = 1,000 (n_treated = 505, n_control = 495) #> Trees: 500 #> Cross-fitting: 2 folds #> Augmented: yes (doubly-robust) #> Solver: direct #> ATE estimate: 0.1576 #> Fold ATEs: 0.2724, 0.0428 #> ESS: treated = 391/505 (77%) control = 380/495 (77%) #> -------------------------------------------------- #> Use summary() for covariate balance details. ``` When `mu.hat` is supplied with `cross.fitting = TRUE`, the package uses the provided predictions as-is and emits a message reminding you to ensure they were cross-fitted externally. ## When to use augmentation Augmentation is most beneficial when: - **The confounding is nonlinear** and the kernel alone cannot fully capture it. - **The outcome model is reasonably accurate**, even if not perfectly specified. - **You want robustness** against partial misspecification of either the kernel or the outcome model. The computational cost of augmentation is modest: fitting two additional `regression_forest` models (one per treatment group) per fold, which is typically fast relative to the joint forest and kernel construction. ## References Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994). Estimation of regression coefficients when some regressors are not always observed. *Journal of the American Statistical Association*, 89(427), 846--866. De, S. and Huling, J.D. (2025). Data adaptive covariate balancing for causal effect estimation for high dimensional data. *arXiv preprint arXiv:2512.18069*.