--- title: "Performance and Scalability" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Performance and Scalability} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Overview This vignette benchmarks the `forestBalance` pipeline to characterize how speed and memory scale with sample size ($n$) and number of trees ($B$). The pipeline has four stages, each optimized: 1. **Forest fitting** (`grf::multi_regression_forest`) -- C++ via grf. 2. **Leaf node extraction** (`get_leaf_node_matrix`) -- C++ via Rcpp. 3. **Indicator matrix construction** (`leaf_node_kernel_Z`) -- C++ via Rcpp, building CSC sparse format directly without sorting. 4. **Weight computation** (`kernel_balance`) -- adaptive solver: - **Direct**: sparse Cholesky on treated/control sub-blocks. Preferred for smaller problems. - **CG**: conjugate gradient using the factored $Z$ representation. No kernel matrix is formed. Preferred for large $n$ or dense kernels. ``` r library(grf) library(Matrix) ``` ## Mathematical background ### The kernel energy balancing system Given a kernel matrix $K \in \mathbb{R}^{n \times n}$ and a binary treatment vector $A \in \{0,1\}^n$ with $n_1$ treated and $n_0$ control units, the kernel energy balancing weights $w$ are obtained by solving the linear system $$ K_q \, w = z, $$ where $K_q$ is the *modified kernel* and $z$ is a right-hand side vector determined by a linear constraint. The modified kernel is defined element-wise as $$ K_q(i,j) = \frac{A_i \, A_j \, K(i,j)}{n_1^2} + \frac{(1-A_i)(1-A_j) \, K(i,j)}{n_0^2}. $$ A critical structural observation is that $K_q(i,j) = 0$ whenever $A_i \neq A_j$: the treated--control cross-blocks are identically zero. Therefore $K_q$ is **block-diagonal**: $$ K_q = \begin{pmatrix} K_{tt} / n_1^2 & 0 \\ 0 & K_{cc} / n_0^2 \end{pmatrix}, $$ where $K_{tt} = K[A{=}1,\; A{=}1]$ is the $n_1 \times n_1$ treated sub-block and $K_{cc} = K[A{=}0,\; A{=}0]$ is the $n_0 \times n_0$ control sub-block. 
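The block-diagonal structure is easy to verify numerically. The following sketch (base R only; a random PSD matrix stands in for the forest kernel) builds $K_q$ element-wise from the definition above and confirms the treated--control cross-blocks vanish:

``` r
set.seed(1)
n  <- 8
A  <- rep(c(1L, 0L), each = n / 2)   # 4 treated, 4 control
n1 <- sum(A); n0 <- sum(1 - A)

# Any symmetric PSD matrix works as a stand-in kernel here
X <- matrix(rnorm(n * 3), n, 3)
K <- tcrossprod(X)

# Modified kernel, element-wise as defined above
Kq <- outer(A, A) * K / n1^2 + outer(1 - A, 1 - A) * K / n0^2

# Cross-blocks between treated and control are identically zero
max(abs(Kq[A == 1, A == 0]))                              # 0
# The treated block is K[A==1, A==1] / n1^2
all.equal(Kq[A == 1, A == 1], K[A == 1, A == 1] / n1^2)   # TRUE
```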
The right-hand side vector $z$ has a similarly separable structure. Writing $\mathbf{r} = K \mathbf{1}$ (the row sums of $K$), we have
$$
z_i = \frac{A_i \, r_i}{n_1 \, n} + \frac{(1-A_i) \, r_i}{n_0 \, n}.
$$
The full system decomposes into two **independent** sub-problems:

- **Treated block:** solve $K_{tt} \, w_t = n_1^2 \, z_t$ of size $n_1$,
- **Control block:** solve $K_{cc} \, w_c = n_0^2 \, z_c$ of size $n_0$,

where $z_t$ and $z_c$ are the constraint-adjusted right-hand sides for each group (each requiring two preliminary solves to determine the Lagrange multiplier).

### The kernel factorization

The proximity kernel is $K = Z Z^\top / B$, where $Z$ is a sparse $n \times L$ indicator matrix ($L = \sum_{b=1}^B L_b$, the total number of leaves across all trees). Each row of $Z$ has exactly $B$ nonzero entries (one per tree), so $Z$ has $nB$ nonzeros total. The $Z$ matrix is constructed in C++ directly in compressed sparse column (CSC) format, avoiding the overhead of R's triplet sorting.

### Direct solver (block Cholesky)

For moderate $n$, we form the sub-block kernels explicitly:
$$
K_{tt} = Z_t Z_t^\top / B, \qquad K_{cc} = Z_c Z_c^\top / B,
$$
where $Z_t = Z[A{=}1, \,\cdot\,]$ and $Z_c = Z[A{=}0, \,\cdot\,]$. Each sub-block is computed via a sparse `tcrossprod`. The linear systems are then solved by sparse Cholesky factorization.

**Computational Cost:** $O(nB \cdot \bar{s})$ for the two sub-block cross-products (where $\bar{s}$ is the average leaf size), plus $O(n_1^{3/2} + n_0^{3/2})$ for sparse Cholesky (in the best case).

### CG solver (matrix-free)

For large $n$, forming $K_{tt}$ and $K_{cc}$ becomes expensive. The conjugate gradient (CG) solver avoids forming *any* kernel matrix by operating on the factored representation.
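The factored representation can be illustrated on a toy forest (the leaf assignments below are hypothetical, and this uses `Matrix::sparseMatrix` rather than the package's C++ builder): each row of $Z$ carries one nonzero per tree, and `tcrossprod` recovers the proximity kernel with unit diagonal.

``` r
library(Matrix)

n <- 6; B <- 3                       # 6 units, 3 trees (toy sizes)
# leaf_id[i, b]: leaf of unit i in tree b (hypothetical assignments)
leaf_id <- cbind(c(1,1,2,2,3,3), c(1,2,1,2,1,2), c(1,1,1,2,2,2))
L_b  <- apply(leaf_id, 2, max)       # leaves per tree
offs <- cumsum(c(0, L_b[-B]))        # column offset of each tree's leaf block

# Sparse n x L indicator: exactly B nonzeros per row, one per tree
Z <- sparseMatrix(i = rep(seq_len(n), B),
                  j = as.vector(leaf_id + rep(offs, each = n)),
                  x = 1)
K <- tcrossprod(Z) / B               # proximity kernel K = Z Z^T / B

rowSums(Z != 0)                      # each row has B = 3 nonzeros
diag(as.matrix(K))                   # all 1: a unit always shares its own leaf
```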
To solve
$$
K_{tt} \, x = r \quad \Longleftrightarrow \quad Z_t Z_t^\top x = B \, r,
$$
CG only needs matrix--vector products of the form
$$
v \;\mapsto\; Z_t \bigl(Z_t^\top v\bigr),
$$
each costing $O(n_1 B)$ via two sparse matrix--vector multiplies. The same applies to the control block with $Z_c$. Memory use is $O(nB)$ for $Z$ alone, versus $O(n^2)$ for the kernel.

Each CG iteration costs $O(n_g B)$ (where $n_g$ is the group size). Convergence typically requires 100--200 iterations, largely independent of $n$, so the total cost is $O(n_g B \cdot T_{\text{iter}})$. Only four solves are needed (two per block), since the third right-hand side is a linear combination of the first two.

**Computational Cost:** $O\bigl((n_1 + n_0) \cdot B \cdot T_{\text{iter}}\bigr)$ where $T_{\text{iter}} \approx 100\text{--}200$. This scales linearly in both $n$ and $B$.

### Block Jacobi preconditioned CG

Plain CG can require many iterations at large $n$. The **Block Jacobi** solver uses the first tree's leaf partition to build a block-diagonal preconditioner. Each leaf block is a small dense system (~`min.node.size` $\times$ `min.node.size`) that is cheap to factor. Preconditioned CG then converges in far fewer iterations---typically 5--10$\times$ fewer than plain CG---giving a large overall speedup.

## Solver comparison

We compare all three solvers at $n = 10{,}000$ and $n = 25{,}000$:

``` r
# Fit one forest per n, then time each solver on the SAME kernel system.
solver_bench <- do.call(rbind, lapply(c(10000, 25000), function(nn) { set.seed(123) dat <- simulate_data(n = nn, p = 10) B_val <- 1000 mns <- max(20L, min(floor(nn / 200) + 10, floor(nn / 50))) forest <- grf::multi_regression_forest( dat$X, scale(cbind(dat$A, dat$Y)), num.trees = B_val, min.node.size = mns ) leaf_mat <- get_leaf_node_matrix(forest, dat$X) Z <- leaf_node_kernel_Z(leaf_mat) solvers <- if (nn <= 10000) c("bj", "cg", "direct") else c("bj", "cg") do.call(rbind, lapply(solvers, function(s) { if (s == "direct") { K <- leaf_node_kernel(leaf_mat) t <- system.time( w <- kernel_balance(dat$A, kern = K, solver = "direct") )["elapsed"] } else { t <- system.time( w <- kernel_balance(dat$A, Z = Z, leaf_matrix = leaf_mat, num.trees = B_val, solver = s) )["elapsed"] } ate <- weighted.mean(dat$Y[dat$A == 1], w$weights[dat$A == 1]) - weighted.mean(dat$Y[dat$A == 0], w$weights[dat$A == 0]) data.frame(n = nn, solver = s, time = t, ate = ate) })) })) ``` Table: Solver comparison on the same kernel (B = 1,000). ATE agreement confirms solvers find the same solution. | n|Solver | Time (s)| ATE| |-----:|:------|--------:|--------:| | 10000|bj | 5.54| 0.015260| | 10000|cg | 4.26| 0.015261| | 10000|direct | 17.39| 0.015260| | 25000|bj | 30.47| 0.006337| | 25000|cg | 40.63| 0.006344| All solvers produce the same ATE to high precision, confirming they solve the same system. CG (Rcpp) is the robust default; Block Jacobi can be faster in specific regimes (moderate leaf size, many trees) and is available via `solver = "bj"`. 
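The matrix-free product at the heart of the CG solver can also be checked directly: applying $v \mapsto Z(Z^\top v)/B$ gives the same result as multiplying by the explicit kernel, without ever forming an $n \times n$ matrix. This sketch uses a random sparse $Z$, not the package's solver:

``` r
library(Matrix)

set.seed(42)
n <- 200; B <- 50; L <- 400          # toy sizes
# Random indicator-style Z with one entry per (unit, tree) pair
Z <- sparseMatrix(i = rep(seq_len(n), B),
                  j = sample(L, n * B, replace = TRUE),
                  x = 1, dims = c(n, L))
v <- rnorm(n)

# Matrix-free: two sparse mat-vecs, O(nB) work, no n x n matrix formed
mf <- as.vector(Z %*% crossprod(Z, v)) / B

# Explicit: form K = Z Z^T / B first (what the direct solver would do)
K  <- tcrossprod(Z) / B
ex <- as.vector(K %*% v)

max(abs(mf - ex))                    # ~0 (machine precision)
```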
## End-to-end timing

We benchmark the full `forest_balance()` pipeline (forest fitting, leaf extraction, Z construction, and weight computation) across a range of sample sizes up to $n = 50{,}000$ with $B = 1{,}000$ trees:

``` r
n_vals <- c(500, 1000, 2500, 5000, 10000, 25000, 50000)
B <- 1000
p <- 10

bench <- do.call(rbind, lapply(n_vals, function(nn) {
  set.seed(123)
  dat <- simulate_data(n = nn, p = p)

  # Auto (default)
  t_auto <- system.time(
    fit_auto <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B)
  )["elapsed"]

  data.frame(n = nn, trees = B, auto = t_auto, auto_solver = fit_auto$solver)
}))
```

Table: Full pipeline time (auto solver).

|     n| Trees| Time (s)|Solver |
|-----:|-----:|--------:|:------|
|   500|  1000|     0.15|cg     |
|  1000|  1000|     0.18|direct |
|  2500|  1000|     0.62|direct |
|  5000|  1000|     2.20|direct |
| 10000|  1000|     9.51|direct |
| 25000|  1000|    38.16|cg     |
| 50000|  1000|   115.98|cg     |

![plot of chunk timing-plot](performance-timing-plot-1.png)

The solver is selected automatically based on the per-fold sample size and the adaptive leaf size (large leaves relative to the fold size also push selection toward CG, which is why $n = 500$ uses CG). The matrix-free CG solver activates for large $n$ and provides the best scaling.

## Scaling with number of trees

``` r
tree_vals <- c(200, 500, 1000, 2000)
n_test <- c(1000, 5000, 25000)

tree_bench <- do.call(rbind, lapply(n_test, function(nn) {
  do.call(rbind, lapply(tree_vals, function(B) {
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    t <- system.time(
      fit <- forest_balance(dat$X, dat$A, dat$Y, num.trees = B)
    )["elapsed"]
    data.frame(n = nn, trees = B, time = t)
  }))
}))
```

Table: Pipeline time across tree counts.

|     n| Trees| Time (s)|
|-----:|-----:|--------:|
|  1000|   200|     0.05|
|  1000|   500|     0.10|
|  1000|  1000|     0.17|
|  1000|  2000|     0.34|
|  5000|   200|     0.93|
|  5000|   500|     1.30|
|  5000|  1000|     2.09|
|  5000|  2000|     3.97|
| 25000|   200|    10.00|
| 25000|   500|    23.66|
| 25000|  1000|    37.66|
| 25000|  2000|    66.03|

![plot of chunk tree-plot](performance-tree-plot-1.png)

The pipeline scales approximately linearly in $B$ at each fixed $n$; scaling in $n$ is superlinear at the largest sizes, where the CG solver's iteration cost dominates.
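One way to quantify the tree-count scaling is to fit a log-log slope to the timings above. At $n = 25{,}000$ the empirical exponent in $B$ comes out a little below 1, consistent with near-linear scaling plus fixed per-run overhead (times copied from the table above):

``` r
# Times at n = 25,000 from the tree-count table above
trees <- c(200, 500, 1000, 2000)
secs  <- c(10.00, 23.66, 37.66, 66.03)

fit <- lm(log(secs) ~ log(trees))
unname(coef(fit)[2])                 # empirical scaling exponent in B, ~0.8
```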
## Pipeline stage breakdown The following experiment profiles each stage of the pipeline individually to show where time is spent at different scales: ``` r breakdown <- do.call(rbind, lapply( list(c(1000, 500), c(5000, 500), c(5000, 2000), c(25000, 500), c(25000, 1000)), function(cfg) { nn <- cfg[1]; B_val <- cfg[2] set.seed(123) dat <- simulate_data(n = nn, p = 10) mns <- max(20L, min(floor(nn / 200) + 10, floor(nn / 50))) # 1. Forest fitting t_fit <- system.time({ resp <- scale(cbind(dat$A, dat$Y)) forest <- grf::multi_regression_forest(dat$X, Y = resp, num.trees = B_val, min.node.size = mns) })["elapsed"] # 2. Leaf extraction (Rcpp) t_leaf <- system.time( leaf_mat <- get_leaf_node_matrix(forest, dat$X) )["elapsed"] # 3. Z construction (Rcpp) t_z <- system.time( Z <- leaf_node_kernel_Z(leaf_mat) )["elapsed"] # 4. Weight computation (kernel_balance) t_bal <- system.time( w <- kernel_balance(dat$A, Z = Z, num.trees = B_val, solver = "cg") )["elapsed"] data.frame(n = nn, trees = B_val, mns = mns, fit = t_fit, leaf = t_leaf, Z = t_z, balance = t_bal, total = t_fit + t_leaf + t_z + t_bal) } )) ``` Table: Wall-clock time for each pipeline stage across configurations (CG solver, single fold). | n| Trees| mns| Forest fit (s)| Leaf extract (s)| Z build (s)| Balance (s)| Total (s)| |-----:|-----:|---:|--------------:|----------------:|-----------:|-----------:|---------:| | 1000| 500| 20| 0.03| 0.02| 0.004| 0.10| 0.15| | 5000| 500| 35| 0.17| 0.11| 0.012| 0.65| 0.94| | 5000| 2000| 35| 0.66| 0.42| 0.102| 2.33| 3.52| | 25000| 500| 135| 0.94| 0.53| 0.069| 26.14| 27.68| | 25000| 1000| 135| 1.87| 1.06| 0.181| 41.14| 44.26| ![plot of chunk breakdown-plot](performance-breakdown-plot-1.png) At small $n$, forest fitting and the balance solver take similar time. At large $n$, the CG solver dominates --- its cost grows with the number of iterations and the sparse matrix--vector product size. The C++ stages (leaf extraction and $Z$ construction) remain a small fraction throughout. 
## Memory usage

When the direct solver is used, the kernel is formed as a sparse matrix. For the CG solver, only the sparse indicator matrix $Z$ is stored. The table below shows the actual memory usage compared to the theoretical dense kernel:

``` r
mem_data <- do.call(rbind, lapply(
  c(500, 1000, 2500, 5000, 10000, 25000, 50000),
  function(nn) {
    set.seed(123)
    dat <- simulate_data(n = nn, p = 10)
    B_val <- 1000
    mns <- max(20L, min(floor(nn / 200) + ncol(dat$X), floor(nn / 50)))

    fit_forest <- grf::multi_regression_forest(
      dat$X, scale(cbind(dat$A, dat$Y)),
      num.trees = B_val, min.node.size = mns
    )
    leaf_mat <- get_leaf_node_matrix(fit_forest, dat$X)
    Z <- leaf_node_kernel_Z(leaf_mat)
    z_mb <- as.numeric(object.size(Z)) / 1e6

    # Kernel sparsity (skip forming K for very large n)
    if (nn <= 10000) {
      K <- leaf_node_kernel(leaf_mat)
      pct_nz <- round(100 * length(K@x) / as.numeric(nn)^2, 1)
      k_mb <- round(as.numeric(object.size(K)) / 1e6, 1)
    } else {
      pct_nz <- NA
      k_mb <- NA
    }

    n_fold <- nn %/% 2
    auto_solver <- if (n_fold > 5000 || mns > n_fold / 20) "cg" else "direct"
    dense_mb <- round(8 * as.numeric(nn)^2 / 1e6, 0)

    data.frame(n = nn, mns = mns, pct_nz = pct_nz, sparse_MB = k_mb,
               Z_MB = round(z_mb, 1), dense_MB = dense_mb,
               solver = auto_solver)
  }
))
```

Table: Memory usage with adaptive leaf size (p = 10).

|     n| mns|Solver |K % nonzero |Stored     | Actual (MB)| Dense K (MB)|Actual / Dense |
|-----:|---:|:------|:-----------|:----------|-----------:|------------:|:--------------|
|   500|  20|cg     |39.4%       |Z (factor) |         6.0|            2|300%           |
|  1000|  20|direct |28.8%       |K (sparse) |         3.5|            8|43.8%          |
|  2500|  22|direct |20.1%       |K (sparse) |        15.1|           50|30.2%          |
|  5000|  35|direct |16%         |K (sparse) |        48.1|          200|24%            |
| 10000|  60|direct |12.6%       |K (sparse) |       151.7|          800|19%            |
| 25000| 135|cg     |--          |Z (factor) |       300.4|         5000|6%             |
| 50000| 260|cg     |--          |Z (factor) |       600.3|        20000|3%             |

![plot of chunk memory-plot](performance-memory-plot-1.png)

At $n = 50{,}000$, a dense kernel would require **20 GB**.
The CG solver stores only the $Z$ matrix, using a small fraction of this.

## Summary

- The full `forest_balance()` pipeline scales to **$n = 50{,}000$** with 1,000 trees.
- All computationally intensive steps are in compiled code: C++ for leaf extraction and $Z$ construction (Rcpp), the Matrix package's compiled sparse routines for matrix products, and CHOLMOD for sparse Cholesky.
- The adaptive `min.node.size` heuristic adjusts leaf size to both $n$ and $p$, creating denser kernels than the old default of 10.
- The solver switchover between direct and CG depends on both the per-fold sample size and the kernel density. The `"auto"` setting handles this transparently.
- The pipeline scales approximately **linearly** in $B$; cost in $n$ grows somewhat faster than linearly at the largest sizes, driven by the CG solver.