--- title: "Getting Started with TemporalHazard" subtitle: "Build, fit, and predict with temporal parametric hazard models" format: html: code-fold: false vignette: > %\VignetteIndexEntry{Getting Started with TemporalHazard} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ```{r} #| include: false has_pkg <- requireNamespace("TemporalHazard", quietly = TRUE) has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = has_pkg ) ``` TemporalHazard fits *parametric hazard models* to time-to-event data. If you've worked with `survival::coxph()` or Kaplan-Meier curves before you already have most of the vocabulary; this is a quick refresh of the bits that matter for what follows. A **hazard** is the instantaneous risk of the event at time $t$, given the subject has survived until $t$. A Kaplan-Meier curve reflects it implicitly: KM gives you survival probabilities, and the rate at which those probabilities drop is (minus) the hazard. A **parametric** hazard model writes the hazard as a smooth function of $t$ with a small number of parameters, plus covariate effects. The payoff over nonparametric KM is a smooth curve you can extrapolate, patient-specific risk predictions, and the ability to compare fitted hazard *shapes* across groups or models. This vignette walks the minimal workflow. We start with a single-distribution Weibull fit on simulated data — fit, summary, predict — then move to the multiphase fit on CABGKUL, the canonical three-phase cardiac-surgery example. ## A first Weibull fit The simulated data below has 180 patients with three plausible risk covariates — age, NYHA functional class, and cardiogenic shock. We fit a single Weibull hazard: a scale parameter (`mu`) and a shape exponent (`nu`) that lets the hazard accelerate when `nu > 1`, decelerate when `nu < 1`, or stay flat at `nu = 1`. The Weibull is the natural starting point for a single-distribution fit because it covers all three monotone shapes with just two parameters. The `hazard()` interface takes a survival formula (`Surv(time, status)` on the left, covariates on the right), a starting-value vector (`theta`), and a distribution name. Passing `fit = TRUE` runs the optimizer and returns a fitted model; passing `fit = FALSE` returns the model specification without fitting, which is useful for sanity-checking the formula and starting values before committing to the optimizer. ```{r} library(TemporalHazard) library(survival) set.seed(1001) n <- 180 dat <- data.frame( time = rexp(n, rate = 0.35) + 0.05, status = rbinom(n, size = 1, prob = 0.6), age = rnorm(n, mean = 62, sd = 11), nyha = sample(1:4, n, replace = TRUE), shock = rbinom(n, size = 1, prob = 0.18) ) fit <- TemporalHazard::hazard( Surv(time, status) ~ age + nyha + shock, data = dat, theta = c(mu = 0.25, nu = 1.10, beta1 = 0, beta2 = 0, beta3 = 0), dist = "weibull", fit = TRUE, control = list(maxit = 300) ) summary(fit) ``` ## Prediction workflow A fitted hazard model lets you score new patients without refitting. The `predict()` method takes a `newdata` frame and a `type` argument that selects which quantity to compute: - `"linear_predictor"` — the covariate-driven log-hazard-ratio for each row, $x^\top \beta$. - `"hazard"` — the hazard multiplier, $\exp(x^\top \beta)$. - `"survival"` — the probability of surviving past each row's `time`, $S(t \mid x)$. - `"cumulative_hazard"` — the integrated hazard up to each row's `time`, $H(t \mid x) = -\log S(t \mid x)$. The three rows below are made-up patients spanning the covariate range in the training data: a younger, low-NYHA, no-shock patient at an early time; a middle case at the median; and an older, high-NYHA, shock patient further out. We compute all four prediction types for each. ```{r} new_patients <- data.frame( time = c(0.5, 1.5, 3.0), age = c(50, 65, 75), nyha = c(1, 3, 4), shock = c(0, 0, 1) ) pred_input <- new_patients new_patients$linear_predictor <- predict(fit, newdata = pred_input, type = "linear_predictor") new_patients$hazard_multiplier <- predict(fit, newdata = pred_input, type = "hazard") new_patients$survival <- predict(fit, newdata = pred_input, type = "survival") new_patients$cumulative_hazard <- predict(fit, newdata = pred_input, type = "cumulative_hazard") new_patients ``` ## Visualizing predicted survival The summary table and the patient-level scores tell us the fit *converged* and what it *predicts*, but not whether the predicted shape matches the data. For that we plot the model. Below we build a smooth survival curve over a fine time grid for a median-profile patient and overlay the Kaplan-Meier nonparametric estimate from the raw data on the same axes. If the parametric curve hugs the KM step function the Weibull shape is honest to the data; visible drift in either direction flags a model the optimizer fit happily but that doesn't actually describe the cohort. ```{r} #| label: fig-survival #| fig-cap: "Parametric Weibull survival curve with Kaplan-Meier overlay" #| fig-width: 7 #| fig-height: 5 #| eval: !expr requireNamespace("TemporalHazard", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE) library(ggplot2) # Parametric curve on a fine grid t_grid <- seq(0.05, max(dat$time), length.out = 80) curve_df <- data.frame( time = t_grid, age = median(dat$age), nyha = 2, shock = 0 ) curve_df$survival <- predict(fit, newdata = curve_df, type = "survival") * 100 # Kaplan-Meier empirical overlay km <- survival::survfit(survival::Surv(time, status) ~ 1, data = dat) km_df <- data.frame(time = km$time, survival = km$surv * 100) ggplot() + geom_step(data = km_df, aes(time, survival, colour = "Kaplan-Meier")) + geom_line(data = curve_df, aes(time, survival, colour = "Parametric (Weibull)")) + scale_colour_manual( values = c("Parametric (Weibull)" = "#0072B2", "Kaplan-Meier" = "#D55E00") ) + scale_y_continuous(limits = c(0, 100)) + labs(x = "Months after surgery", y = "Freedom from death (%)", colour = NULL) + theme_minimal() + theme(legend.position = "bottom") ``` ## Multiphase models The Weibull fit above describes the hazard with one smooth, monotone curve — it can rise, fall, or stay flat over time, but it cannot change direction. That's fine for some processes (light bulbs failing, parts wearing out). It breaks down quickly for clinical data. Consider CABG (coronary artery bypass graft) surgery. In the days right after the operation the risk of death is at its highest: the patient is in the ICU, exposed to operative complications, infections, bleeding. If they survive that window the risk drops to a low, roughly flat background rate: they recover, go home, live their lives. Years later the risk rises again as the grafts begin to deteriorate and the cohort ages. Three distinct regimes, each with its own clinical mechanism. No single Weibull (or log-normal, or any other monotone distribution) can fit all three at once. The optimizer would have to compromise: picking a shape that's mediocre everywhere instead of right in any one regime. You'd predict too few early deaths, the wrong slope in the middle, and too few late deaths. The **multiphase decomposition** is the workaround. Instead of asking one distribution to do everything, we describe the total cumulative hazard as a *sum* of phase-specific contributions, each with its own temporal shape: $$H(t \mid x) = \sum_{j=1}^{J} \mu_j(x) \cdot \Phi_j(t)$$ Each $\Phi_j(t)$ is a phase-specific *temporal shape*: a unit-scaled curve that says *how* phase $j$'s hazard accumulates over time. Each $\mu_j(x)$ is the phase-specific *scale*: how much cumulative hazard belongs to phase $j$, possibly modulated by covariates. The phases overlap and their hazards add. There's no switching, no thresholds, no piecewise joins; at any time $t$ the instantaneous hazard $h(t) = dH/dt$ is the sum of the per-phase contributions $\mu_j \cdot \phi_j(t)$, where $\phi_j = d\Phi_j/dt$. ### Why multiple phases? Because the number of phases is the modeling lever that lets you match the data's actual structure. Some processes are well-described by two phases (early failure + steady-state). Most cardiac-surgery cohorts need three: early operative risk + constant background + late deterioration. A small handful — acute aortic dissection is the classic — need four (operative + subacute + constant + late). The framework is general; the *count* is a modeling choice driven by clinical knowledge and the shape of the Kaplan-Meier cumulative hazard curve. For the CABGKUL fit below we use three phases. That's not a property of the package; it's a property of this dataset. Each phase is specified with `hzr_phase()`. The first argument picks the shape function: `"cdf"` for a saturating curve bounded between 0 and 1 (the SAS "early" / G1 shape), `"constant"` for a flat hazard plateau (SAS "G2"), `"g3"` for the polynomial late-rising shape from the SAS "late" library. Remaining arguments set the shape's free parameters, or fix them with `fixed = "shapes"`. The Phase types section below has the full menu. For this fit we use the textbook three-phase decomposition: a saturating early peak, a constant background, and a polynomial late rise. We fix the shapes so we can compare the fitted scales against the published reference; in your own work you would usually estimate them. ```{r} #| label: multiphase-fit # CABGKUL is the benchmark dataset for 3-phase decomposition (n = 5,880) data(cabgkul) fit_mp <- hazard( Surv(int_dead, dead) ~ 1, data = cabgkul, dist = "multiphase", phases = list( early = hzr_phase("cdf", t_half = 0.2, nu = 1, m = 1, fixed = "shapes"), constant = hzr_phase("constant"), late = hzr_phase("g3", tau = 1, gamma = 3, alpha = 1, eta = 1, fixed = "shapes") ), fit = TRUE, control = list(n_starts = 5, maxit = 1000) ) summary(fit_mp) ``` The summary marks most rows `NA` in the `std_error`, `z_stat`, and `p_value` columns. Two mechanisms produce these, both deliberate: - The seven shape parameters carry `fixed = "shapes"` and act as user-supplied constants (`t_half`, `nu`, `m` on the early phase; `tau`, `gamma`, `alpha`, `eta` on the late phase). The optimizer never moves them, so they have no estimated variance to report. - One of the three `log_mu` scale parameters is also `NA`. By default the multiphase optimizer enforces Conservation of Events (Turner's theorem): one phase's `log_mu` is solved analytically at every iteration so that total predicted events equal total observed. That phase is not a free parameter at the MLE, so its inverse-Hessian variance is degenerate, and the table marks it `NA` rather than print a misleading number. Pass `control = list(conserve = FALSE)` to disable Conservation of Events and estimate all three scale parameters freely. The two remaining `log_mu` rows are the optimizer-estimated free parameters; their Wald statistics reflect the two-parameter fit dimension that remains after the Conservation-of-Events constraint absorbs the third `log_mu`. ### Decomposed hazard visualization The MLE numbers in the summary tell us *what* the model committed to, but not *how* the three phases fit together over time. For that we need the per-phase contributions. The `decompose = TRUE` argument on `predict()` returns one cumulative-hazard column per phase plus the total; we numerically differentiate each one (a coarse first-difference is fine here, since we just want to look) to get an instantaneous hazard rate. The plot that follows is the diagnostic that matters: it shows whether the model carved up the timeline the way we expected. The early phase should dominate near $t = 0$ and die off. The constant phase should be a flat floor. The late phase should be near zero early and rise after a lag. If any of those shapes looks wrong, the fix is in the starting values or in the choice of shape function, not in the data. ```{r} #| label: fig-hazard-phases #| fig-cap: "Additive phase decomposition: total hazard (solid) = early + constant + late (dashed)" #| fig-width: 7 #| fig-height: 4.5 #| eval: !expr requireNamespace("TemporalHazard", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE) t_grid <- seq(0.01, max(cabgkul$int_dead) * 0.95, length.out = 200) nd <- data.frame(time = t_grid) # decompose = TRUE returns per-phase cumulative hazard columns decomp <- predict(fit_mp, newdata = nd, type = "cumulative_hazard", decompose = TRUE) # Numerical differentiation: h(t) ≈ ΔH(t) / Δt num_hazard <- function(cumhaz, time) { dt <- diff(time) dH <- diff(cumhaz) c(dH[1] / dt[1], dH / dt) } h_long <- rbind( data.frame(time = t_grid, hazard = num_hazard(decomp$early, t_grid), Phase = "Early"), data.frame(time = t_grid, hazard = num_hazard(decomp$constant, t_grid), Phase = "Constant"), data.frame(time = t_grid, hazard = num_hazard(decomp$late, t_grid), Phase = "Late"), data.frame(time = t_grid, hazard = num_hazard(decomp$total, t_grid), Phase = "Total") ) h_long$Phase <- factor(h_long$Phase, levels = c("Total", "Early", "Constant", "Late")) ggplot(h_long, aes(time, hazard, colour = Phase, linetype = Phase)) + geom_line(aes(linewidth = Phase)) + scale_colour_manual(values = c(Total = "#222222", Early = "#E69F00", Constant = "#56B4E9", Late = "#CC79A7")) + scale_linetype_manual(values = c(Total = "solid", Early = "dashed", Constant = "dashed", Late = "dashed")) + scale_linewidth_manual(values = c(Total = 1.3, Early = 0.7, Constant = 0.7, Late = 0.7)) + labs(x = "Months after CABG", y = "Hazard rate", colour = "Phase", linetype = "Phase", linewidth = "Phase") + theme_minimal() + theme(legend.position = "bottom") ``` ### Multiphase survival with Kaplan-Meier overlay The decomposition plot tells us the model has the right *shape*; the overlay plot tells us it has the right *level*. Plotting the parametric survival curve against the Kaplan-Meier estimate on the same axis is the single most useful diagnostic for a hazard fit. KM is the gold-standard nonparametric reference; if the parametric curve drifts away from it, something is wrong with the model that more iterations won't fix. ```{r} #| label: fig-mp-survival #| fig-cap: "Multiphase parametric survival vs. Kaplan-Meier" #| fig-width: 7 #| fig-height: 4.5 #| eval: !expr requireNamespace("TemporalHazard", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE) surv_df <- data.frame( time = t_grid, survival = predict(fit_mp, newdata = nd, type = "survival") * 100 ) km <- survfit(Surv(int_dead, dead) ~ 1, data = cabgkul) km_df <- data.frame(time = km$time, survival = km$surv * 100) ggplot() + geom_step(data = km_df, aes(time, survival, colour = "Kaplan-Meier"), linewidth = 0.6) + geom_line(data = surv_df, aes(time, survival, colour = "Multiphase model"), linewidth = 1.1) + scale_colour_manual( values = c("Multiphase model" = "#0072B2", "Kaplan-Meier" = "#D55E00") ) + scale_y_continuous(limits = c(0, 100)) + labs(x = "Months after CABG", y = "Freedom from death (%)", colour = NULL) + theme_minimal() + theme(legend.position = "bottom") ``` ### Phase types Picking the right phase shape is the single biggest modeling decision in a multiphase fit. The package ships the canonical shapes from the SAS C HAZARD library: - **`"cdf"`** — a saturating curve $\Phi(t) \in [0, 1]$, parameterized by `t_half` (the time at which $G(t_{1/2}) = 0.5$), `nu` (a time exponent), and `m` (a shape exponent). This is the SAS "early" or G1 shape. Use it for the early phase, or for any phase whose accumulated hazard saturates rather than growing without bound. - **`"constant"`** — $\Phi(t) = t$. The phase's hazard rate is flat; there are no shape parameters to estimate, only the scale $\mu$. This is the SAS G2 shape. Use it for a background plateau. - **`"g3"`** — the SAS "late" shape, parameterized by `tau` (a time-scale), `gamma` (a time exponent), `alpha` (a shape parameter; $\alpha = 0$ gives an exponential limit), and `eta` (an outer exponent). Use it for a late phase that is near-zero early and accelerates over time — graft deterioration over years is the archetype. See `?hzr_phase` for the additional `"hazard"` shape and the full parameterization of each. In practice you choose phase shapes by looking at the Kaplan-Meier cumulative hazard plot, identifying which regimes are present, and matching each regime to the shape family whose behavior fits. Then you fit, look at the decomposition plot, and adjust starting values or fix shapes if a phase doesn't land where you expected. ```{r} #| eval: false hzr_phase("cdf", t_half = 0.5, nu = 2, m = 1) # Early risk (bounded) hzr_phase("constant") # Flat background rate hzr_phase("cdf", t_half = 10, nu = 1, m = 0) # Late risk (CDF-based) hzr_phase("g3", tau = 1, gamma = 3, alpha = 1, # Late risk (G3 power law) eta = 1) ``` The `"g3"` type uses the four-parameter G3 decomposition from the original C/SAS HAZARD program, providing unbounded power-law growth for late-phase hazards. See `vignette("mf-mathematical-foundations")` for the full mathematical treatment. ## Numerical helpers The package ships a small set of numerical helpers used internally to keep the likelihood evaluations stable across the wide range of arguments an optimizer encounters. They're exported so you can reach for them in custom code or when debugging a fit. The most common one is `hzr_log1pexp(x)`, which computes $\log(1 + e^x)$ in a numerically stable way. The naive `log(1 + exp(x))` overflows once $x$ exceeds about 700 (because `exp(x)` itself overflows); the helper switches to the asymptotic $x + \log(1 + e^{-x})$ for large $x$, which is exact to floating-point precision. The package uses it inside every log-likelihood expression that involves a log-survival term, so the same evaluation can run on $x = -50$ and $x = 5000$ without producing `Inf` or `NaN`. ```{r} TemporalHazard::hzr_log1pexp(c(-2, 0, 2)) ```