---
title: "Fitting Hazard Models"
subtitle: "From intercept-only to multiphase additive hazard"
format:
  html:
    code-fold: false
    toc: true
    toc-depth: 3
    number-sections: true
vignette: >
  %\VignetteIndexEntry{Fitting Hazard Models}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| include: false
has_pkg <- requireNamespace("TemporalHazard", quietly = TRUE) &&
  requireNamespace("ggplot2", quietly = TRUE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  eval     = has_pkg
)
```

```{r}
#| label: setup
library(TemporalHazard)
library(survival)
library(ggplot2)
```

This vignette walks the core model-fitting workflow from the inside
out: intercept-only fits to establish the baseline hazard shape, then
covariates on top, then the multiphase decomposition, then
multi-endpoint analyses on the same cohort. Every example uses a
clinical dataset shipped with the package. If you haven't seen the
basics — what a parametric hazard model is, why we use the
`Surv(time, status)` formula — start with `vignette("getting-started")`
first; this vignette assumes that context.

The progression matters. Single-distribution intercept-only fits tell
you whether the baseline hazard shape is monotone (Weibull territory)
or has structure that demands a multiphase decomposition.
Multivariable fits add covariate effects on top of a shape you already
trust. Multiphase fits split the baseline shape into clinically
interpretable phases. Multi-endpoint analyses reuse all the above for
separate clinical outcomes — death, reoperation, infection — on the
same patient cohort.


## Intercept-only model: CABG survival (KU Leuven)

The `cabgkul` dataset contains 5,880 patients who underwent primary isolated
coronary artery bypass grafting at KU Leuven between 1971 and 1987.  With only
two columns --- follow-up time and death indicator --- it is the simplest
starting point.

```{r}
#| label: kul-data
data(cabgkul)
str(cabgkul)
```

Fit an intercept-only Weibull. With no covariates on the right-hand
side of the formula the model estimates only the baseline hazard
shape — the scale `mu` and exponent `nu` of a Weibull curve fit to all
5,880 patients pooled. This is the right starting point for any new
dataset: before asking which covariates matter, ask whether a single
monotone hazard even fits the population-level pattern.

```{r}
#| label: kul-weibull
fit_kul <- hazard(
  Surv(int_dead, dead) ~ 1,
  data  = cabgkul,
  dist  = "weibull",
  theta = c(mu = 0.10, nu = 1.0),
  fit   = TRUE
)

fit_kul
```

The summary tells us where the optimizer landed; the picture tells us
whether that landing point matches the data. Plot the fitted survival
curve on a fine time grid and overlay the Kaplan-Meier step function
from the raw cohort.

```{r}
#| label: fig-kul-km
#| fig-cap: "Weibull parametric survival vs. Kaplan-Meier (CABG, KU Leuven)"
#| fig-width: 7
#| fig-height: 4.5
t_grid <- seq(0.01, max(cabgkul$int_dead) * 0.9, length.out = 200)
nd     <- data.frame(time = t_grid)
surv   <- predict(fit_kul, newdata = nd, type = "survival") * 100

km     <- survfit(Surv(int_dead, dead) ~ 1, data = cabgkul)
km_df  <- data.frame(time = km$time, survival = km$surv * 100)

ggplot() +
  geom_step(data = km_df, aes(time, survival, colour = "Kaplan-Meier"),
            linewidth = 0.5) +
  geom_line(data = data.frame(time = t_grid, survival = surv),
            aes(time, survival, colour = "Weibull"), linewidth = 1) +
  scale_colour_manual(values = c("Weibull" = "#0072B2",
                                 "Kaplan-Meier" = "#D55E00")) +
  scale_y_continuous(limits = c(0, 100)) +
  labs(x = "Months after CABG", y = "Freedom from death (%)",
       colour = NULL) +
  theme_minimal() +
  theme(legend.position = "bottom")
```

A single Weibull captures the broad trend but misses the distinct early
operative risk and late attrition that the KM curve reveals.  This motivates
the multiphase approach below.


## Multivariable model: AVC repair

The `avc` dataset has 310 patients who underwent atrioventricular canal
repair, with 9 candidate covariates spanning patient demographics (age,
NYHA status), anatomical features (malalignment, orifice morphology),
intra-operative grading (`inc_surg` = surgical grade of AV valve
incompetence), and post-operative complications (`com_iv` = grade IV
complications). We drop incomplete rows so the design matrix is
rectangular, then look at the column types and ranges.

```{r}
#| label: avc-data
data(avc)
avc <- na.omit(avc)
str(avc)
```

Now we put covariates on the right-hand side of the formula and refit.
The `theta` vector grows: two Weibull shape parameters (`mu`, `nu`) plus
six covariate coefficients (`beta1`..`beta6`), each starting at zero.
The optimizer estimates a log-hazard-ratio for every covariate jointly
with the Weibull shape — so the shape and the covariate effects are
identified from the same likelihood, not sequentially.

```{r}
#| label: avc-mv
fit_avc <- hazard(
  Surv(int_dead, dead) ~ age + status + mal + com_iv + inc_surg + orifice,
  data  = avc,
  dist  = "weibull",
  theta = c(mu = 0.20, nu = 1.0, rep(0, 6)),
  fit   = TRUE,
  control = list(maxit = 500)
)

fit_avc
```

Each coefficient is a log-hazard-ratio: positive means higher risk,
negative means lower, zero means no effect. The large positive
coefficients on `mal` (anatomical malalignment) and `com_iv`
(grade IV post-operative complications) flag these as the dominant
risk markers in this cohort. The standard errors and Wald
z-statistics in the summary tell you which effects are well
identified and which are noise — a coefficient with a z-statistic
near zero contributes essentially nothing the data can defend.


## Multiphase model: additive hazard decomposition

The single-Weibull fits above gave us a curve that's mediocre
everywhere instead of right anywhere. That's a structural limitation
of a monotone parametric shape, not something more iterations will
fix. The Blackstone–Naftel–Turner framework's key idea is to split
the hazard into a *sum* of phase-specific contributions, each with its
own temporal shape and its own scale:

$$H(t \mid x) = \sum_{j=1}^{J} \mu_j(x) \cdot \Phi_j(t)$$

Each $\Phi_j(t)$ is a phase-specific unit-scaled curve (early-peaking
saturating, flat constant, late-rising polynomial) and each $\mu_j(x)$
is the phase-specific scale, possibly modulated by covariates. The
phases overlap and add — no switching, no thresholds — so the total
instantaneous hazard at any $t$ is the sum of the per-phase rates.
See `vignette("getting-started")` for the longer-form motivation; what
follows here is the practical workflow for *fitting* one.

For AVC we'll use two phases — an early phase to absorb the
operative-window mortality, and a constant phase for the background
rate. AVC patients don't have a clear late-deterioration regime over
this follow-up window, so a third (g3) phase would be unidentified.
We fix the shape parameters and estimate only the scales, matching
the workflow you'd run against a SAS HAZARD reference fit.

```{r}
#| label: avc-mp
fit_mp <- hazard(
  Surv(int_dead, dead) ~ 1,
  data   = avc,
  dist   = "multiphase",
  phases = list(
    early    = hzr_phase("cdf", t_half = 0.5, nu = 1, m = 1,
                          fixed = "shapes"),
    constant = hzr_phase("constant")
  ),
  fit     = TRUE,
  control = list(n_starts = 5, maxit = 1000)
)

summary(fit_mp)
```

The diagnostic that matters is whether the multiphase fit actually
out-performs the single Weibull against the data. Plot both parametric
curves against the same Kaplan-Meier reference so we can see, by eye,
where each model is honest and where each is reaching.

```{r}
#| label: fig-avc-compare
#| fig-cap: "Single-phase Weibull vs. multiphase model against Kaplan-Meier (AVC)"
#| fig-width: 7
#| fig-height: 4.5
t_grid <- seq(0.01, max(avc$int_dead) * 0.95, length.out = 200)
nd     <- data.frame(time = t_grid)

km_avc <- survfit(Surv(int_dead, dead) ~ 1, data = avc)
km_df  <- data.frame(time = km_avc$time, survival = km_avc$surv * 100)

fit_wb <- hazard(
  Surv(int_dead, dead) ~ 1, data = avc, dist = "weibull",
  theta = c(mu = 0.20, nu = 1.0), fit = TRUE
)

surv_wb <- predict(fit_wb, newdata = nd, type = "survival") * 100
surv_mp <- predict(fit_mp, newdata = nd, type = "survival") * 100

plot_df <- rbind(
  data.frame(time = t_grid, survival = surv_wb, Model = "Single Weibull"),
  data.frame(time = t_grid, survival = surv_mp, Model = "Multiphase (2-phase)")
)

ggplot() +
  geom_step(data = km_df, aes(time, survival), colour = "grey50",
            linewidth = 0.5) +
  geom_line(data = plot_df, aes(time, survival, colour = Model),
            linewidth = 1) +
  scale_colour_manual(values = c("Single Weibull" = "#E69F00",
                                 "Multiphase (2-phase)" = "#0072B2")) +
  scale_y_continuous(limits = c(0, 100)) +
  annotate("text", x = max(t_grid) * 0.6, y = 95, label = "KM (grey)",
           size = 3, colour = "grey50") +
  labs(x = "Months after AVC repair", y = "Freedom from death (%)",
       colour = NULL) +
  theme_minimal() +
  theme(legend.position = "bottom")
```

The multiphase model tracks the KM curve much more closely than the
single Weibull, especially across the steep early-mortality window —
which is exactly where the single Weibull was forced to compromise.
The constant phase then carries the slow post-recovery attrition. The
point isn't that multiphase always wins; it's that *when the data has
phase structure*, fitting that structure explicitly is strictly more
honest than averaging it away into one monotone curve.


## Multi-endpoint models: heart valve replacement

The `valves` dataset (1,533 patients) has multiple time-to-event endpoints ---
death, prosthetic valve endocarditis (PVE), and reoperation --- each with its
own follow-up time and event indicator.  The same `hazard()` call fits each
endpoint independently:

Start with the death endpoint. We use age at operation, NYHA class,
and mechanical-valve indicator as covariates — the clinically
canonical set for survival after valve replacement.

```{r}
#| label: valves-death
data(valves)
valves <- na.omit(valves)

fit_death <- hazard(
  Surv(int_dead, dead) ~ age_cop + nyha + mechvalv,
  data  = valves,
  dist  = "weibull",
  theta = c(mu = 0.10, nu = 1.0, rep(0, 3)),
  fit   = TRUE,
  control = list(maxit = 500)
)

fit_death
```

Switch endpoints. Same data, same package, but now we model time to
prosthetic valve endocarditis instead of death. The covariate list
shifts to match the clinical question: `nve` (native-valve endocarditis
history) replaces `nyha` because functional class is less informative
for infection risk than prior endocarditis exposure. The fit returns
its own MLE, coefficients, and standard errors completely independent
of the death model.

```{r}
#| label: valves-pve
fit_pve <- hazard(
  Surv(int_pve, pve) ~ age_cop + nve + mechvalv,
  data  = valves,
  dist  = "weibull",
  theta = c(mu = 0.02, nu = 1.0, rep(0, 3)),
  fit   = TRUE,
  control = list(maxit = 500)
)

fit_pve
```

Each endpoint gets its own model with its own covariates, but the
hazard model structure — temporal shape plus covariate effects — stays
the same whatever the clinical endpoint. Repeating the workflow for a
third endpoint (reoperation, for example) is mechanical: swap the
`Surv(...)` columns, swap the covariates, refit. The advantage over
running three separate analyses in different tools is that the
predictions, diagnostics, and uncertainty quantification all come from
the same package — there's no risk of subtle differences in censoring
handling or estimator choice between endpoints.


## Phase types reference

You've now seen each phase type in use: a `"cdf"` early phase for AVC
operative mortality, a `"constant"` phase for AVC background rate, and
the implicit single shape of every Weibull fit. The package supports
three phase types in total, summarized here for quick reference:

| Type | Description | Typical use |
|------|-------------|-------------|
| `"cdf"` | Sigmoidal CDF shape (parameterized by `t_half`, `nu`, `m`) | Early or late phases with transient risk |
| `"constant"` | Flat hazard (no temporal shape parameters) | Ongoing background risk |
| `"g3"` | Late-phase G3 parameterization (4 parameters: `tau`, `gamma`, `alpha`, `eta`) | Late-rising risk matching C/SAS G3 output |

The `"cdf"` type covers the widest range of shapes: setting `t_half`
small (e.g., 0.5) creates an early-peaking phase; setting it large
(e.g., 10) creates a late-rising phase. The `"constant"` phase needs no
shape parameters. The `"g3"` shape is the explicit late-rising parameterization
that matches the SAS HAZARD "late" library; use it when you need
parity against a C/SAS reference fit, or when the late rise has a
clear lag-then-accelerate pattern that a delayed `"cdf"` doesn't
capture cleanly. See `vignette("mf-mathematical-foundations")` for the
full mathematical treatment of each, including the parameter
identifiability constraints.