---
title: "Inference & Diagnostics"
subtitle: "Exploratory screening, bootstrap confidence intervals, and sensitivity analysis"
format:
  html:
    code-fold: false
    toc: true
    toc-depth: 3
    number-sections: true
vignette: >
  %\VignetteIndexEntry{Inference & Diagnostics}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| include: false
has_pkg <- requireNamespace("TemporalHazard", quietly = TRUE) &&
  requireNamespace("ggplot2", quietly = TRUE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  eval     = has_pkg
)
```

```{r}
#| label: setup
library(TemporalHazard)
library(survival)
library(ggplot2)
```

Fitting a hazard model is the middle of the workflow, not the end.
Before you fit, you screen the covariates so you know what
transformations the data actually supports. After you fit, you
quantify uncertainty (bootstrap CIs, Wald intervals), check whether
the model is honest to the data (goodness-of-fit overlays,
decile-of-risk calibration), and probe its behavior under different
covariate scenarios (sensitivity analysis). This vignette walks the
diagnostic and inferential tools the package ships for each of those
steps.

These pieces correspond directly to steps in the classic SAS HAZARD
workflow — the `lg.*` logistic-screening macros, the `bs.*` bootstrap
macros, the `hs.*` hazard-sensitivity macros — re-implemented as
first-class R functions (`hzr_calibrate()`, `hzr_kaplan()`,
`hzr_nelson()`, `hzr_bootstrap()`, `hzr_gof()`, `hzr_deciles()`). If
you've worked through a SAS HAZARD analysis you already know the
shape of what follows; if not, treat this vignette as the standard
diagnostic pass you'd run on any fitted hazard model before trusting
its conclusions.


## Exploratory covariate screening

Before fitting the parametric hazard model, screen each candidate
covariate with a simple logistic regression against the event
indicator. This is much cheaper than fitting the full hazard model
and answers two distinct questions: which covariates carry signal at
all (rough significance triage), and what *functional form* each
covariate enters with — linear, log, polynomial, or something
non-monotone that needs binning. The screening step is what stops you
from blindly throwing every column at `hazard()` and then trying to
debug a model that has shape mis-specification baked in.

```{r}
#| label: explore-data
data(avc)
avc <- na.omit(avc)

# Candidate covariates
candidates <- c("age", "status", "mal", "com_iv", "inc_surg", "orifice")
```

Fit a univariable logistic regression for each candidate against the
death indicator. The p-values that come out are not the final word on
whether a covariate matters — they ignore the joint structure of the
hazard model and they treat the event time as a binary outcome — but
they do tell you which covariates have no chance of mattering, and
which deserve closer inspection.

```{r}
#| label: explore-logit
logit_fits <- lapply(candidates, function(var) {
  fmla <- as.formula(paste("dead ~", var))
  glm(fmla, data = avc, family = binomial())
})
names(logit_fits) <- candidates

# Which covariates are significant univariably?
p_values <- vapply(logit_fits, function(f) {
  coef(summary(f))[2, "Pr(>|z|)"]
}, numeric(1))

data.frame(covariate = names(p_values),
           p_value   = round(p_values, 4),
           row.names = NULL)
```

```{r}
#| label: fig-explore-age
#| fig-cap: "Exploratory: age vs. mortality with LOESS smooth"
#| fig-width: 7
#| fig-height: 4
ggplot(avc, aes(age, dead)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.3, size = 1) +
  geom_smooth(method = "loess", se = TRUE, colour = "#0072B2",
              linewidth = 0.8) +
  labs(x = "Age at operation (months)", y = "Death (0/1)") +
  theme_minimal()
```

The LOESS smooth reveals the *shape* of the covariate-mortality
relationship in a non-parametric way. A roughly linear smooth means
the covariate enters the hazard model on its natural scale; a U or
inverted-U shape means a polynomial or a non-monotone transform; a
plateau-then-rise means a threshold effect that may need binning.
This is the call you make *before* writing the formula for
`hazard()`, not after staring at a converged-but-misspecified model.


### Quantile calibration with `hzr_calibrate()`

`hzr_calibrate()` bins a continuous covariate into quantile groups and
applies the logit / Gompertz / Cox link transform, so you can eyeball
whether the covariate-outcome relationship is linear on the chosen link
scale.  It corresponds to the SAS `logit.sas` / `logitgr.sas` macros.

```{r}
#| label: calibrate-age
age_cal <- hzr_calibrate(
  x      = avc$age,
  event  = avc$dead,
  groups = 10,
  link   = "logit"
)
print(age_cal)
```

A roughly linear trend across groups means the logit link is
appropriate; strong curvature flags the need for a transform (for
example, polynomial or log).

## Nonparametric baselines

A parametric fit has to be compared against *something*. The natural
something is the nonparametric estimate of the same quantity from the
same data — Kaplan-Meier for survival, Nelson-Aalen for cumulative
hazard. The package's `hzr_kaplan()` and `hzr_nelson()` are
convenience wrappers that return richer output than `survival::survfit()`:
`hzr_kaplan()` ships KM with logit-transformed confidence limits
(more accurate in the tails than the standard Greenwood limits, which
can stray outside $[0, 1]$), interval hazard rates, density estimates,
and a restricted-mean-survival-time scalar. `hzr_nelson()` returns the
Wayne Nelson cumulative hazard estimator with log-normal confidence
limits — the right reference when you care about the integrated
intensity rather than the survival probability.

```{r}
#| label: km-baseline
km <- hzr_kaplan(time = avc$int_dead, status = avc$dead, conf_level = 0.95)
print(km)
```

```{r}
#| label: nelson-baseline
nel <- hzr_nelson(time = avc$int_dead, event = avc$dead, conf_level = 0.95)
print(nel)
```


## Bootstrap confidence intervals

The delta-method CIs `predict(se.fit = TRUE)` returns are fast and
asymptotically correct, but they lean on the Hessian-derived
parameter vcov — which can mislead when the likelihood surface is
poorly behaved near the MLE (boundary fits, near-degenerate
parameter combinations, small samples). Bootstrap resampling is the
robust alternative: refit the model on each resampled dataset,
collect the resulting parameter and prediction values, and read the
empirical 2.5th / 97.5th percentiles as the 95% CI. It's
computationally heavier but it doesn't assume the likelihood is
locally quadratic, and it picks up the right tail behavior even
when delta-method theory creaks.

`hzr_bootstrap()` wraps the resample-and-refit loop and corresponds to
the SAS `bootstrap.hazard.sas` and `bootstrap.summary.sas` macros. We
fit a base model first; bootstrap replicates will resample from the
same data and refit using these starting values.

```{r}
#| label: fit-base
# Fit the base model that bootstrapping will use
fit <- hazard(
  Surv(int_dead, dead) ~ age + status + mal + com_iv,
  data  = avc,
  dist  = "weibull",
  theta = c(mu = 0.20, nu = 1.0, rep(0, 4)),
  fit   = TRUE,
  control = list(maxit = 500)
)

# Prediction grid at a median-risk profile
t_grid <- seq(0.01, max(avc$int_dead) * 0.9, length.out = 100)
base_nd <- data.frame(time = t_grid, age = median(avc$age),
                      status = 2, mal = 0, com_iv = 0)

surv_point <- predict(fit, newdata = base_nd, type = "survival")
```

`hzr_bootstrap()` returns a long data frame of per-replicate
coefficient estimates (`$replicates`) and a parameter-level summary
(mean, SD, 95% percentile CI) in `$summary`. We use `n_boot = 30`
here purely so the vignette renders in reasonable time — for a real
analysis you'd want 200 or more replicates. The `set.seed(42)` call
before `hzr_bootstrap()` makes the resample sequence reproducible.

```{r}
#| label: bootstrap
set.seed(42)
boot <- hzr_bootstrap(fit, n_boot = 30)  # kept small for vignette build time
print(boot)
```

The parameter-level CIs above come directly from the bootstrap
distribution.  Prediction-level CIs (on the survival curve or hazard
profile) require pivoting `boot$replicates` to wide format and
evaluating the parametric survival formula once per replicate.  That
workflow is out of scope for this section, but the building blocks are
all in the returned object.


## Goodness-of-fit overlay

The Kaplan-Meier overlay in the prediction-visualization vignette is
the *visual* goodness-of-fit check. `hzr_gof()` is the *quantitative*
complement: it tabulates parametric predictions side by side with the
nonparametric KM estimate at each event time, and computes the
Conservation-of-Events observed-vs-expected ratio across the whole
follow-up window. The ratio compresses everything the model is doing
into a single number, which makes it the right summary statistic to
report alongside coefficient tables. (`hzr_gof()` corresponds to the
SAS `hazplot.sas` macro.)

```{r}
#| label: gof-summary
gof <- hzr_gof(fit)
print(gof)
```

The observed vs. expected ratio (printed at the bottom) is the
Conservation-of-Events check: a well-specified model recovers the event
count exactly, so a ratio close to 1 is what we want.  Ratios well
above or below 1 flag either a misspecified shape or a covariate that
isn't really linear on the log-hazard scale.


## Decile-of-risk calibration

The conservation-of-events ratio above tells you whether the model is
calibrated *on average*. That's necessary but not sufficient — a
model can hit the right total event count while systematically
over-predicting risk for one part of the cohort and under-predicting
for another, with the errors cancelling in aggregate.
`hzr_deciles()` is the check that catches that failure mode: it
partitions patients into deciles of predicted risk at a chosen time
point and compares observed with expected event counts within each
decile, returning a chi-square goodness-of-fit test. A
well-calibrated model has observed-vs-expected close in every decile;
a model that's globally calibrated but locally biased will show large
residuals in the extreme deciles. (Corresponds to the SAS
`deciles.hazard.sas` macro.)

```{r}
#| label: deciles
deciles <- hzr_deciles(fit, time = 120, groups = 10)
print(deciles)
```

The chi-square test statistic and p-value summarise whether the
observed decile-by-decile event counts are consistent with what the
parametric model predicts.  A good visual check is to plot the two
columns side by side:

```{r}
#| label: fig-deciles
#| fig-cap: "Observed vs. expected mortality by decile of risk"
#| fig-width: 7
#| fig-height: 4.5
decile_df <- as.data.frame(deciles)
cal_long <- rbind(
  data.frame(group = decile_df$group,
             rate  = decile_df$observed_rate,
             Series = "Observed"),
  data.frame(group = decile_df$group,
             rate  = decile_df$expected_rate,
             Series = "Expected")
)
ggplot(cal_long, aes(group, rate, fill = Series)) +
  geom_col(position = position_dodge(width = 0.7), alpha = 0.8) +
  scale_fill_manual(values = c("Observed" = "#56B4E9",
                                "Expected" = "#E69F00")) +
  scale_x_continuous(breaks = seq_len(nrow(decile_df))) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x = "Risk decile (1 = lowest)", y = "Mortality rate", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "bottom")
```


## Sensitivity analysis

Coefficient estimates tell you the *direction and magnitude* of a
covariate's effect on the hazard scale, but they don't directly answer
the clinical question of *what survival difference this implies* for a
real patient. Sensitivity analysis closes that gap. You define two
or more covariate profiles — typically a reference profile and one or
more high-risk profiles — score each through `predict()`, and plot
the resulting survival curves on shared axes. The horizontal and
vertical gaps between curves translate the model's coefficients into
the clinically meaningful currency of survival probability differences
and median-survival shifts. It's also a stress test: if your high-risk
profile produces only a small survival gap, either the risk factors
aren't doing much in this cohort, or the model is missing the
mechanism that makes them matter.

```{r}
#| label: sensitivity
# Define reference and high-risk profiles
ref_profile <- data.frame(
  age    = median(avc$age),
  status = 2,
  mal    = 0,
  com_iv = 0
)

high_risk <- data.frame(
  age    = quantile(avc$age, 0.90),
  status = 4,
  mal    = 1,
  com_iv = 1
)
```

```{r}
#| label: fig-sensitivity
#| fig-cap: "Sensitivity analysis: reference vs. high-risk profile"
#| fig-width: 7
#| fig-height: 4.5
sens_curves <- do.call(rbind, lapply(
  list("Reference" = ref_profile, "High risk" = high_risk),
  function(p) {
    nd <- p[rep(1, length(t_grid)), ]
    nd$time <- t_grid
    data.frame(time = t_grid,
               survival = predict(fit, newdata = nd, type = "survival") * 100,
               Profile = deparse(substitute(p)))
  }
))
# Fix profile labels
sens_curves$Profile <- rep(c("Reference", "High risk"),
                           each = length(t_grid))
sens_curves$Profile <- factor(sens_curves$Profile,
                              levels = c("Reference", "High risk"))

ggplot(sens_curves, aes(time, survival, colour = Profile)) +
  geom_line(linewidth = 0.9) +
  scale_colour_manual(values = c("Reference" = "#009E73",
                                 "High risk" = "#D55E00")) +
  scale_y_continuous(limits = c(0, 100)) +
  labs(x = "Months after AVC repair", y = "Freedom from death (%)",
       colour = NULL) +
  theme_minimal() +
  theme(legend.position = "bottom")
```

The gap between the curves quantifies the clinical impact of the risk factors.
The sensitivity analysis is most informative when combined with bootstrap
confidence intervals on each curve, showing whether the difference is
statistically meaningful.


## Analysis workflow summary

The pieces in this vignette aren't independent diagnostics — they
chain together into a single analytical workflow that goes from raw
covariates to a defensible, uncertainty-quantified, well-calibrated
fitted model. The complete sequence, with the SAS HAZARD
correspondences each piece descends from:

1. **Exploratory screening** (`glm`, `hzr_calibrate()`) --- identify
   covariate transformations and functional forms
2. **Nonparametric baselines** (`hzr_kaplan()`, `hzr_nelson()`) ---
   reference KM / Nelson cumulative hazard estimators
3. **Fit hazard model** (`hazard()`) --- parametric shape + covariates
4. **Predict & visualize** (`predict()`) --- survival, hazard, risk profiles
5. **Goodness-of-fit overlay** (`hzr_gof()`) --- parametric vs. KM +
   Conservation-of-Events check
6. **Bootstrap CIs** (`hzr_bootstrap()`) --- uncertainty quantification
   via resampling
7. **Sensitivity analysis** (`predict()` on covariate profiles) ---
   compare scenarios across risk factors
8. **Decile calibration** (`hzr_deciles()`) --- chi-square observed
   vs. expected by risk decile

See `vignette("getting-started")` for the minimal workflow and
`vignette("fitting-hazard-models")` for the model-fitting details.