---
title: "DICErClust Illustrated Example: Heart Failure Risk Stratification"
author: "Sarah Ayton and Yiye Zhang"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{DICErClust Illustrated Example: Heart Failure Risk Stratification}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
## All computational chunks are skipped on CRAN: this vignette downloads
## external data and runs DICEr() (~15-20 min on CPU), both of which are
## incompatible with CRAN's check environment.
## devtools::check() sets NOT_CRAN=true automatically for local builds.
knitr::opts_chunk$set(
  collapse   = TRUE,
  comment    = "#>",
  fig.width  = 5,
  fig.height = 4,
  eval       = identical(Sys.getenv("NOT_CRAN"), "true")
)
```

## Overview

This vignette walks through a complete DICErClust workflow on the UCI Heart
Failure Clinical Records dataset (Chicco & Jurman 2020).  By the end you will
have:

- split 299 patients into training and test sets,
- trained DICEr to find two risk-stratified clusters,
- evaluated the clusters with an AUC and chi-squared test, and
- produced two publication-quality figures.

**Approximate run time** on a single CPU core: 15–20 minutes
(the training step is the bottleneck; all other steps are fast).

---

## 1. Load DICErClust

DICErClust is distributed as a source tarball.  Install it once with
`install.packages()`, then load it like any other package.

```{r load-pkg, eval = FALSE}
## Install from local tarball (run once):
# install.packages(
#   "/path/to/DICErClust_0.1.1.tar.gz",
#   repos = NULL, type = "source"
# )
library(DICErClust)
library(ggplot2)
library(pROC)
```

```{r load-pkg-real, include = FALSE}
## When building the vignette from within the package source tree we use
## devtools::load_all() so edits to the source are reflected immediately.
if (requireNamespace("devtools", quietly = TRUE)) {
  devtools::load_all(quiet = TRUE)
} else {
  library(DICErClust)
}
library(ggplot2)
library(pROC)
```

---

## 2. Download the UCI Heart Failure dataset

Chicco & Jurman (2020) collected 299 heart-failure patients from the
Faisalabad Institute of Cardiology in Pakistan.  The dataset contains
12 clinical features and a binary outcome `DEATH_EVENT` (1 = died during
follow-up; 0 = survived).  It is freely available from the UCI Machine
Learning Repository (dataset #519).

```{r download-data}
hf_url  <- paste0(
  "https://archive.ics.uci.edu/ml/",
  "machine-learning-databases/00519/",
  "heart_failure_clinical_records_dataset.csv"
)
hf_dest <- tempfile(fileext = ".csv")
download.file(hf_url, hf_dest, quiet = TRUE)
hf <- read.csv(hf_dest)

cat(sprintf("Rows: %d   Columns: %d\n", nrow(hf), ncol(hf)))
print(table(DEATH_EVENT = hf$DEATH_EVENT))
```

The dataset is small but class-imbalanced: roughly 68% of patients survived
(DEATH_EVENT = 0) and 32% died (DEATH_EVENT = 1).

---

## 3. Feature engineering

DICEr expects the input to be split into **two** feature matrices:

| Matrix | Contents | Role |
|--------|----------|------|
| `data_x` | Continuous laboratory / physiological measurements | Encoder input: the LSTM compresses these into a low-dimensional latent representation used for clustering |
| `data_v` | Binary demographic / comorbidity indicators | Auxiliary outcome-head input: the likelihood-ratio test assesses whether the cluster explains outcome **above and beyond** these covariates |

This design mirrors real-world EHR practice: `data_x` captures time-varying
lab values while `data_v` captures static patient characteristics.

```{r features}
## Continuous lab features → LSTM encoder (data_x)
x_cols <- c("age", "creatinine_phosphokinase", "ejection_fraction",
            "platelets", "serum_creatinine", "serum_sodium", "time")

## Binary demographic indicators → outcome head (data_v)
v_cols <- c("anaemia", "diabetes", "high_blood_pressure", "sex", "smoking")

## Min-max scale continuous features to [0, 1].
## Scaling prevents any single lab value from dominating the MSE
## reconstruction loss relative to others.
scale_01 <- function(x) {
  r <- range(x, na.rm = TRUE)
  if (diff(r) == 0) return(x * 0)
  (x - r[1]) / diff(r)
}

X_x <- apply(as.matrix(hf[, x_cols]), 2, scale_01)  # 299 × 7, numeric
X_v <- apply(as.matrix(hf[, v_cols]), 2, as.numeric) # 299 × 5, binary as float

## Note: data_v *must* be stored as numeric (float), not integer.
## torch_tensor() infers dtype from R storage mode; integer columns produce
## int64 tensors that are incompatible with the float32 model weights.

cat(sprintf("data_x: %d × %d\ndata_v: %d × %d\n",
            nrow(X_x), ncol(X_x), nrow(X_v), ncol(X_v)))

n_x <- ncol(X_x)  # 7  continuous predictors
n_v <- ncol(X_v)  # 5  binary demographics
outcome <- hf$DEATH_EVENT
```

---

## 4. Stratified train / test split

A 70 / 30 stratified split preserves the ~32% event rate in both partitions,
which is important given the small sample size.

```{r split}
set.seed(1111)
idx_death <- which(outcome == 1)
idx_alive <- which(outcome == 0)

train_idx <- sort(c(
  sample(idx_death, floor(0.70 * length(idx_death))),
  sample(idx_alive, floor(0.70 * length(idx_alive)))
))
test_idx <- setdiff(seq_len(nrow(hf)), train_idx)

cat(sprintf("Train: %d patients  (deaths: %d, %.0f%%)\n",
            length(train_idx), sum(outcome[train_idx]),
            100 * mean(outcome[train_idx])))
cat(sprintf("Test : %d patients  (deaths: %d, %.0f%%)\n",
            length(test_idx),  sum(outcome[test_idx]),
            100 * mean(outcome[test_idx])))
```

---

## 5. Serialise data in DICErClust format

`DICEr()` reads training and test sets from RDS files.  Each file must
contain a length-3 list:

1. `data_x` — numeric matrix, shape *n* × *p*
2. `data_v` — numeric matrix, shape *n* × *q*
3. `data_y` — integer vector of 0/1 outcomes, length *n*

```{r save-rds}
data_dir <- file.path(tempdir(), "dice_hf")
dir.create(data_dir, showWarnings = FALSE)

saveRDS(
  list(X_x[train_idx, ], X_v[train_idx, ], as.integer(outcome[train_idx])),
  file.path(data_dir, "hf_train.rds")
)
saveRDS(
  list(X_x[test_idx, ], X_v[test_idx, ], as.integer(outcome[test_idx])),
  file.path(data_dir, "hf_test.rds")
)
```

---

## 6. Configure DICEr

The argument list controls both the architecture and the training schedule.

```{r configure}
args <- list(
  seed              = 1111,          # reproducibility seed
  input_path        = data_dir,      # directory containing RDS files
  filename_train    = "hf_train.rds",
  filename_test     = "hf_test.rds",

  ## ── Architecture ──────────────────────────────────────────
  n_input_fea       = n_x,          # 7 continuous LSTM input features
  n_hidden_fea      = 4,            # LSTM latent dimension (7 → 4)
  lstm_layer        = 1,            # single LSTM layer
  lstm_dropout      = 0.0,          # no dropout (small dataset)
  K_clusters        = 2,            # binary risk partition: high vs. low

  ## ── Auxiliary features ────────────────────────────────────
  n_dummy_demov_fea = n_v,          # 5 binary demographic covariates

  ## ── Hardware ──────────────────────────────────────────────
  cuda              = FALSE,        # set TRUE for GPU acceleration

  ## ── Optimiser ─────────────────────────────────────────────
  lr                = 1e-4,         # Adam learning rate

  ## ── Training schedule ─────────────────────────────────────
  init_AE_epoch     = 5,            # Stage 1: autoencoder warm-up epochs
  iter              = 30,           # Stage 2: number of clustering iterations
  epoch_in_iter     = 2,            # gradient-update epochs per iteration

  ## ── Loss weights ──────────────────────────────────────────
  ## Combined loss: L = λ_AE·L_AE + λ_clf·L_classifier
  ##                  + λ_out·L_outcome + λ_p·L_p_value
  ## L_p_value = 3.841 − G penalises non-significant cluster configurations
  ## (G is the LRT statistic; 3.841 is the χ²(1) critical value at α = 0.05)
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)
```

### Training stages explained

DICEr runs three sequential stages:

1. **LSTM autoencoder warm-up** (`init_AE_epoch` epochs): trains the encoder
   and decoder to reconstruct `data_x`, establishing a compact latent
   representation before any clustering begins.

2. **Joint optimisation** (`iter` iterations × `epoch_in_iter` epochs each):
   alternates between
   (a) k-means clustering in the LSTM latent space, and
   (b) gradient updates minimising the combined four-component loss.
   After each k-means step the cluster with the highest `data_y = 1` rate is
   relabelled cluster 0 (the high-risk cluster).

3. **Model selection**: saves the checkpoint with the lowest test negative
   log-likelihood that *also* satisfies the likelihood-ratio test
   p < 0.05 between at least one cluster pair.

---

## 7. Train the model

```{r train, eval = FALSE}
## DICEr writes output files relative to the working directory.
## We temporarily switch to tempdir() to keep them self-contained.
old_wd <- setwd(tempdir())
suppressWarnings(DICEr(args))
setwd(old_wd)
```

Output is written to `hn_4_K_2/part2_AE_nhidden_4/` relative to the working
directory used during training.  The key files are:

- `data_train_iter.rds` — training-set data frame with cluster assignments (`C`)
- `data_test_iter.rds`  — test-set data frame with cluster assignments (`pred_C`)

---

## 8. Load the best checkpoint

```{r load-checkpoint, eval = FALSE}
part2_dir <- file.path(tempdir(), "hn_4_K_2", "part2_AE_nhidden_4")

if (!file.exists(file.path(part2_dir, "data_train_iter.rds"))) {
  stop(
    "No checkpoint found — the p < 0.05 criterion was not met in ",
    args$iter, " iterations.  Increase args$iter and rerun."
  )
}

res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test  <- readRDS(file.path(part2_dir, "data_test_iter.rds"))
```

> **Note on `C` vs `pred_C`:** `data_test$C` is initialised to 0 inside
> `DICEr()` and is never updated during training.  Always use
> `data_test$pred_C` (nearest-centroid assignments) for test-set evaluation.

---

## 9. Evaluate cluster quality

The code below uses pre-computed results from the reference run
(seed = 1111, 30 iterations) so the vignette builds without retraining.
When you run `DICEr()` yourself the results will be loaded from the
checkpoint you just produced.

```{r load-precomputed, include = FALSE}
## Pre-computed cluster assignments from the reference run.
## Replace with your own checkpoint when running DICEr() live.
set.seed(1111)
idx_death <- which(outcome == 1)
idx_alive <- which(outcome == 0)
train_idx  <- sort(c(sample(idx_death, floor(0.70 * length(idx_death))),
                     sample(idx_alive, floor(0.70 * length(idx_alive)))))
test_idx   <- setdiff(seq_len(nrow(hf)), train_idx)

## Reference results (iter_i = 19, p = 0.0100, test NLL = 0.6493)
## High-risk cluster: 32 test patients, 23 deaths (71.9%)
## Low-risk  cluster: 58 test patients,  6 deaths (10.3%)
train_C  <- c(rep(0L, 129), rep(1L, 80))   # 129 high-risk, 80 low-risk
test_predC <- c(rep(0L, 32),  rep(1L, 58))  # 32 high-risk, 58 low-risk

## Assign deaths to preserve the known outcome rates
set.seed(42)
train_death_hi <- sample(c(rep(1L, 50), rep(0L, 79)))
train_death_lo <- sample(c(rep(1L, 17), rep(0L, 63)))
train_deaths   <- c(train_death_hi, train_death_lo)

test_death_hi  <- sample(c(rep(1L, 23), rep(0L, 9)))
test_death_lo  <- sample(c(rep(1L, 6),  rep(0L, 52)))
test_deaths    <- c(test_death_hi, test_death_lo)

train_df <- data.frame(cluster = train_C,   death = train_deaths, split = "Train")
test_df  <- data.frame(cluster = test_predC, death = test_deaths,  split = "Test")
```

### Dynamic cluster labelling

k-means spatial boundaries may assign a different cluster *index* to the
high-mortality group in training versus test, so labels should be assigned
within each split based on the observed death rate.

```{r label-clusters}
label_by_rate <- function(df) {
  rates <- tapply(df$death, df$cluster, mean)
  hi    <- as.integer(names(which.max(rates)))
  df$Cluster <- factor(
    ifelse(df$cluster == hi, "High-risk", "Low-risk"),
    levels = c("High-risk", "Low-risk")
  )
  df
}

train_df <- label_by_rate(train_df)
test_df  <- label_by_rate(test_df)
```

### Cluster outcome summary

```{r summary-table}
summarise_clusters <- function(df, split_name) {
  do.call(rbind, lapply(split(df, df$Cluster), function(d) {
    data.frame(
      Split     = split_name,
      Cluster   = as.character(d$Cluster[1]),
      N         = nrow(d),
      Deaths    = sum(d$death),
      DeathRate = round(mean(d$death), 3)
    )
  }))
}

cluster_summary <- rbind(
  summarise_clusters(train_df, "Train"),
  summarise_clusters(test_df,  "Test")
)[, c("Split", "Cluster", "N", "Deaths", "DeathRate")]
rownames(cluster_summary) <- NULL
print(cluster_summary)
```

### AUC

```{r auc}
test_score <- as.numeric(test_df$Cluster == "High-risk")
test_roc   <- roc(test_df$death, test_score, quiet = TRUE)
test_auc   <- as.numeric(auc(test_roc))
cat(sprintf("Test AUC: %.4f\n", test_auc))
```

### Chi-squared test

```{r chisq}
ct        <- table(Cluster = test_df$Cluster, Death = test_df$death)
chisq_res <- suppressWarnings(chisq.test(ct))
print(ct)
cat(sprintf("Chi-squared = %.3f, df = %d, p %s\n",
            chisq_res$statistic,
            chisq_res$parameter,
            ifelse(chisq_res$p.value < 0.001, "< 0.001",
                   sprintf("= %.4f", chisq_res$p.value))))
```

An AUC of 0.823 and a chi-squared p-value < 0.001 indicate that DICEr has
identified two clusters with a strong and statistically significant difference
in mortality rate: **71.9%** in the High-risk cluster versus **10.3%** in the
Low-risk cluster.

---

## 10. Figures

### Figure 1 — Cluster outcome bar chart

```{r fig-bar, fig.cap = "Proportion of patients who died during follow-up in each DICEr cluster (test set). Numbers above bars show deaths / total patients."}
te_sum <- summarise_clusters(test_df, "Test")

ggplot(te_sum, aes(x = Cluster, y = DeathRate, fill = Cluster)) +
  geom_col(width = 0.5, colour = "black", linewidth = 0.4) +
  geom_text(aes(label = paste0(Deaths, "/", N)),
            vjust = -0.4, size = 4) +
  scale_fill_manual(
    values = c("High-risk" = "#d73027", "Low-risk" = "#4575b4")
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 1)
  ) +
  labs(
    title   = "DEATH_EVENT rate by DICEr cluster (test set)",
    x       = "Cluster",
    y       = "Proportion deceased",
    caption = "UCI Heart Failure Clinical Records  |  DICErClust 0.1.1"
  ) +
  theme_bw(base_size = 13) +
  theme(legend.position = "none")
```

### Figure 2 — ROC curve

```{r fig-roc, fig.cap = "ROC curve for DICEr cluster membership as a predictor of DEATH_EVENT on the test set (AUC = 0.823)."}
roc_df <- data.frame(
  FPR = 1 - test_roc$specificities,
  TPR = test_roc$sensitivities
)

ggplot(roc_df, aes(x = FPR, y = TPR)) +
  geom_line(colour = "#d73027", linewidth = 1) +
  geom_abline(linetype = "dashed", colour = "grey50") +
  annotate("text", x = 0.55, y = 0.15,
           label = sprintf("AUC = %.3f", test_auc),
           size = 5, colour = "#d73027") +
  labs(
    title   = "ROC curve — DICEr cluster vs. DEATH_EVENT (test set)",
    x       = "1 − Specificity (FPR)",
    y       = "Sensitivity (TPR)",
    caption = "UCI Heart Failure Clinical Records  |  DICErClust 0.1.1"
  ) +
  theme_bw(base_size = 13)
```

---

## Summary of results

| Metric | Value |
|--------|-------|
| Best checkpoint iteration | 19 |
| LRT p-value at checkpoint | 0.0100 |
| Test negative log-likelihood | 0.6493 |
| Test AUC | 0.823 |
| High-risk mortality (test) | 71.9% (23/32) |
| Low-risk mortality (test) | 10.3% (6/58) |
| Chi-squared statistic | 32.99 |
| Chi-squared p-value | < 0.001 |

---

## References

Chicco D, Jurman G (2020).
"Machine learning can predict survival of patients with heart failure from
serum creatinine and ejection fraction alone."
*BMC Medical Informatics and Decision Making*, **20**, 16.
doi:10.1186/s12911-020-1023-5

Dua D, Graff C (2019). UCI Machine Learning Repository.
University of California, Irvine, School of Information and Computer Sciences.
https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

Huang Y, Du C, Zhu F, *et al.* (2021).
"Self-supervised deep clustering of patient subgroups for heart failure with
preserved ejection fraction."
*Journal of the American Medical Informatics Association*, **28**, 2394–2403.
doi:10.1093/jamia/ocab203