--- title: "Building treeSS inputs from raw data: Florida mortality 2016" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Building treeSS inputs from raw data: Florida mortality 2016} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## What this vignette adds The companion vignette (`vignette("introduction", package = "treeSS")`) walks through the Rio de Janeiro dataset, which is **already pre-assembled** in exactly the long format that `treespatial_scan()` expects: one row per (region, leaf), with parallel columns for cases, population, region centroids, and tree node identifiers. In practice, most users will start one step earlier: with a raw case-level table, a separately maintained polygon layer (counties, boroughs, census tracts...), and a classification that needs to be turned into a tree. This vignette goes through exactly that workflow on the **`fl_deaths`** dataset shipped with the package, mirroring the annotated example file `inst/examples/example_florida.R`. What you will see: 1. How to **build the ICD-10 tree** directly from the codes that appear in the raw data. 2. How to **assemble parallel vectors** of `(cases, population, region_id, x, y, node_id)` by joining the case table to a polygon layer fetched on demand. 3. How to run `treespatial_scan()`, `filter_clusters()`, and `sequential_scan()` on the assembled inputs and visualise the results. The dataset is the CDC WONDER Compressed Mortality File 1999–2016 restricted to Florida deaths in 2016. It covers 65 of Florida's 67 counties, 253 ICD-10 codes, and ≈157,000 deaths. --- ## 1. Load the raw data `fl_deaths` ships in **raw long form** — one row per (county_fips, icd10_code) combination — with population and an ICD-10 description column attached: ```{r fl-load} library(treeSS) data(fl_deaths) str(fl_deaths) #> 'data.frame': N obs. of 5 variables: #> $ county_fips: chr ... #> $ icd10_code : chr "C56" "C61" "I10" ... #> $ icd10_desc : chr ... #> $ deaths : int ... #> $ population : int ... c(rows = nrow(fl_deaths), counties = length(unique(fl_deaths$county_fips)), icd10_codes = length(unique(fl_deaths$icd10_code)), total_deaths = sum(fl_deaths$deaths)) ``` Note what `fl_deaths` does **not** carry: * No tree object — we have ICD-10 codes but no parent–child structure. * No spatial coordinates — we have county FIPS codes but no centroids. We need to supply both before calling the scan. --- ## 2. Build the ICD-10 tree from the codes in the data The codes in `fl_deaths` use two granularities: * **3-character codes** (e.g. `"C56"`, `"R54"`) treated as leaves at the 3-character level — the death certificate was coded to that precision and no further. * **5-character codes** with a dot (e.g. `"A04.7"`) — leaves under a 3-character internal parent. We build a three-level tree (Root → ICD-10 chapter → 3-character node → 5-character leaf) directly from the codes that *actually appear* in `fl_deaths`, so the tree contains no unused nodes: ```{r fl-tree} icd_codes <- sort(unique(fl_deaths$icd10_code)) parsed <- data.frame( code = icd_codes, three = ifelse(grepl("\\.", icd_codes), sub("\\..*", "", icd_codes), icd_codes), is_three_leaf = !grepl("\\.", icd_codes) & nchar(icd_codes) == 3, stringsAsFactors = FALSE ) # Map 3-character codes to ICD-10 chapters icd_chapter_of <- function(three) { L <- substr(three, 1, 1) num <- suppressWarnings(as.integer(substr(three, 2, 3))) switch(L, A = "I_AB", B = "I_AB", C = "II_C_D", D = if (!is.na(num) && num <= 48) "II_C_D" else "III_D", E = "IV_E", F = "V_F", G = "VI_G", H = if (!is.na(num) && num <= 59) "VII_H" else "VIII_H", I = "IX_I", J = "X_J", K = "XI_K", L = "XII_L", M = "XIII_M", N = "XIV_N", O = "XV_O", P = "XVI_P", Q = "XVII_Q", R = "XVIII_R", S = "XIX_S_T", T = "XIX_S_T", V = "XX_V_Y", W = "XX_V_Y", X = "XX_V_Y", Y = "XX_V_Y", "Other" ) } parsed$chapter <- sapply(parsed$three, icd_chapter_of) fl_tree <- unique(rbind( data.frame(node_id = "Root", parent_id = NA_character_, stringsAsFactors = FALSE), data.frame(node_id = unique(parsed$chapter), parent_id = "Root", stringsAsFactors = FALSE), unique(data.frame(node_id = parsed$three, parent_id = parsed$chapter, stringsAsFactors = FALSE)), data.frame( node_id = parsed$code[!parsed$is_three_leaf], parent_id = parsed$three[!parsed$is_three_leaf], stringsAsFactors = FALSE ) )) # Sanity check: every code in fl_deaths is a leaf of fl_tree parents <- unique(fl_tree$parent_id[!is.na(fl_tree$parent_id)]) leaves <- setdiff(fl_tree$node_id, parents) stopifnot(length(setdiff(icd_codes, leaves)) == 0) c(total_nodes = nrow(fl_tree), chapters = length(unique(parsed$chapter)), leaves = length(leaves)) ``` The result is a two-column data.frame `fl_tree` with columns `node_id` and `parent_id`. The root has `parent_id = NA`. The tree is acyclic by construction. Two notes on style: * `treeSS` accepts the tree either as this two-column data.frame via the `tree =` argument, or as parallel vectors via the `tree_node_id =` and `tree_parent_id =` arguments. Both produce identical scans. * `node_id` is a string here, but `treeSS` is type-agnostic: any R type that supports `==` and can be coerced to character will work, including integers and factors. --- ## 3. Download polygons and centroids `fl_deaths` carries county FIPS codes but not coordinates. We fetch the 2016 county boundaries from \CRANpkg{tigris}, project to WGS84, and compute centroids: ```{r fl-polygons} library(sf) library(tigris) fl_map <- counties(state = "FL", cb = TRUE, year = 2016) fl_map <- st_transform(fl_map, 4326) centroids <- st_centroid(st_geometry(fl_map)) coords <- st_coordinates(centroids) fl_map$x <- coords[, "X"] fl_map$y <- coords[, "Y"] ``` Two technical points worth stating explicitly: * **Centroids of `sf` objects.** `st_centroid()` is exact under `sf` and respects the polygon geometry; if you use `st_centroid()` after `st_transform()`, the centroid is computed in WGS84 coordinates, which is what `treespatial_scan()` expects for the small spatial extents of US counties / Brazilian municipalities / London boroughs. * **`tigris` requires network access.** A first run downloads cached shapefiles to a user-configurable cache directory; the second run is instant. --- ## 4. Assemble parallel vectors `treespatial_scan()` consumes six parallel vectors of identical length — one row per (region, leaf) observation. We join the case table to the population (which is constant per county in `fl_deaths`) and to the coordinates from `fl_map`: ```{r fl-vectors} # Per-county population: same value across all leaves, so take any one fl_pop <- aggregate(population ~ county_fips, data = fl_deaths, FUN = max) # Per-county centroids from the polygon layer xy <- data.frame(county_fips = fl_map$GEOID, x = fl_map$x, y = fl_map$y, stringsAsFactors = FALSE) dat <- merge(fl_deaths[, c("county_fips", "icd10_code", "deaths")], fl_pop, by = "county_fips") dat <- merge(dat, xy, by = "county_fips") # Integer region_id is convenient downstream; here we use the rank # of county_fips in alphabetical order dat$region_id <- as.integer(factor( dat$county_fips, levels = sort(unique(dat$county_fips)) )) # Keep only rows with at least one death (sparse-by-default scan) dat <- dat[dat$deaths > 0, ] c(long_rows = nrow(dat), counties = length(unique(dat$region_id))) ``` At this point `dat` is exactly the same long-format shape as `rj_mortality` in the companion vignette: one row per (county, ICD-10) combination, with parallel columns of cases, population, coordinates, and node identifier. --- ## 5. Run the tree-spatial scan ```{r fl-scan} result_fl <- treespatial_scan( cases = dat$deaths, population = dat$population, region_id = dat$region_id, x = dat$x, y = dat$y, node_id = dat$icd10_code, tree = fl_tree, max_pop_pct = 0.05, nsim = 999, seed = 2016, n_cores = 4L ) print(result_fl) ``` The most likely cluster typically falls on a specific ICD-10 leaf code (5-character) concentrated in one or two adjacent counties. The Florida dataset gives a different pattern from RJ because the overall mortality counts in Florida are much higher and the geography is much larger; the joint scan picks out narrow, cause-specific excess rather than broad geographic concentrations. `max_pop_pct = 0.05` is more restrictive than the default of `0.50`. Larger study areas like Florida benefit from a tighter cap because the default would otherwise allow zones covering half the state's population, which is usually not the size of cluster the analyst is looking for. --- ## 6. Distinct secondary clusters (paper-faithful) ```{r fl-filter} filter_clusters(result_fl)[1:5, c("node_id", "n_regions", "cases", "expected", "llr", "pvalue")] ``` As in the RJ analysis, `filter_clusters()` performs no additional Monte Carlo; the secondary clusters are read from `result_fl$secondary_clusters` and filtered for distinctness in $O(k^2)$ time. --- ## 7. Sequential scan when the MLC may shadow ```{r fl-seq} seq_fl <- sequential_scan( cases = dat$deaths, population = dat$population, region_id = dat$region_id, x = dat$x, y = dat$y, node_id = dat$icd10_code, tree = fl_tree, max_iter = 5L, alpha = 0.05, nsim = 999, seed = 2016, max_pop_pct = 0.05, n_cores = 4L ) print(seq_fl) ``` Each iteration of `sequential_scan()` runs a fresh Monte Carlo on the dataset with the previous iteration's cluster regions removed, so the $p$-value of each detected cluster is computed against the correct conditional null distribution. The procedure stops at the first non-significant iteration or `max_iter`, whichever comes first. --- ## 8. Visualise the clusters `get_cluster_regions()` returns a tidy data.frame that can be joined to `fl_map` for plotting: ```{r fl-plot} library(ggplot2) library(dplyr) region_info <- unique(dat[, c("region_id", "county_fips")]) cr_seq <- merge( get_cluster_regions(seq_fl, overlap = TRUE), region_info, by = "region_id" ) # Cross-join the polygon layer with every iteration panel so all # counties appear in every panel (those outside the analysis # dataset show as NA-coloured rather than as an extra empty panel) panels <- unique(cr_seq$panel) cr_seq_keys <- unique(cr_seq[, c("county_fips", "cluster", "node_id", "llr", "pvalue", "panel")]) fl_seq_plot <- merge( do.call(rbind, lapply(panels, function(p) cbind(fl_map, panel = p))), cr_seq_keys, by.x = c("GEOID", "panel"), by.y = c("county_fips", "panel"), all.x = TRUE ) ggplot(fl_seq_plot) + geom_sf(aes(fill = factor(cluster)), color = "gray40", linewidth = 0.12, alpha = 0.75) + scale_fill_manual( values = c("#C44E52", "#4C72B0", "#55A868", "#8172B2", "#CCB974"), na.value = "gray95", na.translate = FALSE ) + facet_wrap(~ panel, nrow = 1) + theme_minimal() + theme(legend.position = "none") ``` The cross-join trick on `panels` is what lets every county appear in every iteration panel; without it, a simple `merge(fl_map, cr_seq, all.x = TRUE)` would leave one extra "NA" facet for the two Florida counties not present in the mortality dataset. The complete script with full plot annotations, file output and summary tables is at `inst/examples/example_florida.R`. --- ## Recap The Florida workflow demonstrates the three preprocessing steps that the RJ vignette skips because `rj_mortality` arrives pre-assembled: 1. Build the tree from the codes in the data (no hand-coded hierarchy). 2. Pull polygons and centroids on demand from a spatial source (\CRANpkg{tigris} for US, \CRANpkg{geobr} for Brazil, etc.). 3. Join everything into parallel long-form vectors. After step 3, the rest of the analysis is identical to the RJ vignette: a single `treespatial_scan()` call, optional `filter_clusters()` or `sequential_scan()` post-processing, and visualisation via `get_cluster_regions()`. --- ## References Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. *Environmental and Ecological Statistics*, 32, 953–978. [doi:10.1007/s10651-025-00670-w](https://doi.org/10.1007/s10651-025-00670-w) Kulldorff, M. (1997). A spatial scan statistic. *Communications in Statistics — Theory and Methods*, 26(6), 1481–1496. Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. *Journal of Probability and Statistics*, 2010, 642379. Centers for Disease Control and Prevention. *CDC WONDER Compressed Mortality File 1999–2016*. National Center for Health Statistics.