The treeSS package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff’s (1997) circular spatial scan by searching jointly over circular spatial zones \(z\) and tree branches \(g\), returning the (zone, branch) pair that maximises the likelihood ratio.
The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a location and a categorical hierarchy (ICD-10 chapter \(\to\) subchapter \(\to\) leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause-area combinations that pure spatial or pure tree scans would miss.
The package provides:
| Function | Purpose |
|---|---|
treespatial_scan() |
Tree-spatial scan (main entry point) |
circular_scan() |
Kulldorff’s circular spatial scan (tree-free) |
tree_scan() |
Tree-based scan (space-free) |
iterative_scan() |
Sequential conditional scan with Holm–Bonferroni correction |
filter_clusters() |
Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 |
get_cluster_regions() |
Tidy region-by-cluster table for plotting |
aggregate_tree() |
Roll counts up the tree |
All scans accept the Poisson and Binomial models of the original
paper and run the Monte Carlo step under OpenMP via the
n_cores argument.
The package ships four example datasets:
| Dataset | Country | Domain | Regions | Tree |
|---|---|---|---|---|
rj_mortality + rj_tree |
Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) |
chicago_crimes +
chicago_tree |
USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) |
fl_deaths |
USA | Public health | 65 counties | (built by user from raw data) |
london_collisions +
london_tree |
UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) |
This vignette walks through rj_mortality (reproducing
the published analysis) and chicago_crimes (a
different-domain illustration), then shows how to use
iterative_scan() for sequential cluster detection. The
Florida and London examples are shipped in inst/examples/
and discussed in the companion paper.
This reproduces Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format.
library(treeSS)
data(rj_mortality)
data(rj_tree)
str(rj_mortality, max = 1)
#> 'data.frame': 1440 obs. of 9 variables: region_id, ibge_code,
#> name, live_births, x, y, node_id, cases, ...result_rj <- treespatial_scan(
cases = rj_mortality$cases,
population = rj_mortality$live_births,
region_id = rj_mortality$region_id,
x = rj_mortality$x,
y = rj_mortality$y,
node_id = rj_mortality$node_id,
tree = rj_tree,
max_pop_pct = 0.50,
nsim = 999, seed = 2016,
n_cores = 4L
)
print(result_rj)Running this produces the most likely cluster P209 with
18 municipalities, log-likelihood ratio \(LR
\approx 38.83\), \(p <
0.001\), 27 observed deaths against 3.34 expected — closely
matching the published Table 7 (\(LR =
38.48\), 18 municipalities, 27 deaths, 3.37 expected). The minor
LR difference comes from a SIM database update between the paper’s
analysis and the dataset shipped here; the conclusions are
identical.
To extract more than one cluster from a single scan, the paper’s
Section 5.1.1 specifies a filtering rule: a candidate (zone, branch)
pair is retained only if its branch is disjoint from every previously
retained branch (no ancestor/descendant relation), or its zone is
disjoint from every previously retained zone (no region in common). This
is implemented in filter_clusters():
filter_clusters(result_rj)[1:5, c("node_id", "n_regions",
"cases", "expected", "llr")]For visualisation, get_cluster_regions() returns a tidy
data.frame ready to merge with a polygon layer:
# Most likely cluster only (single map)
cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE)
# Top-2 distinct clusters (faceted map)
cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)A complete plotting workflow with geobr and
ggplot2:
library(ggplot2)
library(geobr)
library(sf)
mun <- read_municipality(code_muni = "RJ", year = 2016)
mun$code6 <- as.integer(substr(mun$code_muni, 1, 6))
cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE),
unique(rj_mortality[, c("region_id", "ibge_code")]),
by = "region_id")
mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code")
ggplot(mun_facet) +
geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"),
na.value = "gray95", na.translate = FALSE) +
facet_wrap(~ panel, nrow = 1) +
theme_minimal() +
theme(legend.position = "none")A complete worked example with detailed annotation is shipped at
system.file("examples", "example_brazil_rj.R", package = "treeSS").
The Chicago dataset illustrates the package on a non-medical domain with a much larger tree (2841 nodes covering crime type, description and location). It also demonstrates a recurring choice users face in surveillance applications: which population should be the denominator?
chicago_crimes ships with two denominator columns:
population — the total incidents in each community
area. This is a compositional denominator: it expresses each
cell as a share of the citywide incident count and answers the question
“which crime types over-occur in which areas, relative to the
citywide profile?”pop_residential — the residential population of each
community area (ACS 2020 5-year, aggregated to community areas by CMAP
Community Data Snapshots). This is the appropriate denominator for an
incidence-rate analysis (incidents per resident), analogous to
deaths per live birth in Example 1.The choice changes both the model and the interpretation. We use the incidence-rate denominator below.
data(chicago_crimes)
data(chicago_tree)
result_chi <- treespatial_scan(
cases = chicago_crimes$cases,
population = chicago_crimes$pop_residential,
region_id = chicago_crimes$region_id,
x = chicago_crimes$x,
y = chicago_crimes$y,
node_id = chicago_crimes$node_id,
tree = chicago_tree,
max_pop_pct = 0.25,
nsim = 999, seed = 2023,
n_cores = 4L
)
print(result_chi)The most likely cluster covers a contiguous group of community areas on Chicago’s South Side (Englewood, West Englewood, Auburn Gresham, Roseland and neighbouring CCAs), driven by an over-occurrence of violent and property crimes against residents. The relative risk for this cluster is approximately 1.58 — a 58 % excess over the citywide rate.
chicago_crimes joins to the bundled
chicago_map polygon layer via the area_number
column:
data(chicago_map)
cr1 <- merge(get_cluster_regions(result_chi, n_clusters = 1, overlap = FALSE),
unique(chicago_crimes[, c("region_id", "area_number", "name")]),
by = "region_id")
chi <- merge(chicago_map, cr1, by.x = "AREA_NUM", by.y = "area_number",
all.x = TRUE)
ggplot(chi) +
geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
scale_fill_manual(values = c("1" = "#C44E52"), na.value = "gray95",
name = "Cluster") +
theme_minimal()iterative_scan()filter_clusters() (used in Example 1) returns distinct
clusters extracted from a single run of the scan. An
alternative — popular in the spatial-epidemiology literature since
Zhang, Assunção and Kulldorff (2010) — is to run the scan, remove the
cases assigned to the detected cluster, and re-run on the residual data.
Repeating this yields a sequence of clusters whose subsequent components
are conditionally most likely, given that the prior clusters
have been explained away.
The function iterative_scan() implements this procedure.
Note that it is not part of Cançado et al. (2025);
users seeking strict fidelity to that paper should use
filter_clusters(). The iterative procedure is offered as a
useful extension and is documented as such.
Because each iteration is a separate hypothesis test on data modified
by the previous iteration, the raw p-values overstate significance.
iterative_scan() collects the raw p-values from all
iterations and applies the Holm–Bonferroni correction at the end:
iter_rj <- iterative_scan(
cases = rj_mortality$cases,
population = rj_mortality$live_births,
region_id = rj_mortality$region_id,
x = rj_mortality$x,
y = rj_mortality$y,
node_id = rj_mortality$node_id,
tree = rj_tree,
max_iter = 5, alpha = 0.05,
nsim = 999, seed = 2016,
max_pop_pct = 0.50, n_cores = 4L
)
print(iter_rj)The returned clusters data.frame has one row per
iteration with the raw p-value, the Holm-adjusted p-value, and a
significant flag:
iter_rj$clusters[, c("iteration", "node_id", "n_regions",
"llr", "pvalue", "pvalue_adjusted",
"significant")]Mapping the iterative result uses the same S3 dispatch:
cr_it <- merge(get_cluster_regions(iter_rj, overlap = TRUE),
unique(rj_mortality[, c("region_id", "ibge_code")]),
by = "region_id")
mun_it <- merge(mun, cr_it, by.x = "code6", by.y = "ibge_code",
all.x = TRUE)
ggplot(mun_it) +
geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0",
"3" = "#55A868", "4" = "#8172B2",
"5" = "#CCB974"),
na.value = "gray95", na.translate = FALSE) +
facet_wrap(~ panel) +
theme_minimal() + theme(legend.position = "none")When to use which:
filter_clusters() /
get_cluster_regions(n_clusters = N) when you want to follow
Cançado et al. (2025) exactly.iterative_scan() when residual clustering is
plausible and you want a sequence of conditionally distinct clusters
with proper multiple-testing control.The Monte Carlo step (replicates of the null Poisson/Binomial counts
under proportional intensities) dominates the runtime. All scan
functions accept n_cores, which parallelises the Monte
Carlo step via OpenMP. On a 4-core laptop the Rio de Janeiro analysis
runs in under 20 seconds with nsim = 999; the Chicago
analysis (much larger tree) takes around two minutes.
For tasks where the case counts are concentrated near the leaves and
the tree is deep, aggregate_tree() can be used to compute
the internal-node sums once and reuse them across calls.
The inst/examples/ directory of the installed package
contains fully annotated worked examples for all four datasets:
list.files(system.file("examples", package = "treeSS"))
#> [1] "example_brazil_rj.R" "example_chicago.R"
#> [2] "example_florida.R" "example_london.R"The Florida script demonstrates the full workflow from raw long data
— building the ICD-10 tree from scratch, downloading county polygons via
tigris, assembling parallel input vectors and plotting with
ggplot2. The London script illustrates the use of a
non-medical tree (light × road × junction) on traffic collisions, with
an interactive leaflet map.
These two examples are also the basis of the detailed use cases in the companion R Journal paper.
Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. Environmental and Ecological Statistics, 32, 953–978. doi:10.1007/s10651-025-00670-w
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics — Theory and Methods, 26(6), 1481–1496.
Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. Biometrics, 59(2), 323–331.
Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. Journal of Probability and Statistics, 2010, 642379.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.