--- title: "MEC with blocking" author: "Adam Struzik" output: html_vignette: df_print: kable toc: true number_sections: true fig_width: 6 fig_height: 4 vignette: > %\VignetteIndexEntry{MEC with blocking} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Setup Load required packages. ```{r setup, message = FALSE} library(automatedRecLin) library(data.table) options("text2vec.mc.cores" = 1L) ``` ## Data We use the full example Census and Customer Information System (CIS) datasets from [McLeod et al. (2011)](https://wayback.archive-it.org/12090/20231221144450/https://cros-legacy.ec.europa.eu/content/job-training_en). The goal is to link records from CIS to records from Census. ```{r data} data("census", package = "automatedRecLin") data("cis", package = "automatedRecLin") setDT(census) setDT(cis) NROW(cis) NROW(census) ``` The `person_id` variable identifies the correct linkage. We use this information only to evaluate the result. ```{r true-matches} cis[is.na(cis)] <- "" census[is.na(census)] <- "" cis[, pername1 := gsub("-", "", pername1)] census[, pername1 := gsub("-", "", pername1)] true_matches <- merge( x = cis[, .(a = .I, person_id)], y = census[, .(b = .I, person_id)], by = "person_id" )[, .(a, b)] NROW(true_matches) ``` ## MEC with blocking We compare forename and surname using the Jaro-Winkler distance. These two comparison variables are modeled with the continuous parametric MEC method. Sex and date-of-birth variables use the default binary method. Address fields are used only to construct blocks. ```{r model-specification} variables <- c( "pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year" ) comparators <- list( "pername1" = jarowinkler_complement(), "pername2" = jarowinkler_complement() ) methods <- list( "pername1" = "continuous_parametric", "pername2" = "continuous_parametric" ) blocking_variables <- c(variables, "enumcap", "enumpc") ``` Run blocked MEC. The model is trained on sampled blocks that contain at least the requested number of pairs and a lower bound on nonmatches. ```{r mec-blocking} set.seed(1) result <- mec_blocking( A = cis, B = census, variables = variables, comparators = comparators, methods = methods, blocking_variables = blocking_variables, blocking_sep = "", controls_blocking = list(seed = 1, n_threads = 1), min_training_pairs = 1000, min_training_nonmatches = 1000, block_sampling_seed = 1, nonmatch_sample_size = 100000, nonmatch_sampling_seed = 1, true_matches = true_matches ) result ``` ## Blocking efficiency and linkage results The full Cartesian product contains `r format(NROW(cis) * NROW(census), big.mark = ",")` record pairs. Blocking reduces this to `r format(result$blocking_eval[["blocked_pairs"]], big.mark = ",")` candidate pairs, while retaining `r sprintf("%.2f%%", 100 * result$blocking_eval[["blocking_recall"]])` of known links. The final linkage set contains `r format(NROW(result$M_est), big.mark = ",")` predicted matches. ```{r results, echo = FALSE} data.table( step = c("Training", "Blocking", "Linkage"), result = c( paste0( result$training_rule, " on ", format(NROW(result$training_blocks), big.mark = ","), " blocks" ), paste0( format(result$blocking_eval[["preserved_matches"]], big.mark = ","), " of ", format(result$blocking_eval[["true_matches"]], big.mark = ","), " known links retained" ), paste0( "FLR = ", sprintf("%.2f%%", 100 * result$eval_metrics[["FLR"]]), "; MMR = ", sprintf("%.2f%%", 100 * result$eval_metrics[["MMR"]]) ) ) ) ```