---
title: "Decoding UKB Column Names and Values"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Decoding UKB Column Names and Values}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

Raw UKB phenotype data contains encoded column names and values that need to be converted before analysis.

| Source | Column names | Column values |
|---|---|---|
| `extract_pheno()` | `participant.p31` | Raw integer codes; needs `decode_values()` |
| `extract_batch()` | `p31`, `p53_i0` | Usually already decoded; `decode_values()` typically not needed |

Both outputs need `decode_names()` to convert field ID column names to human-readable snake_case.

> **Call order matters**: when using `extract_pheno()` output, always run `decode_values()` before `decode_names()`, because value decoding relies on the numeric field ID still being present in the column name.

---

## Recommended Workflow

```{r workflow}
library(ukbflow)

df <- extract_pheno(c(31, 54, 20116, 21022))
df <- decode_values(df)   # 0/1 → "Female"/"Male", etc.
df <- decode_names(df)    # participant.p31 → sex
```

---

## Step 1: Decode Values

`decode_values()` converts raw integer codes to human-readable labels for categorical fields that have UKB encoding mappings. Continuous, date, text, and already-decoded fields are left unchanged.

```{r decode-values}
df <- decode_values(df)
#> ✔ Decoded 3 categorical columns; 2 non-categorical columns unchanged.
```

It requires two metadata files from the UKB Showcase.
Download them once with:

```{r fetch-meta}
fetch_metadata(dest_dir = "data/metadata")
```

Then point `decode_values()` to the same directory (the default matches `fetch_metadata()`):

```{r decode-values-dir}
df <- decode_values(df, metadata_dir = "data/metadata")
```

### What gets decoded

| Column | Raw value | Decoded value |
|---|---|---|
| `p31` | `0` / `1` | `"Female"` / `"Male"` |
| `p54` | `11012` | `"Leeds"` |
| `p20116_i0` | `0` / `1` / `2` | `"Never"` / `"Previous"` / `"Current"` |

Codes absent from the encoding table (including UKB missing codes `-1`, `-3`, `-7`) are returned as `NA`.

---

## Step 2: Decode Names

`decode_names()` renames columns from field ID format to snake_case labels using the approved UKB field dictionary available to your project.

```{r decode-names}
df <- decode_names(df)
#> ✔ Renamed 5 columns.
```

### Name conversion examples

| Raw name | Decoded name |
|---|---|
| `participant.eid` | `eid` |
| `participant.p31` | `sex` |
| `participant.p21022` | `age_at_recruitment` |
| `participant.p53_i0` | `date_of_attending_assessment_centre_i0` |
| `p31` | `sex` |
| `p53_i0` | `date_of_attending_assessment_centre_i0` |

Both `extract_pheno()` format (`participant.p31`) and `extract_batch()` format (`p31`) are handled automatically.

### Long names

Some UKB field titles are verbose. Names longer than `max_nchar` characters (default: 60) are flagged with a warning. Lower the threshold to flag more names:

```{r long-names}
df <- decode_names(df, max_nchar = 30)
#> ! 1 column name longer than 30 characters - consider renaming manually:
#> • date_of_attending_assessment_centre_i0
```

Rename manually to something concise:

```{r rename}
names(df)[names(df) == "date_of_attending_assessment_centre_i0"] <- "date_baseline"
```

---

## Getting Help

- `?decode_values`, `?decode_names`
- `vignette("extract")` for extracting phenotype data
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)
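
---

## Appendix: Unmapped Codes as a Lookup

The `NA` behaviour described under "What gets decoded" can be pictured as a plain table lookup. The base R sketch below is only an illustration, not the implementation of `decode_values()`; the `codes` vector and the `encoding` lookup are hypothetical stand-ins for a raw column and its encoding table.

```{r unmapped-codes-sketch}
# Hypothetical raw codes for a smoking-status column; -3 is a UKB missing code.
codes <- c(0, 1, 2, -3)

# Hypothetical encoding table as a named character vector (code -> label).
encoding <- c("0" = "Never", "1" = "Previous", "2" = "Current")

# Look each code up by name; codes with no entry in the table come back as NA.
decoded <- unname(encoding[as.character(codes)])

decoded          # labels for 0, 1, 2 and NA for -3
is.na(decoded)   # TRUE only where the code had no mapping
```

Because unmapped codes collapse to `NA`, it may be worth tabulating the raw column before decoding if an analysis needs to tell the different missing codes apart.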