---
title: "WID Code Dictionary"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{WID Code Dictionary}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(widr)
```

Every WID variable is identified by a compact code encoding four independent dimensions: series type, concept, age group, and population unit. This vignette documents all six lookup tables and explains how to use the code grammar programmatically.

## Code grammar

A full variable code has the structure:

```
<series_type><concept>[<age>][<pop>]
```

All components are validated against the lookup tables.

- `wid_decode()` and `wid_encode()` convert between the string and its components;
- `wid_validate()` checks components without constructing a code;
- `wid_search()` queries any table by regex.

```{r}
wid_decode("sptinc992j")
wid_encode("s", "ptinc", "992", "j")
wid_is_valid(series_type = "s", concept = "ptinc")
```

## Series types (`wid_series_types`)

The first character of every WID code. There are `r nrow(wid_series_types)` types.

```{r}
wid_series_types
```

The most commonly used types in distributional analysis:

| Code | Meaning | Unit |
|------|---------|------|
| `s` | share | fraction of 1 (e.g. 0.20 = 20%) |
| `a` | average | local currency, last year's prices |
| `t` | threshold | local currency, last year's prices |
| `m` | total | local currency, last year's prices |
| `g` | Gini coefficient | 0–1 |
| `b` | inverted Pareto-Lorenz coefficient | dimensionless |
| `w` | wealth/income ratio | fraction of national income |
| `x` | exchange rate | LCU per foreign currency |
| `i` | index | dimensionless |

Series types `s`, `g`, `b`, `w`, `r`, `p`, `i`, `y` are dimensionless; `wid_convert()` skips them. All others carry monetary units and can be converted.

```{r}
wid_search("share", tables = "series_types")
```

## Concepts (`wid_concepts`)

Characters 2–6 (or 2–7 for six-letter concepts) of the variable code.
There are `r nrow(wid_concepts)` concepts covering income, wealth, emissions, and demographic series.

```{r}
nrow(wid_concepts)
head(wid_concepts, 10)
```

### Income and wealth

The most-used concepts in distributional national accounts:

| Code | Description |
|------|-------------|
| `ptinc` | pre-tax national income |
| `plinc` | pre-tax labour income |
| `pkink` | pre-tax capital income |
| `fiinc` | fiscal income |
| `nninc` | net national income |
| `hweal` | net personal wealth |
| `hwdeb` | personal debt |
| `hwnfa` | net financial assets |
| `hwhou` | housing wealth |

### Emissions

| Code | Description |
|------|-------------|
| `tco2e` | total CO₂-equivalent emissions |
| `tpcem` | per-capita emissions |

### Searching concepts

`wid_search()` matches against both code and description:

```{r}
wid_search("wealth")
wid_search("income", tables = "concepts")
wid_search("^ptinc$", tables = "concepts")  # exact match
```

To find all share series for wealth:

```{r}
wid_search("wealth", type = "s")
```

## Age groups (`wid_ages`)

Age groups are represented by three-digit, zero-padded codes for population brackets. There are `r nrow(wid_ages)` defined age groups.

```{r}
wid_ages
```

The most-used groups:

| Code | Description |
|------|-------------|
| `999` | all ages |
| `992` | adults 20+ (most common default) |
| `996` | adults 20–64 (working age) |
| `993` | adults 20–39 |
| `994` | adults 40–59 |
| `995` | adults 60+ |
| `997` | elderly 65+ |
| `014` | children 0–14 |

The default in `download_wid()` is `ages = "992"`. To download all available age groups pass `ages = "all"`.

```{r}
wid_validate(age = 992)    # validates and zero-pads: "992"
wid_validate(age = "014")
```

## Population types (`wid_pop_types`)

One-letter codes for the statistical unit of observation. There are `r nrow(wid_pop_types)` types.


```{r}
wid_pop_types
```

| Code | Description |
|------|-------------|
| `j` | equal-split adults (income/wealth split equally between spouses) |
| `i` | individuals |
| `t` | tax units |
| `m` | male |
| `f` | female |
| `e` | employed |

`j` (equal-split) is the standard unit for distributional comparisons because it avoids counting couples differently in countries with different filing conventions. The default in `download_wid()` is `pop = "j"`.

## Countries and regions (`wid_countries`)

Two-letter ISO codes plus WID regional aggregates. There are `r nrow(wid_countries)` entries.

```{r}
head(wid_countries, 10)
```

Country sub-regions follow the pattern `XX-YY` (e.g. `US-CA` for California). Regional aggregates use non-standard codes (e.g. `WO` for world, `QE` for Europe).

```{r}
wid_search("United States", tables = "countries")
wid_search("^US", tables = "countries")  # US and all sub-regions
wid_search("Europe", tables = "countries")
```

`wid_validate()` warns on codes not matching the `^[A-Z]{2}(-[A-Z0-9]{1,5})?$` pattern:

```{r}
wid_validate(areas = c("US", "FR", "US-CA"))  # valid
wid_validate(areas = "lowercase")             # warning
```

## Percentiles (`wid_percentiles`)

Codes of the form `pXpY` specifying a fraction of the distribution. There are `r nrow(wid_percentiles)` enumerated codes.

```{r}
head(wid_percentiles, 10)
```

### Semantics by series type

The meaning of a percentile code depends on the series type:

- **Share / average (`s`, `a`):** `pXpY` denotes the group from the X-th to the Y-th percentile.
    - `p99p100` = top 1%.
- **Threshold (`t`):** `pXpY` or `pX` denotes the *minimum* value that places an individual in the group.
    - The threshold of `p90p100` is the 90th quantile.
- **No distributional meaning (`m`, `n`, `x`, `i`):** use `p0p100` (full population).

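The bounds encoded in a `pXpY` code can be recovered with a few lines of base R. The sketch below is illustrative only: `parse_perc()` is a hypothetical helper, not a widr function, and it assumes the two-bound `pXpY` form (it does not handle the single-bound `pX` form that thresholds may use).

```{r}
# Hypothetical helper (not part of widr): split a "pXpY" code into its bounds.
parse_perc <- function(code) {
  # Two captured numbers, each optionally with a decimal part, e.g. "p99.9p100".
  pattern <- "^p([0-9]+\\.?[0-9]*)p([0-9]+\\.?[0-9]*)$"
  m <- regmatches(code, regexec(pattern, code))[[1]]
  if (length(m) != 3L) stop("invalid percentile format: ", code)
  lower <- as.numeric(m[2])
  upper <- as.numeric(m[3])
  # Mirror the ordering rule: the lower bound must be strictly below the upper.
  if (lower >= upper) stop("invalid percentile order: ", code)
  c(lower = lower, upper = upper, width = upper - lower)
}

parse_perc("p90p100")     # lower 90, upper 100, width 10 (the top 10%)
parse_perc("p99.9p100")   # a narrow group at the very top
```

For a share or average series, `width` is the size of the group in percentage points; for a threshold series, `lower` is the percentile whose value marks entry into the group.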
### Common codes

| Code | Meaning |
|------|---------|
| `p0p100` | whole population |
| `p0p50` | bottom 50% |
| `p50p90` | middle 40% |
| `p90p100` | top 10% |
| `p99p100` | top 1% |
| `p99.9p100` | top 0.1% |
| `p99.99p100` | top 0.01% |
| `p0p50`, `p50p90`, `p90p100` | the three standard WID groups |

```{r}
wid_search("top 1", tables = "percentiles")
wid_search("bottom", tables = "percentiles")
```

### Validation

`wid_validate()` checks that the lower bound is strictly less than the upper:

```{r eval=FALSE}
wid_validate(perc = "p99p100")  # valid
wid_validate(perc = "p90p10")   # error: invalid percentile order
wid_validate(perc = "bad")      # error: invalid format
```

## Searching across all tables

Pass `tables = "all"` to search every table simultaneously:

```{r}
wid_search("income", tables = "all")
```

## Building and validating codes

```{r}
# Validate before building
wid_validate(series_type = "s", concept = "ptinc", age = 992, pop = "j")

# Encode
code <- wid_encode("s", "ptinc", "992", "j")
code  # "sptinc992j"

# Round-trip
identical(wid_encode(wid_decode(code)), code)  # TRUE

# Non-throwing check
wid_is_valid(series_type = "Z")  # FALSE
wid_is_valid(series_type = "s")  # TRUE
```

## The exchange rate codes

Exchange rate series follow a distinct naming convention. The concept occupies the standard positions but carries a currency and direction suffix:

| Code | Description |
|------|-------------|
| `xlcusx999i` | LCU per USD (market) |
| `xlceup999i` | EUR per LCU (market) |
| `xlpppx999i` | LCU per USD (PPP) |

`wid_convert()` uses these internally. The age `999` (all ages) and pop `i` (individuals) are standard for price/exchange indices, which have no distributional meaning and are always retrieved at `p0p100`.

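The exchange rate codes above have the same positional layout as any other full code, so they can be split by position alone. The base R sketch below is a hypothetical illustration, not how `wid_decode()` is implemented: `split_code()` is an assumed name, and it assumes a lowercase code that always carries a three-digit age and a one-letter population type. It does not check any component against the lookup tables.

```{r}
# Hypothetical helper (not part of widr): split a full code by position only.
split_code <- function(code) {
  # 1 letter (series type), 5-6 characters (concept), 3 digits (age), 1 letter (pop).
  pattern <- "^([a-z])([a-z0-9]{5,6})([0-9]{3})([a-z])$"
  m <- regmatches(code, regexec(pattern, code))[[1]]
  if (length(m) != 5L) stop("not a full WID code: ", code)
  c(series_type = m[2], concept = m[3], age = m[4], pop = m[5])
}

split_code("sptinc992j")   # series_type "s", concept "ptinc", age "992", pop "j"
split_code("xlcusx999i")   # series_type "x", concept "lcusx", age "999", pop "i"
```

Because the series type, age, and population fields have fixed widths, the concept is simply whatever remains in between, which is why five- and six-character concepts can share one pattern.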
## Reference

| Function | Purpose |
|----------|---------|
| `wid_decode(x)` | Parse code string into components |
| `wid_encode(type, concept, age, pop)` | Build code string from components |
| `wid_validate(...)` | Validate one or more components |
| `wid_is_valid(...)` | Non-throwing validation check |
| `wid_search(query, tables, type)` | Search lookup tables by regex |
| `wid_series_types` | Lookup table: series types |
| `wid_concepts` | Lookup table: concepts |
| `wid_ages` | Lookup table: age groups |
| `wid_pop_types` | Lookup table: population types |
| `wid_countries` | Lookup table: countries and regions |
| `wid_percentiles` | Lookup table: percentile codes |