---
title: "WID Code Dictionary"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{WID Code Dictionary}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(widr)
```

Every WID variable is identified by a compact code encoding four independent dimensions: series type, concept, age group, and population unit. This vignette documents all six lookup tables and explains how to use the code grammar programmatically.

## Code grammar

A full variable code has the structure:

```
<type:1><concept:5-6>[<age:3>][<pop:1>]
```

All components are validated against the lookup tables. The main helpers are:

- `wid_decode()` and `wid_encode()` convert between the string and its components.
- `wid_validate()` checks components without constructing a code; `wid_is_valid()` is its non-throwing counterpart.
- `wid_search()` queries any table by regex.

```{r}
wid_decode("sptinc992j")
wid_encode("s", "ptinc", "992", "j")
wid_is_valid(series_type = "s", concept = "ptinc")
```
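
To make the positional grammar concrete, the following base R sketch splits a code with a regular expression alone. `split_wid_code()` is a hypothetical helper written for this illustration, not part of widr, and it checks only the shape of the string. Position alone cannot always tell a five-letter concept followed by a population letter from a six-letter concept, which is one reason a real decoder needs the lookup tables.

```r
# Hypothetical helper (not part of widr): split a code by position only.
split_wid_code <- function(code) {
  # type: 1 letter; concept: 5-6 characters (shortest match preferred);
  # age: optional 3 digits; pop: optional 1 letter
  pattern <- "^([a-z])([a-z0-9]{5,6}?)([0-9]{3})?([a-z])?$"
  m <- regmatches(code, regexec(pattern, code, perl = TRUE))[[1]]
  if (length(m) == 0) {
    stop("string does not have the shape of a WID code: ", code)
  }
  parts <- as.list(m[-1])
  names(parts) <- c("type", "concept", "age", "pop")
  parts
}

str(split_wid_code("sptinc992j"))
```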

## Series types (`wid_series_types`)

The first character of every WID code. There are `r nrow(wid_series_types)` types.

```{r}
wid_series_types
```

The most commonly used types in distributional analysis:

| Code | Meaning | Unit |
|------|---------|------|
| `s`  | share | fraction of 1 (e.g. 0.20 = 20%) |
| `a`  | average | local currency, last year's prices |
| `t`  | threshold | local currency, last year's prices |
| `m`  | total | local currency, last year's prices |
| `g`  | Gini coefficient | 0–1 |
| `b`  | inverted Pareto-Lorenz coefficient | dimensionless |
| `w`  | wealth/income ratio | fraction of national income |
| `x`  | exchange rate | LCU per foreign currency |
| `i`  | index | dimensionless |

Series types `s`, `g`, `b`, `w`, `r`, `p`, `i`, `y` are dimensionless; `wid_convert()` skips them. All others carry monetary units and can be converted.

```{r}
wid_search("share", tables = "series_types")
```
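
The split between dimensionless and monetary series types can be sketched as a check on the first character of a code. The vector below is copied from the list in this section rather than read from widr, and `needs_conversion()` is a hypothetical name; it is not how `wid_convert()` is implemented.

```r
# Series types described above as dimensionless (copied from the text)
dimensionless_types <- c("s", "g", "b", "w", "r", "p", "i", "y")

# Hypothetical helper: TRUE when the first character is not a dimensionless type
needs_conversion <- function(codes) {
  !(substr(codes, 1, 1) %in% dimensionless_types)
}

needs_conversion(c("sptinc992j", "aptinc992j", "ghweal992j"))
#> [1] FALSE  TRUE FALSE
```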

## Concepts (`wid_concepts`)

Characters 2–6 (or 2–7 for six-letter concepts) of the variable code. There are `r nrow(wid_concepts)` concepts covering income, wealth, emissions, and demographic series.

```{r}
nrow(wid_concepts)
head(wid_concepts, 10)
```

### Income and wealth

The most-used concepts in distributional national accounts:

| Code | Description |
|------|-------------|
| `ptinc` | pre-tax national income |
| `pllin` | pre-tax labour income |
| `pkkin` | pre-tax capital income |
| `fiinc` | fiscal income |
| `nninc` | net national income |
| `hweal` | net personal wealth |
| `hwdeb` | personal debt |
| `hwnfa` | non-financial assets |
| `hwhou` | housing wealth |

### Emissions

| Code | Description |
|------|-------------|
| `tco2e` | total CO₂-equivalent emissions |
| `tpcem` | per-capita emissions |

### Searching concepts

`wid_search()` matches against both code and description:

```{r}
wid_search("wealth")
wid_search("income", tables = "concepts")
wid_search("^ptinc$", tables = "concepts")   # exact match
```

To find all share series for wealth:

```{r}
wid_search("wealth", type = "s")
```
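
What matching "against both code and description" means can be illustrated on a toy table in base R. `toy_concepts` and `search_toy()` are hypothetical stand-ins, and case-insensitive matching is an assumption of this sketch rather than documented behaviour of `wid_search()`.

```r
# Toy stand-in for a lookup table (three rows copied from the table above)
toy_concepts <- data.frame(
  code = c("ptinc", "fiinc", "hweal"),
  description = c("pre-tax national income", "fiscal income",
                  "net personal wealth"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the regex hits the code or the description
search_toy <- function(query, tbl) {
  hit <- grepl(query, tbl$code, ignore.case = TRUE) |
    grepl(query, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

search_toy("income", toy_concepts)$code
#> [1] "ptinc" "fiinc"
search_toy("^ptinc$", toy_concepts)$code   # anchors give an exact code match
#> [1] "ptinc"
```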

## Age groups (`wid_ages`)

Age groups are represented by three-digit, zero-padded codes for population brackets. There are `r nrow(wid_ages)` defined age groups.

```{r}
wid_ages
```

The most-used groups:

| Code | Description |
|------|-------------|
| `999` | all ages |
| `992` | adults 20+ (most common default) |
| `996` | adults 20–64 (working age) |
| `993` | adults 20–39 |
| `994` | adults 40–59 |
| `995` | adults 60+ |
| `997` | elderly 65+ |
| `014` | children 0–14 |

The default in `download_wid()` is `ages = "992"`. To download all available age groups, pass `ages = "all"`.

```{r}
wid_validate(age = 992)    # validates and zero-pads: "992"
wid_validate(age = "014")
```
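
Zero-padding matters for codes such as `014`, where a numeric input would otherwise lose its leading zero. A minimal base R sketch of the padding step, with `pad_age()` as a hypothetical helper and no claim about how `wid_validate()` does it internally:

```r
# Hypothetical helper: format numeric ages as three-character, zero-padded strings
pad_age <- function(age) {
  sprintf("%03d", as.integer(age))
}

pad_age(c(14, 992))
#> [1] "014" "992"
```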

## Population types (`wid_pop_types`)

One-letter codes for the statistical unit of observation. There are `r nrow(wid_pop_types)` types.

```{r}
wid_pop_types
```

| Code | Description |
|------|-------------|
| `j`  | equal-split adults (income/wealth split equally between spouses) |
| `i`  | individuals |
| `t`  | tax units |
| `m`  | male |
| `f`  | female |
| `e`  | employed |

`j` (equal-split) is the standard unit for distributional comparisons because it avoids counting couples differently in countries with different filing conventions. The default in `download_wid()` is `pop = "j"`.

## Countries and regions (`wid_countries`)

Two-letter ISO codes plus WID regional aggregates. There are `r nrow(wid_countries)` entries.

```{r}
head(wid_countries, 10)
```

Country sub-regions follow the pattern `XX-YY` (e.g. `US-CA` for California). Regional aggregates use non-standard codes (e.g. `WO` for world, `QE` for Europe).

```{r}
wid_search("United States", tables = "countries")
wid_search("^US", tables = "countries")     # US and all sub-regions
wid_search("Europe", tables = "countries")
```

`wid_validate()` warns on codes not matching the `^[A-Z]{2}(-[A-Z0-9]{1,5})?$` pattern:

```{r}
wid_validate(areas = c("US", "FR", "US-CA"))    # valid
wid_validate(areas = "lowercase")               # warning
```
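
The pattern itself can be tried directly with `grepl()` in base R. This only reproduces the shape check quoted above; it says nothing about whether a well-formed code actually exists in `wid_countries`.

```r
# Pattern quoted in the text: two capitals, optionally "-" plus 1-5 capitals/digits
area_pattern <- "^[A-Z]{2}(-[A-Z0-9]{1,5})?$"
areas <- c("US", "FR", "US-CA", "lowercase")

grepl(area_pattern, areas)
#> [1]  TRUE  TRUE  TRUE FALSE
```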

## Percentiles (`wid_percentiles`)

Codes of the form `pXpY` specifying a fraction of the distribution. There are `r nrow(wid_percentiles)` enumerated codes.

```{r}
head(wid_percentiles, 10)
```

### Semantics by series type

The meaning of a percentile code depends on the series type:

- **Share / average (`s`, `a`):** `pXpY` denotes the group from the X-th to the Y-th percentile. 
  - `p99p100` = top 1%.
- **Threshold (`t`):** `pXpY` or `pX` denotes the *minimum* value that places an individual in the group.
  - The threshold of `p90p100` is the 90th percentile of the distribution.
- **No distributional meaning (`m`, `n`, `x`, `i`):** use `p0p100` (full population).

### Common codes

| Code | Meaning |
|------|---------|
| `p0p100`   | whole population |
| `p0p50`    | bottom 50% |
| `p50p90`   | middle 40% |
| `p90p100`  | top 10% |
| `p99p100`  | top 1% |
| `p99.9p100` | top 0.1% |
| `p99.99p100` | top 0.01% |

`p0p50`, `p50p90`, and `p90p100` are the three standard WID groups.

```{r}
wid_search("top 1", tables = "percentiles")
wid_search("bottom", tables = "percentiles")
```

### Validation

`wid_validate()` checks the format of the code and that the lower bound is strictly less than the upper bound:

```{r eval=FALSE}
wid_validate(perc = "p99p100")    # valid
wid_validate(perc = "p90p10")     # error: invalid percentile order
wid_validate(perc = "bad")        # error: invalid format
```
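
The two checks can be sketched in base R as follows. `parse_percentile()` is a hypothetical helper for illustration only; the error messages and the exact rules applied by `wid_validate()` may differ.

```r
# Hypothetical helper (not part of widr): parse "pXpY" into numeric bounds.
parse_percentile <- function(perc) {
  # "p", a number, "p", a number; decimals such as 99.9 are allowed
  pattern <- "^p([0-9]+(?:\\.[0-9]+)?)p([0-9]+(?:\\.[0-9]+)?)$"
  m <- regmatches(perc, regexec(pattern, perc, perl = TRUE))[[1]]
  if (length(m) == 0) stop("invalid format: ", perc)
  bounds <- as.numeric(m[-1])
  # the lower bound must be strictly below the upper bound
  if (bounds[1] >= bounds[2]) stop("invalid percentile order: ", perc)
  c(lower = bounds[1], upper = bounds[2])
}

parse_percentile("p99p100")
parse_percentile("p99.9p100")
```

Calling `parse_percentile("p90p10")` or `parse_percentile("bad")` stops with an error, mirroring the two failing cases above.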

## Searching across all tables

Pass `tables = "all"` to search every table simultaneously:

```{r}
wid_search("income", tables = "all")
```

## Building and validating codes

```{r}
# Validate before building
wid_validate(series_type = "s", concept = "ptinc", age = 992, pop = "j")

# Encode
code <- wid_encode("s", "ptinc", "992", "j")
code  # "sptinc992j"

# Round-trip
identical(wid_encode(wid_decode(code)), code)  # TRUE

# Non-throwing check
wid_is_valid(series_type = "Z")   # FALSE
wid_is_valid(series_type = "s")   # TRUE
```
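
Because a code is the concatenation of its components, a batch of candidate codes can be sketched in base R with `expand.grid()` and `paste0()`. This bypasses the lookup tables entirely, so it is only a way to enumerate candidates that would still need to be validated.

```r
# All combinations of two series types and two concepts,
# for adults 20+ ("992") and equal-split adults ("j")
grid <- expand.grid(
  type = c("s", "a"),
  concept = c("ptinc", "hweal"),
  stringsAsFactors = FALSE
)
codes <- paste0(grid$type, grid$concept, "992", "j")
codes
#> [1] "sptinc992j" "aptinc992j" "shweal992j" "ahweal992j"
```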

## The exchange rate codes

Exchange rate series follow a distinct naming convention. The concept occupies the standard positions and encodes the reference currency and the rate type (market or PPP):

| Code | Description |
|------|-------------|
| `xlcusx999i` | LCU per USD (market) |
| `xlceux999i` | LCU per EUR (market) |
| `xlcusp999i` | LCU per USD (PPP) |

`wid_convert()` uses these internally. The age `999` (all ages) and pop `i` (individuals) are standard for price/exchange indices, which have no distributional meaning and are always retrieved at `p0p100`.
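
When a rate is quoted as local currency units per one unit of foreign currency, converting a monetary value means dividing by the rate. The sketch below shows only that arithmetic with made-up numbers; `lcu_to_foreign()` is a hypothetical helper and not a description of how `wid_convert()` works.

```r
# Hypothetical helper: rate is LCU per one unit of the foreign currency
lcu_to_foreign <- function(value_lcu, lcu_per_foreign) {
  value_lcu / lcu_per_foreign
}

# Made-up numbers: 50,000 LCU at 1.25 LCU per USD
lcu_to_foreign(50000, 1.25)
#> [1] 40000
```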

## Reference

| Function | Purpose |
|----------|---------|
| `wid_decode(x)` | Parse code string into components |
| `wid_encode(type, concept, age, pop)` | Build code string from components |
| `wid_validate(...)` | Validate one or more components |
| `wid_is_valid(...)` | Non-throwing validation check |
| `wid_search(query, tables, type)` | Search lookup tables by regex |
| `wid_series_types` | Lookup table: series types |
| `wid_concepts` | Lookup table: concepts |
| `wid_ages` | Lookup table: age groups |
| `wid_pop_types` | Lookup table: population types |
| `wid_countries` | Lookup table: countries and regions |
| `wid_percentiles` | Lookup table: percentile codes |