---
title: Using the CEOdata package
author: Joel Ardiaca
date: "`r format(Sys.time(), '%d/%m/%Y')` - Version `r packageVersion('CEOdata')`"
classoption: a4paper,justified
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using the CEOdata package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r echo=FALSE, message=FALSE, warning=FALSE}
library(CEOdata)
library(tibble)
library(dplyr)
library(haven)
library(stringr)

example_path <- function(filename) {
  system.file("extdata", filename, package = "CEOdata", mustWork = TRUE)
}

acc_meta <- readRDS(example_path("accumulated_meta_example.rds"))
meta <- readRDS(example_path("REO_meta_example.rds"))

to_factor_tibble <- function(path) {
  haven::read_sav(path) |>
    tibble::as_tibble() |>
    dplyr::mutate(
      dplyr::across(
        where(~ inherits(.x, "haven_labelled")),
        haven::as_factor
      )
    )
}

to_raw_tibble <- function(path) {
  haven::read_sav(path) |>
    tibble::as_tibble()
}

d <- to_factor_tibble(example_path("BOP_presencial_example.sav"))
d_raw <- to_raw_tibble(example_path("BOP_presencial_example.sav"))
d1145 <- to_factor_tibble(example_path("d1145_example.sav"))
d1145_raw <- to_raw_tibble(example_path("d1145_example.sav"))

CEOaccumulated_meta <- function(series = NULL, active_only = FALSE) {
  out <- acc_meta
  if (!is.null(series)) {
    out <- out |>
      dplyr::filter(codi_serie %in% series)
  }
  if (isTRUE(active_only) && "estat" %in% names(out)) {
    estat_chr <- tolower(as.character(out$estat))
    out <- out[stringr::str_detect(estat_chr, "\\bactiva\\b") %in% TRUE, , drop = FALSE]
  }
  out
}

CEOmeta <- function(reo = NULL, search = NULL, date_start = NA, date_end = NA, ...) {
  out <- meta
  if (!is.null(reo)) {
    out <- out |>
      dplyr::filter(as.character(REO) %in% as.character(reo))
  } else if (!is.null(search)) {
    cols <- intersect(c("Titol enquesta", "Titol estudi", "Objectius", "Resum", "Descriptors"), names(out))
    if (length(cols) > 0) {
      pattern <- paste(search, collapse = "|")
      out <- out |>
        dplyr::mutate(dplyr::across(dplyr::all_of(cols), ~ tolower(as.character(.x)))) |>
        dplyr::filter(
          dplyr::if_any(
            dplyr::all_of(cols),
            ~ stringr::str_detect(.x, stringr::regex(pattern, ignore_case = TRUE))
          )
        )
    }
  }

  start <- suppressWarnings(as.Date(date_start))
  end <- suppressWarnings(as.Date(date_end))
  if (!is.na(start) && "Data d'alta al REO" %in% names(out)) {
    out <- out |>
      dplyr::filter(`Data d'alta al REO` >= start)
  }
  if (!is.na(end) && "Data d'alta al REO" %in% names(out)) {
    out <- out |>
      dplyr::filter(`Data d'alta al REO` <= end)
  }
  out
}

CEOdata <- function(series = "BOP_presencial", reo = NA, raw = FALSE) {
  if (!is.na(reo)) {
    return(if (isTRUE(raw)) d1145_raw else d1145)
  }
  if (!identical(series, "BOP_presencial")) {
    stop(
      "Offline vignette example includes only series = 'BOP_presencial'.",
      call. = FALSE
    )
  }
  if (isTRUE(raw)) d_raw else d
}
```

# 1. Introduction

`CEOdata` provides convenient access to the microdata (individual-level survey responses) produced by the Centre d'Estudis d'Opinió (CEO), the public opinion institute of the Government of Catalonia.

This vignette runs fully offline and uses the small example datasets bundled in the package's `extdata` folder (`inst/extdata/` in the package source).

The central entry point is the function

- **`CEOdata()`**, which downloads and imports microdata into `R`.

Depending on the arguments provided, `CEOdata()` can retrieve either:

1. An **accumulated microdata series**, identified by a __codi_serie__ (e.g. "BOP_presencial"), or
2. A **single study dataset**, identified by its __REO__ code (e.g. "1145").

In addition to data retrieval, the package includes:

- **`CEOmeta()`**, which provides metadata for all individual studies, giving a complete list and details of all available surveys, and allowing the user to search for specific topics.
- **`CEOaccumulated_meta()`**, which provides access to the list of available accumulated microdata series.
- **`CEOsearch()`**, which allows searching variable names, variable labels, and value labels within downloaded datasets.

Together, these functions provide a coherent workflow for discovering, downloading, and exploring CEO survey microdata directly from `R`.

# 2. Accumulated microdata series

## 2.1. What is a series?

An accumulated microdata series is a dataset that combines the individual responses from multiple CEO surveys conducted under a common design and topic.

For example, the series "`BOP_presencial`" contains the accumulated microdata of the __Baròmetres d'Opinió Política__ conducted face-to-face since 2014. Each row corresponds to an individual respondent, while the dataset aggregates responses across several survey waves (each identified by a different REO code).

In contrast to downloading a single study (via its __REO__ code), working with an accumulated series allows users to:

- Analyse trends across time
- Pool observations to increase statistical power
- Work with a harmonised questionnaire structure across waves
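
As a minimal sketch of the first two points, the toy data frame below stands in for an accumulated series. The column names `REO` and `answer` and all values are invented for illustration and are not taken from any CEO dataset.

```r
# Toy stand-in for an accumulated series: one row per respondent,
# with a wave identifier (hypothetical column names and values).
toy_series <- data.frame(
  REO = c("w1", "w1", "w1", "w2", "w2", "w2", "w2"),
  answer = c("A", "B", "A", "A", "B", "B", "B")
)

# Trend across waves: share of each answer within each wave (row proportions).
wave_shares <- prop.table(table(toy_series$REO, toy_series$answer), margin = 1)
round(wave_shares, 2)

# Pooled sample size across all waves.
nrow(toy_series)
```

With a real series, the same pattern would apply to whichever wave identifier and survey variable the dataset actually contains.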

Each series is identified by a `codi_serie`, which can be inspected using `CEOaccumulated_meta()`.

## 2.2. List series

The available accumulated microdata series can be inspected using `CEOaccumulated_meta()`:
```{r echo=TRUE, message=FALSE, warning=FALSE}
head(CEOaccumulated_meta())
```

This function returns a tibble where each row corresponds to an accumulated series. The most relevant columns are:

- **`codi_serie`**: identifier used to download the series.
- **`titol_serie`**: descriptive title of the series.
- **`mode_admin`**: mode of administration.
- **`data_inici`** and **`data_fi`**: temporal coverage.
- **`reo`**: the list of __REO__ codes, separated by commas, that the accumulated series contains.
- **`estat`**: whether the series is inactive or active.
- **`microdades_1`**: direct link to the microdata file (used in `CEOdata()`).

To see only the identifiers:

```{r echo=TRUE, message=FALSE, warning=FALSE}
head(unique(CEOaccumulated_meta()$codi_serie))
```

You can also filter the metadata to inspect a specific series:

```{r echo=TRUE, message=FALSE, warning=FALSE}
head(CEOaccumulated_meta(series = "BOP_presencial"))
```
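
Because the `reo` column stores several codes in a single comma-separated string, it may be useful to split it into a vector before matching it against individual studies. The following is a minimal sketch on an invented row; the series name and REO codes are placeholders, not real identifiers.

```r
# Toy row mimicking the structure described above (values are invented).
toy_meta <- data.frame(
  codi_serie = "toy_series",
  reo = "1001, 1002,1003",
  stringsAsFactors = FALSE
)

# Split the comma-separated string and trim any surrounding whitespace.
reo_codes <- trimws(strsplit(toy_meta$reo, ",", fixed = TRUE)[[1]])
reo_codes
length(reo_codes)
```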


## 2.3. Load series

Once a `codi_serie` has been identified, the corresponding dataset can be loaded using `CEOdata()`. In this offline vignette, the only accumulated series available is "__BOP_presencial__".
```{r echo=TRUE, message=FALSE, warning=FALSE}
d <- CEOdata()
head(d)
```

This is equivalent to explicitly specifying the series:
```{r echo=TRUE, message=FALSE, warning=FALSE}
d <- CEOdata(series = "BOP_presencial")
head(d)
```

Attempting to load a different accumulated series in this offline vignette returns an informative error. Other series can be loaded when the package is used with an internet connection:
```{r echo=TRUE, message=FALSE, warning=FALSE}
try(CEOdata(series = "Longitudinal"))
```

The returned object is a tibble where each row represents an individual respondent and columns correspond to survey variables. Accumulated series typically combine multiple survey waves that share a comparable questionnaire structure.

By default, SPSS labelled variables are converted into standard R factors. To retain the original `haven_labelled` format:
```{r echo=TRUE, message=FALSE, warning=FALSE}
d_raw <- CEOdata(series = "BOP_presencial", raw = TRUE)
head(d_raw)
```


# 3. Individual studies (REO)

## 3.1. List studies

All individual surveys from the Generalitat de Catalunya are identified by a **REO code** (__Registre d'Estudis d'Opinió__). Each REO corresponds to a specific survey wave conducted at a given time. 

The available studies in the offline example can be inspected using `CEOmeta()`:
```{r echo=TRUE, message=FALSE, warning=FALSE}
meta <- CEOmeta()
head(meta)
```

This function returns a tibble where each row corresponds to a study. Among the most relevant columns are:

- **`REO`**: the study identifier.
- **`Títol enquesta`**: descriptive title of the study.
- **`Objectius`**: description of the survey goals and contents.
- **`Dia inici treball de camp`** and **`Dia final treball de camp`**: fieldwork dates.
- **`microdata_available`**: logical indicator of whether microdata are publicly available.
- **`Microdades_1`**: direct link to the microdata file (used in `CEOdata()`).

Surveys conducted by the CEO itself have publicly available microdata, but surveys from other institutions of the Catalan government may not. To list only the surveys whose microdata can be retrieved:

```{r echo=TRUE, message=FALSE, warning=FALSE}
available <- CEOmeta() |> dplyr::filter(microdata_available)
head(available)
```

The `search` argument allows users to look for keywords across several descriptive fields (such as title, summary and objectives); search words should be in Catalan. To retrieve the metadata of one specific study instead, use the `reo` argument:

```{r echo=TRUE, message=FALSE, warning=FALSE}
specific_reo <- CEOmeta(reo = "1145")
head(specific_reo)
```
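
To illustrate the idea behind a keyword search across descriptive fields, the sketch below filters an invented metadata table with `dplyr` and `stringr`. It is not the package's implementation: the rows, the REO codes and the choice of the `Objectius` and `Resum` columns are assumptions made for the example.

```r
library(dplyr)
library(stringr)

# Toy metadata table (rows and REO codes are invented for illustration).
toy_meta <- tibble::tibble(
  REO = c("9001", "9002", "9003"),
  Objectius = c("Consum cultural", "Habitatge i lloguer", "Salut mental dels joves"),
  Resum = c("Estudi sobre cultura", "Enquesta sobre habitatge", "Estudi de salut")
)

# Combine the keywords into one case-insensitive pattern.
keywords <- c("HABITATGE", "salut")
pattern <- stringr::regex(paste(keywords, collapse = "|"), ignore_case = TRUE)

# Keep rows where any descriptive field matches any keyword.
hits <- toy_meta |>
  dplyr::filter(dplyr::if_any(c(Objectius, Resum), ~ stringr::str_detect(.x, pattern)))
hits$REO
```

Joining the keywords with `|` means that a study is kept when it matches at least one of them, in any of the selected fields.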

## 3.2. Load studies

Once you have identified the REO code of a study, you can load its microdata using `CEOdata(reo = ...)`.
```{r echo=TRUE, message=FALSE, warning=FALSE}
d1145 <- CEOdata(reo = "1145")
head(d1145)
```

The returned object is a tibble where each row corresponds to an individual respondent and columns correspond to survey variables. If a REO has no microdata available, `CEOdata()` returns an informative error.

As with accumulated series, by default the package converts SPSS-labelled variables into standard R factors. To keep the raw `haven_labelled` format:
```{r echo=TRUE, message=FALSE, warning=FALSE}
d1145_raw <- CEOdata(reo = "1145", raw = TRUE)
head(d1145_raw)
```

# 4. Search for keywords in the labels

Once a dataset has been downloaded using `CEOdata()`, the function `CEOsearch()` can be used to look for keywords in the variable labels or value labels. This is especially useful when working with large questionnaires and searching for specific topics.

You can search for keywords in the variable labels, for example "democràcia" (democracy), in the last retrieved dataset. Keywords must be typed in Catalan.
```{r echo=TRUE, message=FALSE, warning=FALSE}
head(CEOsearch(d1145, keyword = "democràcia"))
```

Sometimes the relevant information is in the value labels rather than in the variable labels. You can also search within response categories.
```{r echo=TRUE, message=FALSE, warning=FALSE}
head(CEOsearch(d1145, keyword = "Catalunya", where = "values"))
```
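
To show where such labels live in a labelled dataset, the sketch below builds an invented two-variable tibble with `haven::labelled()` and searches its attributes using base R. It illustrates the general mechanism only and is not how `CEOsearch()` is implemented; the variable names, labels and codes are hypothetical.

```r
library(haven)

# Toy labelled dataset (variable names, labels and codes are invented).
toy <- tibble::tibble(
  LLOC_NAIX = haven::labelled(
    c(1, 2, 1),
    labels = c("Catalunya" = 1, "Resta d'Espanya" = 2),
    label = "Lloc de naixement"
  ),
  ESTUDIS = haven::labelled(
    c(3, 1, 2),
    labels = c("Primaris" = 1, "Secundaris" = 2, "Universitaris" = 3),
    label = "Nivell d'estudis"
  )
)

# Variable labels are stored in the "label" attribute of each column.
var_labels <- vapply(toy, function(x) attr(x, "label"), character(1))
in_labels <- names(var_labels)[grepl("estudis", var_labels, ignore.case = TRUE)]
in_labels

# Value labels are the names of the "labels" attribute of each column.
has_value <- vapply(
  toy,
  function(x) any(grepl("Catalunya", names(attr(x, "labels")), fixed = TRUE)),
  logical(1)
)
in_values <- names(has_value)[has_value]
in_values
```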

# 5. Working with labelled data (raw vs factors)

CEO microdata are originally distributed as SPSS (`.sav`) files. These files store categorical variables using value labels (e.g. `1 = Yes`, `2 = No`) rather than plain R factors.

By default, `CEOdata()` converts SPSS-labelled variables into standard R factors. This makes the dataset immediately convenient for descriptive statistics, modelling, and plotting in `R`, as most workflows expect factors rather than labelled vectors. If you prefer the labelled structure, for example to retain the exact numeric codes, set the argument `raw = TRUE` when retrieving any dataset.
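
The difference between the two representations can be seen on a small invented vector. The sketch below uses `haven` directly and reuses the `1 = Yes`, `2 = No` coding mentioned above; the values are made up for the example.

```r
library(haven)

# A labelled vector as it might arrive from a .sav file (toy values).
x <- haven::labelled(c(1, 2, 2, 1), labels = c(Yes = 1, No = 2))

# Factor view: the value labels become factor levels.
x_factor <- haven::as_factor(x)
levels(x_factor)

# Raw view: the numeric codes are preserved once the labels are removed.
x_codes <- as.numeric(haven::zap_labels(x))
x_codes
```

The factor view is usually easier for tables and models, while the raw view keeps the original codes available for recoding or for matching against the questionnaire.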

# 6. Notes on reproducibility and data updates

In online use, `CEOdata` retrieves datasets directly from the official open data platform of the Generalitat de Catalunya. In this vignette, all examples are run offline with fixed local files. Online retrieval has implications for reproducibility:

- Accumulated series may change over time as new survey waves are incorporated.
- Metadata catalogues may be updated.
- Minor corrections to datasets may be introduced by the data provider.

As a consequence, repeated calls to `CEOdata()` at different points in time may return slightly different datasets.

## 6.1. Ensuring reproducibility

To enhance reproducibility in applied research, it is recommended to:

- Save a local copy of the downloaded dataset used in the analysis.
- Record the date of the download.
- Report the version of the CEOdata package used.

```{r echo=TRUE, message=FALSE, warning=FALSE}
packageVersion("CEOdata")
```
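
One possible way to follow these recommendations is to freeze the data and its provenance in a single file. The sketch below is only an illustration: the toy data frame stands in for a dataset returned by `CEOdata()`, and the file name and list structure are hypothetical choices.

```r
# Stand-in for a dataset returned by CEOdata() (toy values).
d_snapshot <- data.frame(id = 1:3, answer = c("a", "b", "a"))

# Store the data together with the download date and the R version.
# In a real analysis one would also record packageVersion("CEOdata").
snapshot <- list(
  data = d_snapshot,
  downloaded_on = Sys.Date(),
  r_version = R.version.string
)

# Hypothetical location; a project folder would be used in practice.
snapshot_file <- file.path(tempdir(), "ceo_snapshot.rds")
saveRDS(snapshot, snapshot_file)

# Later sessions reload the frozen copy instead of downloading again.
restored <- readRDS(snapshot_file)
identical(restored$data, d_snapshot)
```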

`CEOdata` aims to provide convenient and transparent access to official survey data, but reproducible research practices remain the responsibility of the analyst.