---
title: "Pathway Enrichment Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{path-enrichment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

Pathway enrichment analysis is a vital step in understanding the biological relevance of differential expression results. By mapping significant genes to known functional categories, researchers can gain insights into the cellular processes and pathways affected by their experimental conditions. 

In this vignette, we will explore how to perform pathway enrichment analysis using the `pathdb` package in conjunction with the popular `clusterProfiler` package. We will utilize a sample dataset, `hypoxia_deseq`, which contains differential expression results comparing treatment under hypoxic conditions.

First, let's load the required packages and the example dataset.

```{r setup, message=FALSE, warning=FALSE}
library(pathdb)
library(clusterProfiler)
library(dplyr)

# Preview the first few rows
head(hypoxia_deseq)
```

## Identifying Differentially Expressed Genes (DEGs)

Before performing pathway enrichment, we need to classify our genes based on their expression changes. We typically label genes as "UP" regulated, "DOWN" regulated, or "NO" significant change based on their fold change (`log2FoldChange`) and adjusted p-value (`padj`).

We apply a strict threshold to identify significant genes, such as an adjusted p-value of less than 0.05.

```{r define_deg}
# Define up- and down- regulation based on LFC and padj
df <- hypoxia_deseq |>
  as.data.frame() |>
  mutate(diffexp = case_when(
    log2FoldChange > 0 & padj < 0.05 ~ "UP",
    log2FoldChange < 0 & padj < 0.05 ~ "DOWN",
    TRUE ~ "NO" # using TRUE as catch-all for padj >= 0.1 or NA
  ))

# Filter out non-significant or missing adjusted p-value genes
df <- df[df$diffexp != "NO" & !is.na(df$diffexp), ]
```

Next, to ensure our gene IDs match the standard format expected by the `pathdb` database, we can use the `convert_id()` function. Here our species of interest is human (Species ID: 96).

```{r convert_ids, eval=FALSE}
# Convert ids to Ensembl format to ensure compatibility
df <- convert_id(
  genes = rownames(df),
  data = df,
  species_id = 96
)
```

To facilitate enrichment analysis on both up- and down-regulated genes separately, we split our data frame into a list.

```{r split_lists}
# Split into up or down groups in a list
deg_results_list <- split(df, df$diffexp)
```

## Preparing the TERM2GENE Mapping

The `clusterProfiler` function `enricher` requires background annotation data that maps pathway categories (TERMs) to gene IDs (GENEs). With the `pathdb` package, fetching this information is streamlined using the `T2G_prep()` function.

We will extract the KEGG pathways mapping for all the genes in our original `hypoxia_deseq` dataset, which acts as our background universe. This feature of `pathdb` is especially versatile, as one can use a vector of pathway database names, e.g. `c("KEGG", "GOBP", "GOCC")`. In this way, we can check databases in one function call, rather than pulling from KEGG, GO, and other databases separately.

```{r t2g_prep, eval = FALSE}
bg_genes <- T2G_prep(
  species_id = 96,
  category = c("KEGG"),
  genes = rownames(hypoxia_deseq)
)
```

```{r t2g_load, echo=FALSE}
bg_genes <- hypoxia_T2G
```


## Over-Representation Analysis (ORA)

With our gene lists and background annotation ready, we can perform an Over-Representation Analysis (ORA). ORA determines whether known pathways (biological functions or processes) are over-represented (enriched) in an experimentally-derived gene list, such as our lists of up- and down-regulated genes.

We apply the `enricher()` function to both the "UP" and "DOWN" gene lists. We also set some cutoffs to ensure our results are robust.

```{r run_ora}
padj_cutoff <- 0.1
genecount_cutoff <- 5

# Apply enricher across Up/Down lists
res <- lapply(names(deg_results_list), function(direction) {
  enricher(
    gene = rownames(deg_results_list[[direction]]),
    pvalueCutoff = 0.05,
    TERM2GENE = bg_genes
  )
})

# Name the lists according to regulation
names(res) <- names(deg_results_list)
```

After performing enrichment, we can format and filter our results. It's often helpful to combine the "UP" and "DOWN" results into a single data frame and transform the p-values for easier visualization.

```{r format_results}
# Bind results by row
res_df <- lapply(names(res), function(x) {
  res_chunk <- res[[x]]@result
  res_chunk$diffexp <- x # Keep track of the regulation direction
  return(res_chunk)
})
res_df <- do.call(rbind, res_df)

# Filter rows that meet our significance and count standards
target_pws <- res_df$p.adjust < padj_cutoff & res_df$Count > genecount_cutoff

# Filter and select top terms
res_sig <- res_df[target_pws, ] |>
  arrange(p.adjust) |>
  group_by(diffexp) |>
  slice_head(n = 20) |> # get up to 20 top pathways per direction
  ungroup()

# View enriched pathways for UP-regulated genes
res_up <- res_sig |> filter(diffexp == "UP")
head(res_up[, c("GeneRatio", "FoldEnrichment", "p.adjust")])
# View enriched pathways for DOWN-regulated genes
res_down <- res_sig |> filter(diffexp == "DOWN")
head(res_up[, c("GeneRatio", "FoldEnrichment", "p.adjust")])
```

## Gene Set Enrichment Analysis (GSEA)

Unlike ORA, which primarily relies on an artificial threshold to define a "significant" gene list, Gene Set Enrichment Analysis (GSEA) uses all the genes from your experiment. It determines whether members of a pathway tend to occur toward the top (or bottom) of a ranked gene list.

To perform GSEA, we must rank all valid genes. One typical approach is to calculate a metric based on the fold change and the significance level. However, a simpler metric is often the `log2FoldChange` itself. 

```{r gsea_prep}
# Create a ranked geneList
geneList <- setNames(
  object = hypoxia_deseq$log2FoldChange,
  nm = rownames(hypoxia_deseq)
)

# Order the list from highest to lowest
geneList <- sort(geneList, decreasing = TRUE)
```

With the ranked `geneList`, we can run the `GSEA()` function from clusterProfiler. We will utilize the same background mapping (`bg_genes`) downloaded earlier.

```{r run_gsea, warning=FALSE}
# Execute GSEA
gsea <- GSEA(
  geneList = geneList,
  TERM2GENE = bg_genes,
  pvalueCutoff = 0.05,
  maxGSSize = 2000
)

# Extract and filter top results
gsea_res <- gsea@result |>
  arrange(p.adjust) |>
  slice(1:20)

head(gsea_res[, c("setSize", "enrichmentScore", "NES", "p.adjust")])
```

By leveraging both ORA and GSEA, we can obtain complementary perspectives on the functional changes induced by our experimental conditions. The `pathdb` functions help seamlessly bridge basic count/results data with advanced pathway analysis libraries.