--- title: "Getting Started with pepdiff" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with pepdiff} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## What is pepdiff? pepdiff performs differential abundance analysis for PRM (Parallel Reaction Monitoring) proteomics data. Given peptide abundance measurements across experimental conditions, it identifies which peptides show significant changes between groups. pepdiff is the analysis companion to [peppwR](https://github.com/TeamMacLean/peppwR), which handles power analysis and experimental planning. The workflow is: - **peppwR**: "How many samples do I need?" (before the experiment) - **pepdiff**: "What's differentially abundant?" (after the experiment) ## Installation ```{r eval = FALSE} # Install from GitHub devtools::install_github("TeamMacLean/pepdiff") ``` Some features require Bioconductor packages: ```{r eval = FALSE} # For heatmaps BiocManager::install("ComplexHeatmap") # For rank products test BiocManager::install("RankProd") ``` ## Setup ```{r setup, message = FALSE} library(pepdiff) library(dplyr) ``` ## Creating example data We'll generate synthetic data to demonstrate the workflow. This represents a well-powered experiment: control vs treatment, with 10 biological replicates per group and 50 peptides measured. We use 3-fold changes for differential peptides - large enough to detect reliably with this sample size. ```{r generate-data} set.seed(123) # Experimental design # 10 reps and 3-fold changes give us good power to detect true effects n_peptides <- 50 n_reps <- 10 conditions <- c("ctrl", "trt") # Generate peptide names and gene IDs peptides <- paste0("PEP_", sprintf("%03d", 1:n_peptides)) genes <- paste0("GENE_", LETTERS[((1:n_peptides - 1) %% 26) + 1]) # 30% of peptides (15 out of 50) will be truly differentially abundant n_diff <- round(n_peptides * 0.3) diff_peptides <- sample(peptides, n_diff) cat("Truly differential peptides:", n_diff, "\n") # Generate abundance data # Proteomics abundances are typically right-skewed (can't go below zero, # multiplicative processes). We use a Gamma distribution to reflect this. sim_data <- expand.grid( peptide = peptides, treatment = conditions, bio_rep = 1:n_reps, stringsAsFactors = FALSE ) %>% mutate( gene_id = genes[match(peptide, peptides)], # Base abundance varies by peptide (some peptides more abundant than others) base_abundance = rep(rgamma(n_peptides, shape = 5, rate = 0.5), each = length(conditions) * n_reps), # Treatment effect for differential peptides (~3-fold change up or down) fc = ifelse(peptide %in% diff_peptides & treatment == "trt", sample(c(0.33, 3), n_diff * n_reps, replace = TRUE, prob = c(0.3, 0.7)), 1), # Final value with biological noise value = rgamma(n(), shape = 10, rate = 10 / (base_abundance * fc)) ) %>% select(peptide, gene_id, treatment, bio_rep, value) head(sim_data) ``` To use `read_pepdiff()`, we need data in a CSV file. Let's write our simulated data to a temporary file: ```{r write-temp-csv} temp_file <- tempfile(fileext = ".csv") write.csv(sim_data, temp_file, row.names = FALSE) ``` ## Importing data The `read_pepdiff()` function imports your data and creates a structured object for analysis. You tell it which columns contain what: ```{r import-data} dat <- read_pepdiff( file = temp_file, id = "peptide", # peptide identifiers gene = "gene_id", # gene identifiers value = "value", # abundance values factors = "treatment", # experimental factors replicate = "bio_rep" # biological replicate IDs ) dat ``` The print output shows you what you've got: number of peptides, experimental design, and data quality summary. The `pepdiff_data` object contains: - **data**: Your measurements in long format - **factors**: The experimental factors you specified - **design**: Summary of your experimental structure - **missingness**: Per-peptide missing data statistics For more detail: ```{r summary-data} summary(dat) ``` ## Quick look at your data Before running any analysis, visualise your data. The `plot()` method gives you a multi-panel diagnostic view: ```{r plot-data, fig.height = 6} plot(dat) ``` What to look for: - **PCA**: Replicates from the same condition should cluster together. If samples scatter randomly, you may have a quality issue. - **Distributions**: Abundance distributions should be similar across samples. Large shifts suggest normalisation problems. - **Missingness**: Random scatter is fine. Systematic patterns (whole peptides missing in one condition) suggest MNAR (missing not at random) - those peptides may not be analysable. ## Running a comparison Now for the analysis. The `compare()` function tests for differential abundance: ```{r compare} results <- compare( dat, compare = "treatment", # which factor to test ref = "ctrl" # reference level (what to compare against) ) results ``` The "Marked significant" count in the output is the number of peptides *detected* as significant by the statistical test - not the number we know to be truly differential. With 15 truly differential peptides in our simulation, we hope to detect most of them while minimising false positives. By default, `compare()` uses a Gamma GLM (generalised linear model). Why Gamma? Proteomics abundances are positive and right-skewed - they arise from multiplicative processes and can't go below zero. The Gamma distribution models this naturally, and the log link means coefficients are interpretable as fold changes. ## Understanding results The results object contains everything you need: ```{r results-structure} # The main results table head(results$results) ``` Key columns: - **fold_change**: Ratio of treatment to control (>1 means higher in treatment) - **log2_fc**: Log2 of fold change (easier to interpret: 1 = doubled, -1 = halved) - **p_value**: Statistical significance - **fdr**: False discovery rate-adjusted p-value (use this for significance calls) - **significant**: TRUE if fdr < alpha (default 0.05) ### Getting significant hits The `significant()` function extracts peptides that passed the significance threshold: ```{r significant} sig_peptides <- significant(results) sig_peptides ``` ### Visualising results The `plot()` method for results shows you the key diagnostic plots: ```{r plot-results, fig.height = 6} plot(results) ``` What to look for: - **Volcano plot**: Points at the edges (large fold change + low p-value) are your confident hits. Symmetric spread is typical; all hits in one direction might indicate a global shift. - **P-value histogram**: Under the null hypothesis (no true signal), p-values are uniformly distributed. A spike near zero indicates true positives. A U-shape suggests p-value inflation (something's wrong). - **Fold change distribution**: Should be centered near zero on log2 scale. A systematic shift suggests normalisation issues. ## Comparing with our simulation Since we generated this data, we know which peptides were truly differential. Let's check how well we did: ```{r check-results} # True positives: correctly identified as significant true_pos <- sig_peptides$peptide[sig_peptides$peptide %in% diff_peptides] # False positives: called significant but not truly differential false_pos <- sig_peptides$peptide[!sig_peptides$peptide %in% diff_peptides] # False negatives: truly differential but not called significant false_neg <- diff_peptides[!diff_peptides %in% sig_peptides$peptide] cat("True positives:", length(true_pos), "\n") cat("False positives:", length(false_pos), "\n") cat("False negatives:", length(false_neg), "\n") ``` With 10 replicates and 3-fold effect sizes, we have good power to detect most true positives. FDR control at 5% means that among peptides called significant, we expect roughly 5% to be false positives - that's the trade-off you're making. With smaller sample sizes or weaker effects, you'd see more false negatives (missed true positives). ## Next steps This vignette covered the basic workflow. For more detail, see: - `vignette("pairwise_tests")`: Two-group comparisons with different statistical tests - `vignette("diagnostic_plots")`: Visual quality control and interpretation - `vignette("glm_analysis")`: Factorial designs with multiple factors - `vignette("art_analysis")`: Non-parametric analysis when GLM assumptions fail ```{r cleanup, include = FALSE} unlink(temp_file) ```