---
title: "Overview of MetaEntropy"
bibliography: "overview.bib"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Overview of MetaEntropy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Motivation

High-throughput sequencing has greatly increased the power of studies of intrapopulation viral diversity. Short-read methods are particularly popular because of their excellent cost-to-depth ratio and the availability of efficient algorithms, such as those based on the Burrows-Wheeler Transform, for mapping reads to reference genomes.

*MetaEntropy* is designed to process the typical output of variant calling tools: the variants present in a population, relative to a reference genome, and their corresponding frequencies. It generates "informed" estimates of entropy at each genomic position, which take into account the physicochemical characteristics of the amino acids in addition to allele frequencies. The calculated values are normalized to a scale of 0 to 1, which facilitates comparison across loci and across metagenomes.

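
As a point of reference for the 0 to 1 scale, the sketch below computes a plain Shannon entropy from amino acid frequencies and divides it by its maximum possible value. This is only an illustrative baseline under stated assumptions: the function name `normalized_entropy` and the choice of 20 possible states are hypothetical, and the sketch ignores the physicochemical weighting that makes the *MetaEntropy* estimates "informed".

```r
# Illustrative baseline only: NOT the MetaEntropy estimator.
# Shannon entropy of a frequency vector, scaled by its maximum, log(n_states).
normalized_entropy <- function(freqs, n_states = 20) {
  # Frequencies must be non-negative and sum to 1
  stopifnot(all(freqs >= 0), abs(sum(freqs) - 1) < 1e-8)
  p <- freqs[freqs > 0]            # treat 0 * log(0) as 0
  -sum(p * log(p)) / log(n_states)
}

# A fixed position (one residue at frequency 1) has no entropy
fixed <- normalized_entropy(1)

# A position with a 70/30 mix of two residues has intermediate entropy
mixed <- normalized_entropy(c(0.7, 0.3))

# All 20 residues at equal frequency reach the maximum
uniform <- normalized_entropy(rep(1 / 20, 20))
```

Dividing by `log(n_states)` is what bounds the value between 0 and 1, so that positions can be compared on a common scale.
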
# Input data

Single nucleotide variant (SNV) data must be provided in a data frame. It should contain columns indicating the reference genome positions that are polymorphic in the virome, the linkage relationships between them, the corresponding bases in the reference genome and in the virome, the protein and amino acid position affected by each mutation, the corresponding amino acids in the reference strain and in the metagenome, and the mutation frequencies. For example, the *wWater* dataset, which accompanies the package, has this information in the columns *position*, *linkage*, *ref*, *alt*, *protein*, *aa_position*, *ref_aa*, *alt_aa* and *alt_aa_freq*, which are the labels expected by default:

```{r loadMetaEntropy, results = "hide"}
library(MetaEntropy)
```

```{r seeLinked}
wWater[105:108, -1]
```

"Linked" mutations are variants that are physically connected in the genome and affect the same codon. There can also be positions with several different mutations:

```{r seewMultiple}
wWater[50:51, -1]
```

These variants collectively contribute to the entropy at the affected residue.

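
For readers preparing their own data, the following minimal sketch builds a toy data frame with the default column labels listed above and checks that none is missing. The frequencies are hypothetical, one-letter amino acid codes are assumed, and *linkage* is left as `NA` because its encoding is not described here; the object names `toy` and `expected` are illustrative only.

```r
# Toy input with the default column labels (hypothetical frequencies).
toy <- data.frame(
  position    = c(23013L, 23063L),   # reference genome positions
  linkage     = NA,                  # encoding not shown here, left empty
  ref         = c("A", "A"),         # bases in the reference genome
  alt         = c("C", "T"),         # bases in the virome
  protein     = c("S", "S"),
  aa_position = c(484L, 501L),
  ref_aa      = c("E", "N"),         # assumed one-letter amino acid codes
  alt_aa      = c("A", "Y"),
  alt_aa_freq = c(0.62, 0.35),       # made-up mutation frequencies
  stringsAsFactors = FALSE
)

# Check that every default label is present before running an analysis
expected <- c("position", "linkage", "ref", "alt", "protein",
              "aa_position", "ref_aa", "alt_aa", "alt_aa_freq")
missing_cols <- setdiff(expected, names(toy))
stopifnot(length(missing_cols) == 0)
```
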
# Worked example: SARS-CoV-2 immune escape

## Loading packages

First we load some packages used in this example:

```{r loadPackages, results = "hide"}
lapply(c("ggplot2", "patchwork"), library, character.only = TRUE)
```

## Conceptual framework and data

This example uses data from wastewater collected during the first and third COVID-19 waves in Argentina [BioProject PRJNA1183343; @manrique-jones2025]. The first wave was driven by ancestral SARS-CoV-2 lineages, while the third was dominated by Omicron subvariants. Omicron is a lineage highly adapted to humans, and its adaptations include hypervariation in the receptor-binding domain (RBD) in response to immunity. We will use entropy analysis to show the relevance of the RBD and to identify key mutations in it.

We begin by building separate datasets for ancestral and Omicron lineages.

```{r datasets}
firstWave <- wWater[ wWater$wave == "first", ]
thirdWave <- wWater[ wWater$wave == "third", ]
```

A 3% minor variant cutoff was already applied to these data by the variant caller. Following the same logic, we also discard mutations with a frequency above 97%, that is, positions where the minor variant is the reference genotype and falls below the 3% cutoff:

```{r filter}
firstWave <- firstWave[ firstWave$alt_aa_freq <= 0.97, ]
thirdWave <- thirdWave[ thirdWave$alt_aa_freq <= 0.97, ]
```

## Entropy signatures

Now we infer the ancestral and Omicron signatures and compare them graphically.

```{r signatures}
ancestral <- getEntropySignature(firstWave)
omicron <- getEntropySignature(thirdWave)

# Compare signatures
anc_plot <- plot(ancestral) + ggtitle("Ancestral")
omi_plot <- plot(omicron) + ggtitle("Omicron")
anc_plot / omi_plot
```

We see that entropy in the *S* gene is relatively high in the Omicron sublineages.

## Omicron sublineages in more detail

Now we look at how entropy is distributed along genome positions in the Omicron sublineages.

```{r omicronEntroScan, fig.width = 7}
plot(omicron, chartType = "entroScan")
```

A handful of *S* positions are responsible for the unusually high entropy.

```{r}
omicron$Entropy$position[ omicron$Entropy$entropy > 0.3 ]
```

Let's look at the corresponding mutations:

```{r}
showMutations(omicron, c(22882, 22898, 22917, 23013, 23040, 23048, 23055, 23063))
```

The literature tells us that these SNVs affect the RBD [@Lan2020]. Viral RBDs are targeted by the immune system because they are vital for cell binding and entry. There is solid observational and experimental evidence for S:N440K, S:G446S, S:L452R, S:E484A, S:Q493R, S:G496S, S:Q498R and S:N501Y [@Guigon2022; @Liang2022; @Liu2021; @Motozono2022; @Rani2021; @Starr2021; @Starr2021a; @Weisblum2020; @Zhang2022].

```{r, eval = FALSE}
# Search PubMed for more studies on these mutations:
library(rentrez)

# Create the search phrase
mutations <- showMutations(omicron, c(22882, 22898, 22917, 23013, 23040, 23048, 23055, 23063))
searchPhrase <- paste0(gsub("S:", "", mutations$phenotype, fixed = TRUE), "[TIAB]")
searchPhrase <- paste0("SARS-CoV-2[TITLE] AND spike [TIAB] AND (",
                       paste0(searchPhrase, collapse = " OR "), ")")

# Search
mySearch <- entrez_search(db = "pubmed", term = searchPhrase, retmax = 1000)

# Number of studies
mySearch$count
#> [1] 560
```

We can run a formal analysis with **`assessHotSpot()`**. The RBD is located between genomic positions 22517 and 23186.

```{r}
assessHotSpot(omicron, c(22517, 23186))
```

```{r}
assessHotSpot(ancestral, c(22517, 23186))
```

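
How `assessHotSpot()` works internally is not described here. As a rough intuition only, a hot spot can be thought of as a region whose entropies tend to exceed those of the rest of the genome. The sketch below asks that question with a one-sided Wilcoxon rank-sum test on a toy table. The entropy values are hypothetical, and the choice of test is an assumption, not necessarily the method implemented in the package.

```r
# Toy table with made-up entropies, mimicking the position and entropy
# columns seen in omicron$Entropy.
toy_entropy <- data.frame(
  position = c(21600, 22000, 22600, 22900, 23050,
               23100, 23500, 24000, 24500, 25000),
  entropy  = c(0.02, 0.05, 0.35, 0.41, 0.38,
               0.33, 0.04, 0.06, 0.03, 0.07)
)

# Region of interest: the RBD coordinates used above
region <- c(22517, 23186)
inside <- toy_entropy$position >= region[1] &
  toy_entropy$position <= region[2]

# Assumed test: are entropies inside the region shifted towards higher
# values than entropies outside it?
hot_spot_test <- wilcox.test(
  toy_entropy$entropy[inside],
  toy_entropy$entropy[!inside],
  alternative = "greater"
)
hot_spot_test$p.value
```

A rank-based test is used in this sketch because entropies are bounded and typically skewed, so no distributional shape needs to be assumed.
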
# Integrating biological signal via entropy

Entropy analysis categorizes observations and takes into account the corresponding frequencies in natural populations. This adds significant information to sequence data.

```{r}
summary(ancestral)
```

In the summary above, we observe that the *ORF6* and *N* proteins have high mutation rates, both overall (SNV/Kb) and non-synonymous (non-Syn/Kb). However, the corresponding entropies do not stand out among those of other proteins. This contrasts with *S*, which, despite having average mutation rates, exhibits entropies biased towards high values. This is even more evident for the Omicron sublineages:

```{r}
summary(omicron)
```

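
To make the per-kilobase rates discussed above concrete, the sketch below counts SNVs per protein in a toy table and divides by gene length in kilobases. It is an illustration under stated assumptions, not the computation performed by `summary()`: the rows are hypothetical, the gene lengths are approximate reference values entered by hand, and a mutation is treated as non-synonymous simply when *ref_aa* and *alt_aa* differ.

```r
# Toy SNV table (hypothetical rows) using the default column labels.
snv <- data.frame(
  protein = c("S", "S", "S", "N", "N", "ORF6"),
  ref_aa  = c("N", "G", "L", "R", "G", "D"),
  alt_aa  = c("K", "S", "L", "K", "G", "Y")
)

# Assumed gene lengths in nucleotides (approximate, for illustration).
gene_length <- c(S = 3822, N = 1260, ORF6 = 186)
kb <- unname(gene_length) / 1000

# Count all SNVs and non-synonymous SNVs per protein, in a fixed order
proteins <- factor(snv$protein, levels = names(gene_length))
snv_count <- as.vector(table(proteins))
nonsyn_count <- as.vector(table(proteins[snv$ref_aa != snv$alt_aa]))

rates <- data.frame(
  protein       = names(gene_length),
  snv_per_kb    = snv_count / kb,
  nonsyn_per_kb = nonsyn_count / kb
)
rates
```

A short gene such as *ORF6* can reach a high rate with very few mutations, which is one reason why rates alone say little about how variable each position is within the population.
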
# References