--- title: "Using GxEScanR" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using GxEScanR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, include = FALSE} library(GxEScanR) library(BinaryDosage) ``` ## Introduction GxEScanR is designed to efficiently run genome-wide association study (GWAS) and genome-wide by environment interaction study (GWEIS) scans using imputed genotypes stored in the BinaryDosage format. The phenotype to be analyzed can either be a continuous or binary trait. The GWEIS scan performs multiple tests that can be used in two-step methods. ## Notation and Models There are five models that can be fit using GxEScanR when running a GWEIS, plus one that must be fit before running the GWEIS. $Y$ represents the outcome or phenotype $X$ represents the confounding covariates $E$ represents the covariate that will have its genetic interaction tested $G$ represents the genetic data $\beta$ represents the effect estimate that will be fitted in the model ### Model 0 - Environment-only Model This model contains the phenotype, confounding covariates, and the interaction covariate. It must be fit prior to running the GWEIS using the glm() routine. $$Y = \beta_X X + \beta_E E$$ ### Model 1 - Gene-only Model This model contains all the values from the environment-only model and adds the genetic data. $$Y = \beta_X X + \beta_E E + \beta_G G$$ $\beta_G$ is the only term tested in this model and is represented by `bg_ge` in the output. ### Model 2 - GxE Interaction Model This model contains all the values from the gene-only model and adds the gene-environment interaction term. $$Y = \beta_X X + \beta_E E + \beta_G G + \beta_{GxE} GE$$ $\beta_G$ and $\beta_{GxE}$ are both tested for this model as well as a joint test for both terms. They are represented by `bg_gxe`, `bgxe`, and `joint`, respectively, in the output. ### Model 3 - EG Model This model tests for a relationship between the environmental value of interest and the gene. If $E$ is coded 0,1, a logistic model is fitted; otherwise a linear model is fitted. $$E = \beta_X X + \beta_G G$$ $\beta_G$ is tested in this model and is represented by `bg_eg` in the output. ### Model 4 - Case-only Model This model is the same as model 3 except it is run on cases only. $\beta_G$ is tested and is represented by `bg_case` in the output. This model can only be run on a dichotomous phenotype. ### Model 5 - Control-only Model This model is the same as model 3 except it is run on controls only. $\beta_G$ is tested and is represented by `bg_ctrl` in the output. This model can only be run on a dichotomous phenotype. ## Steps to Run a GWEIS There are five steps to running a GWEIS using GxEScanR: 1. Preparing the Data (Genetic Data + Subject Data) 2. Fitting the Base Model 3. Allocating Memory for the GWEIS 4. Running the GWEIS 5. Processing the Results ## Step 1: Preparing the Data ### Genetic Data GxEScanR uses data stored in the BinaryDosage format. The BinaryDosage package converts files from the VCF format to the BinaryDosage format using the `vcftobd()` routine. The `getbdinfo()` routine is then used to load the information about the BinaryDosage file needed to run the GWEIS. The following example shows how to convert a VCF file to BinaryDosage format. Note that `vcftobd()` requires the `vcfppR` package. ```{r genetic-data-convert, eval = FALSE} vcffile <- system.file("extdata", "gendata.vcf.gz", package = "GxEScanR") exampledir <- tempdir() bdosefile <- file.path(exampledir, "gendata.bdose") BinaryDosage::vcftobd(vcffile = vcffile, bdose_file = bdosefile) bdinfo <- BinaryDosage::getbdinfo(bdosefile) ``` This vignette uses a pre-converted BinaryDosage file included with the package. ```{r genetic-data} bdosefile <- system.file("extdata", "gendata.bdose", package = "GxEScanR") bdinfo <- BinaryDosage::getbdinfo(bdosefile) exampledir <- tempdir() ``` ### Subject Data The subject data consists of subject IDs used to link the genetic data, and the phenotype and covariates used in the GWEIS. The data included with GxEScanR has two phenotypes (one continuous, one dichotomous) and two covariates (x1 and x2). **Important:** All values for phenotypes and covariates must be numeric; factors are not allowed. **Important:** All subject IDs must be character values. **Important:** When using a dichotomous phenotype, it must be coded 0,1. ```{r subject-data} subjectfile <- system.file("extdata", "subdata.rds", package = "GxEScanR") subjectdata <- readRDS(subjectfile) head(subjectdata) ``` ## Step 2: Fitting the Base Model The base model is fit using glm() using data that contains only the subject IDs, phenotype, and covariates of interest — no genetic data. It is important to: - Remove subjects with missing data. - Keep only subjects who have genetic data (listed in `bdinfo$samples`). - Keep the subject IDs in the data frame, as they are used to match genetic data to the correct phenotype and covariates. The formula passed to glm() must list the covariate whose interaction with the genetic data is being tested **last**. In this example we test the interaction with x1, so the formula is `y ~ x2 + x1`. ### Continuous Phenotype ```{r linear-model} # Remove y_logistic from the subject data lineardata <- subjectdata[, c(1, 2, 4, 5)] # Keep only subjects with complete data lineardata <- lineardata[complete.cases(lineardata), ] # Keep only subjects with genetic data lineardata <- lineardata[lineardata$subid %in% bdinfo$samples$sid, ] # Fit the linear model linearmodel <- glm(y_linear ~ x2 + x1, data = lineardata) ``` ### Dichotomous Phenotype ```{r logistic-model} # Remove y_linear from the subject data logisticdata <- subjectdata[, c(1, 3, 4, 5)] # Keep only subjects with complete data logisticdata <- logisticdata[complete.cases(logisticdata), ] # Keep only subjects with genetic data logisticdata <- logisticdata[logisticdata$subid %in% bdinfo$samples$sid, ] # Fit the logistic model logisticmodel <- glm(y_logistic ~ x2 + x1, family = binomial, data = logisticdata) ``` ## Step 3: Allocating Memory for the GWEIS Memory must be allocated before running the GWEIS using `gweis.mem()`. This pre-computation makes the scan substantially faster. Pass the fitted base model, the subject IDs from that model's data, and the list of tests to perform. Typical test sets: - **Continuous phenotype:** `"bg_ge"`, `"bg_gxe"`, `"bgxe"`, `"joint"` - **Dichotomous phenotype:** `"bg_ge"`, `"bg_gxe"`, `"bgxe"`, `"joint"`, `"bg_eg"`, `"bg_case"`, `"bg_ctrl"` ```{r allocate-memory} # Continuous outcome linearmem <- gweis.mem(gemdl = linearmodel, subids = lineardata$subid, tests = c("bg_ge", "bg_gxe", "bgxe", "joint")) # Dichotomous outcome logisticmem <- gweis.mem(gemdl = logisticmodel, subids = logisticdata$subid, tests = c("bg_ge", "bg_gxe", "bgxe", "joint", "bg_eg", "bg_case", "bg_ctrl")) ``` ## Step 4: Running the GWEIS Pass the memory object from `gweis.mem()`, the `bdinfo` object, a vector of SNP indices or IDs to test, and an output file path to `rungweis()`. Results are written as a tab-delimited text file. ```{r run-gweis} snpindex <- 1:nrow(bdinfo$snps) # Continuous outcome linearresults <- file.path(exampledir, "linear.txt") rungweis(gweismem = linearmem, bdinfo = bdinfo, snps = snpindex, outfilename = linearresults) # Dichotomous outcome logisticresults <- file.path(exampledir, "logistic.txt") rungweis(gweismem = logisticmem, bdinfo = bdinfo, snps = snpindex, outfilename = logisticresults) ``` ## Step 5: Processing the Results Read the output file with `read.table()`. The following describes the output columns, illustrated with the first three SNPs. ```{r read-results} lineardf <- read.table(linearresults, header = TRUE, sep = "\t") logisticdf <- read.table(logisticresults, header = TRUE, sep = "\t") ``` ### SNP Information and Allele Frequency Every row contains SNP ID (`snpid`), chromosome (`chr`), position (`loc`), reference allele (`ref`), alternate allele (`alt`), and alternate allele frequency (`aaf`). For a dichotomous outcome, `aaf_case` and `aaf_ctrl` are also included. ```{r snp-info-linear} knitr::kable(lineardf[1:3, 1:6], caption = "SNP information — continuous outcome") ``` ```{r snp-info-logistic} knitr::kable(logisticdf[1:3, 1:8], caption = "SNP information — dichotomous outcome") ``` ### Model 1 Output: `bg_ge` `bg_ge` is the $\beta_G$ estimate from model 1. `bg_ge_lrt` is the LRT 1 df chi-squared statistic for $\beta_G = 0$. Output when `"bg_ge"` is in `tests`. ```{r model1-output} knitr::kable(lineardf[1:3, 7:8], caption = "Model 1 results") ``` ### Model 2 Output: `bg_gxe`, `bgxe`, `joint` - `bg_gxe` / `bg_gxe_lrt`: $\beta_G$ estimate and LRT from model 2. Output when `"bg_gxe"` is in `tests`. - `bgxe` / `bgxe_lrt`: $\beta_{GxE}$ estimate and LRT. Output when `"bgxe"` is in `tests`. - `joint_lrt`: 2 df LRT for $\beta_G = 0$ and $\beta_{GxE} = 0$ jointly. Output when `"joint"` is in `tests`. ```{r model2-output} knitr::kable(lineardf[1:3, 9:13], caption = "Model 2 results") ``` ### Models 3–5 Output: `bg_eg`, `bg_case`, `bg_ctrl` Each test produces a $\beta_G$ estimate and a 1 df LRT statistic. Output when `"bg_eg"`, `"bg_case"`, or `"bg_ctrl"` are in `tests`, respectively. ```{r model3-5-output} knitr::kable(logisticdf[1:3, 16:21], caption = "Models 3-5 results") ```