--- title: "Database Access and Preparation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{data-access} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE) ``` ```{r setup} library(pathdb) ``` ## Introduction This vignette provides a tutorial for accessing the South Dakota State University (SDSU) bioinformatics/genomic database used in Integrated Differential Expression and Pathway Analysis ([iDEP](https://bioinformatics.sdstate.edu/idep/)). We will also discuss how to leverage this package to prepare data for further analysis. In current research scenarios, it is common to work with RNA-Seq counts data. The data shown below, `hypoxia_reads`, is an example of this kind of data that we may come across naturally. This data contains gene count readings for glioblastoma stem-like cells derived from a U87MG cell line under hypoxia, seen in the column names. The species of interest is Human (Homo sapiens). ```{r data} head(hypoxia_reads) ``` ## The `search_species()` function When considering Differential Expression and Pathway Analysis as methods for research, it may first be best for a researcher to convert the gene IDs of their data to the standard Ensembl or STRINGdb format. The SDSU database is great for this use case, but first we need to know: Does the database contain our species? To answer this and discover how we can identify our species in the database, we use the `search_species()` function. ```{r search_species, eval = FALSE} srch_results <- search_species( query = "Human", name_type = "all" ) srch_results ``` ```{r print_srch, echo = FALSE} pathdb:::human_srch ``` This search through the database yields the following species: * Human * Pediculus humanus corporis Human body louse_USDA * Pediculus humanus STRINGdb So we have found that our species of interest, Human, is contained in the bioinformatics database and is labeled with the **species ID 96**. While searching with `name_type = "all"` is most efficient, we can also search by primary name, academic name, and ID number. However, only Ensembl encoded species have an available entry in the `academicName` column, and searching by ID number requires prior knowledge of the database, so searching by "primary" name (`name2` column) or "all" is optimal for the most results. It is also often that the primary name includes the academic name, regardless. Below are examples of these search methods. #### Primary Example ```{r primary_srch, eval=FALSE} search_species( query = "Human", name_type = "primary" ) ``` ```{r primary_print, echo = FALSE} pathdb:::human_primary ``` #### Academic Example ```{r academic_srch, eval=FALSE} search_species( query = "Homo sapiens", name_type = "academic" ) ``` ```{r academic_print, echo = FALSE} pathdb:::human_academic ``` #### ID Example ```{r id_srch, eval=FALSE} search_species( query = 96, name_type = "id" ) ``` ```{r id_print, echo = FALSE} pathdb:::human_id ``` The most important piece of information to retain from the data returned by `search_species()` is the `id` column. This is how many functions in this package will know which species to use for their operations. ## Exploring Species Data Tables Now that we know our species ID is 96, we might want to discover what specific data and metadata is available for this species within the database. The `list_tables()` function allows us to see all the tables present for a given species. ```{r list_tables, eval=FALSE} table_names <- list_tables(species_id = 96) table_names ``` ```{r table_print, echo = FALSE} pathdb:::human_tables ``` Once we have identified a table of interest, we can retrieve its contents using the `get_table()` function. For instance, we can fetch the "geneInfo" table, which contains important gene metadata like chromosomes, start positions, GC content, Ensembl IDs, and more. ```{r get_table, eval=FALSE} gene_info <- get_table( species_id = 96, table = "geneInfo" ) head(gene_info[, 1:5]) ``` ```{r print_genes, echo=FALSE} head(pathdb:::human_genes[, 1:5]) ``` The functions `get_genes()` and `get_pathways()` can also be used to obtain gene metadata and pathway information relative to the list of genes being studied. Below we use the genes from our `hypoxia_reads` data to do so. Note that these functions rely on Ensembl ID conversions from `convert_id()`, discussed later in this vignette. ```{r gene_info, eval=FALSE} gene_info <- get_genes( species_id = 96, genes = rownames(hypoxia_reads) ) head(gene_info[, 1:5]) ``` ```{r print_info, echo=FALSE} head(pathdb:::hypoxia_genes[, 1:5]) ``` When retrieving pathway information, it is important to know which pathway databases (e.g. KEGG, GOBP, GOCC, etc.) are available for the species of interest. For this, we may use `path_categories()`. ```{r get_categories, eval = FALSE} categories <- path_categories(species_id = 96) head(categories, n = 10) ``` ```{r print_categories, echo=FALSE} head(pathdb:::categories) ``` ```{r rows_categories, eval=FALSE} nrow(categories) ``` ```{r print_nrow, echo=FALSE} nrow(pathdb:::categories) ``` We see here that the human species has 140 different pathway databases. To retrieve all pathways from all databases would be very inefficient. We speed up this process by using the `category` argument in `get_pathways()` to filter to specific pathway databases. This argument can be a character vector or only a single string. This feature of `pathdb` is especially versatile, as one can use a vector of pathway database names, e.g. `c("KEGG", "GOBP", "GOCC")`. In this way, we can check databases in one function call, rather than pulling from KEGG, GO, and other databases separately. Now we may obtain the specific set of pathways relative to the genes in our hypoxia data. ```{r gene_paths, eval=FALSE} path_info <- get_pathways( species_id = 96, genes = rownames(hypoxia_reads), category = c("GOBP") ) head(path_info) ``` ```{r print_paths, echo=FALSE} head(pathdb:::hypoxia_paths) ``` ## Converting/Filtering Genes Once we have identified the correct species ID for our data (in this case, our ID is 96), we can use the `convert_id()` function to standardize our gene IDs. This step is crucial for matching the gene names in SDSU's database, allowing us to retrieve pathway information for our genes and proceed to further analysis. If we only provide a vector of gene names to `convert_id()`, it will return a table mapping our original IDs to the standard Ensembl format. If we also provide the raw data frame, it will return the data frame with Ensemble gene names substituted, automatically filtering out genes that cannot be mapped. ```{r conv_vector, eval=FALSE} head(rownames(hypoxia_reads)) # Providing just a vector conv_table <- convert_id( genes = rownames(hypoxia_reads), species_id = 96 ) head(conv_table) ``` ```{r print_conv_vector, echo=FALSE} head(pathdb:::conversion_vector) ``` ```{r conv_data, eval=FALSE} # Providing a vector AND data hypoxia_conv <- convert_id( genes = rownames(hypoxia_reads), data = hypoxia_reads, species_id = 96 ) knitr::kable(head(hypoxia_conv)) ``` ```{r print_conv_data, echo=FALSE} hypoxia_conv <- pathdb:::conversion_data knitr::kable(head(hypoxia_conv)) ``` Please note that, if our genes cannot be mapped to Ensembl IDs, we can lose large portions of our data. To best negate this, ensure that your IDs are well documented, have Ensembl counterparts, and contain few special characters. ## Processing Data After standardizing our gene IDs, the next step prior to analysis is typically to clean our count data. This involves handling any missing values, filtering out genes with consistently low expression across samples, and potentially applying transformations to our count data, making it more suited for further exploratory or differential analysis. The `process_data()` function provides a compact way to complete these steps in our analysis. It can impute missing values using methods like `"geneMedian"` or `"treatAsZero"` and filter genes based on Counts Per Million (CPM). Note that if the given data contains a column of gene IDs, that column will be stored as the rownames of the processed data. Here is an example of processing the converted Human data using the default settings, which include imputing missing values with the overall gene median, filtering for genes with at least 0.5 CPM in at least 1 sample, and returning the raw counts without transformation. ```{r process_example} nrow(hypoxia_conv) processed_data <- process_data( data = hypoxia_conv, missing_value = "geneMedian", min_cpm = 0.5, n_min_samples = 1 ) nrow(processed_data) knitr::kable(head(processed_data[, 1:3])) ``` The above information covers all that is needed to properly access the SDSU bioinformatics database and prepare your own data for further analysis like differential expression and pathway enrichment. Any extra filtering of data will need to be on the user's end or recommended to the developers as a new feature. Feel free to make suggestions!