Practical Differential expression analysis with edgeR

Practical 5: Differential expression analysis with edgeR

In this tutorial, we will perform a basic differential expression analysis with RNA sequencing data using R/Bioconductor. Before class, please download the data set and install the software as explained in the following section.

Download data and install software

For this tutorial, we will use the data set generated by the Sequencing Quality Control (SEQC) project. Please download the data by running the line of R code below. If this fails, you can try changing the method parameter, e.g. method = "auto", method = "wget" or method = "curl". And as a last resort, you can download it by clicking on this link.

```{r download-data} fname <- “http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz” download.file(fname, destfile = “GSE49712_HTSeq.txt.gz”)


Next, we need to download and install the package we will use to perform the analysis: [edgeR][].

[edgeR]: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

```{r install-packages, eval = FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR", dependencies = TRUE)

Introduction

As you learned in class, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments (see figure below from r citet(c(Wang2009="10.1038/nrg2484"))). The number of reads that map to a gene is the proxy for its expression level.

RNA-seq experiment

There are many steps between receiving the raw reads from the sequencer and gaining biological insight. In this tutorial, we will start with a “Table of counts” and end with a “List of differentially expressed genes”, as diagrammed in the RNA-seq analysis pipeline below (from r citet(c(Oshlack2010="10.1186/gb-2010-11-12-220"))).

RNA-seq analysis pipeline

Unlike the continuous data that is generated by microarrays, RNA-seq generates counts. While it is possible to transform the counts into a continuous distribution and perform an analysis similar to microarrays, e.g. linear regression, multiple approaches have been developed to directly model the data as counts (r citet(c(Marioni2008="10.1101/gr.079558.108"))). We will perform a simple analysis using one popular Bioconductor package for differential expression analysis, [edgeR][]. edgeR (r citet(c(Robinson2007="10.1093/bioinformatics/btm453", Robinson2010="10.1093/bioinformatics/btp616"))) models the count data with a negative binomial model.

The SEQC data set we will analyze contains two different groups for comparison. Group A is five technical replicates of the Stratagene Universal Human Reference RNA. It is an equal mixture of RNA from ten different human cell lines. Group B is five technical replicates of Ambion’s Human Brain Reference RNA. It is RNA that was pooled from several donors from several different regions of the brain.

Setup

First, we need to load the package into R.

```{r load-package} library(“edgeR”)


Second, we need to load the data into R.
In order to use the code below, we need to ensure the data file is in R's working directory.
We can check our current working directory with the command `getwd()`, and we can change it using `setwd("path/to/file")` or using RStudio (type Ctrl-Shift-K).

```{r import-data}
data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE)

Quality control

Let’s quickly explore the data. Each each row corresponds to a gene, and each column corresponds to a sample.

```{r view} dim(data_raw) head(data_raw) tail(data_raw)


Notice that the last five lines contain summary statistics and thus need to be removed prior to testing.

```{r remove-sum-lines}
data_clean <- data_raw[1:(nrow(data_raw) - 5), ]

We should also remove genes that are unexpressed or very lowly expressed in the samples. One simple method to do this is to choose a cutoff based on the median log~2~-transformed counts per gene per million mapped reads (cpm). edgeR provides the function, cpm, to compute the counts per million.

```{r expr-cutoff} cpm_log <- cpm(data_clean, log = TRUE) median_log2_cpm <- apply(cpm_log, 1, median) hist(median_log2_cpm) expr_cutoff <- -1 abline(v = expr_cutoff, col = “red”, lwd = 3) sum(median_log2_cpm > expr_cutoff) data_clean <- data_clean[median_log2_cpm > expr_cutoff, ]


After removing all genes with a median log~2~ cpm below `r expr_cutoff`, we have `r sum(median_log2_cpm > expr_cutoff)` genes remaining.
A good rule of thumb when analyzing RNA-seq data from a single cell type is to expect 9-12 thousand expressed genes.
In this case, we have many more because the samples include RNA collected from many different cell types, and thus it is not surprising that many more genes are robustly expressed.

**Question:**
What are some ways that we could improve this simple cutoff?
What information are we ignoring?

After filtering lowly expressed genes, we recalculate the log~2~ cpm.

```{r}
cpm_log <- cpm(data_clean, log = TRUE)

We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix. The heatmap result corresponds to what we know about the data set. First, the samples in group A and B come from very different cell populations, so the two groups are very different. Second, since the samples in each group are technical replicates, the within group variance is very low.

```{r heatmap} heatmap(cor(cpm_log))


Another method to view the relationships between samples is principal components analysis (PCA).

```{r pca}
pca <- prcomp(t(cpm_log), scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], pch = ".", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = colnames(cpm_log))
summary(pca)

Two group comparison

Now we start preparing the data for the the test of differential expression. We create a vector called, group, that labels each of the columns as belonging to group A or B. We then use this vector and the gene counts to create a DGEList, which is the object that edgeR uses for storing the data from a differential expression experiment.

```{r make-groups-edgeR} group <- substr(colnames(data_clean), 1, 1) group y <- DGEList(counts = data_clean, group = group) y



edgeR normalizes the genes counts using the method TMM (trimmed means of m values) developed by `r citet(c(Robinson2010="10.1186/gb-2010-11-3-r25"))`.
Recall from lecture that the read counts for moderately to lowly expressed genes can be strongly influenced by small fluctuations in the expression level of highly expressed genes.
In other words, small differences in expression of highly expressed genes between samples can give the appearance that many lower expressed genes are differentially expressed between conditions.
TMM adjusts for this by removing the extremely lowly and highly expressed genes and also those genes that are very different across samples.
It then compares the total counts for this subset of genes between the two samples to get the scaling factor (this is a simplification).
Similar to normalization methods for microarray data, this method assumes the majority of genes are not differentially expressed between any two samples.

```{r normalize-edgeR}
y <- calcNormFactors(y)
y$samples

The next step is to model the variance of the read counts per gene. A natural method for modeling gene counts is the Poisson distribution. However, the Poisson assumes the mean and variance are identical, but it has been found empirically that the variance in RNA-seq measurements of gene expression are larger than the mean (termed “overdispersion”). So instead the negative binomial distribution is used, which has a dispersion parameter for modeling the increase in variance from a Poisson process. edgeR treats the Poisson variance as simple sampling variance, and refers to the dispersion estimate as the “biological coefficient of variation.” Though it should be mentioned that any technical biases are also included in this estimate. edgeR shares information across genes to determine a common dispersion. It extends this to a trended dispersion to model the mean-variance relationship (lowly expressed genes are typically more noisy). Lastly it calculates a dispersion estimate per gene and shrinks it towards the trended dispersion. The gene-specific (referred to in edgeR as tagwise) dispersion estimates are used in the test for differential expression.

```{r model-edgeR} y <- estimateDisp(y) sqrt(y$common.dispersion) # biological coefficient of variation plotBCV(y)


The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates.

edgeR tests for differential expression between two classes using a method similar in idea to the Fisher's Exact Test.

```{r test-edgeR}
et <- exactTest(y)
results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none")
head(results_edgeR$table)

How many genes are differentially expressed at an FDR of 10%?

```{r count-de-edgeR} sum(results_edgeR$table$FDR < .1) plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1]) abline(h = c(-2, 2), col = “blue”)


As expected from the description of the samples and the heatmap, there are many differentially expressed genes.
The [MA plot][ma] above plots the log~2~ fold change on the y-axis versus the average log~2~ counts-per-million on the x-axis.
The red dots are genes with an FDR less than 10%.
The blue lines represent a four-fold change in expression.

[ma]: https://en.wikipedia.org/wiki/MA_plot

## Adding covariates

The previous example was a two group comparison, but if you have additional covariates, you'll need to use a generalized linear model (GLM) framework.
Let's say we processed the samples in two batches and also recorded the [RIN scores][rin] to control for differences in RNA quality.

[rin]: http://www.genomics.agilent.com/article.jsp?pageId=2181&_requestid=506775

```{r}
set.seed(123)
batch <- sample(c("one", "two"), 10, replace = TRUE)
rin <- sample(6:10, 10, replace = TRUE)

We need to create a design matrix to describe the statistical model.

y <- DGEList(data_clean)
y <- calcNormFactors(y)
design <- model.matrix(~group + batch + rin)
design

And now we test. The argument coef = 2 corresponds to testing the second column of the design matrix, which in this case is whether the sample is from group A or B.

y <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)
topTags(lrt)

And here is an example gene.

boxplot(as.numeric(data_clean["HBB", ]) ~ group)

To learn more

  • Check out the manual for edgeR and search the Bioconductor support site.
  • For a rigorous comparison of RNA-seq methods, see r citet(c(Rapaport2013="10.1186/gb-2013-14-9-r95")) and r citet(c(Soneson2013="10.1186/1471-2105-14-91")).
  • The source code for this lesson is available as a GitHub Gist.
  • In addition to [edgeR][], there are many statistical tools available for performing differential expression analysis. Other commonly used count-based methods are DESeq2 and limma+voom. For a different approach from the traditional count-based methods, check out Cufflinks.

References

This tutorial was based on this post (link)[https://gist.githubusercontent.com/jdblischak/11384914/raw/851abe82bdf0de443631c4479cfbeca027333232/rnaseq-de-tutorial.Rmd]