RNA-seq data analyss Count
RNA-Seq (RNA sequencing ) also called whole transcriptome sequncing use next-generation sequeincing (NGS) to reveal the presence and quantity of RNA in a biolgical sample at a given moment. As we discuss during the talk we can use different approach and different tools. Here we present the DEseq2 vignette it wwas composed using STAR and HTseqcount and then Deseq2. The other part we show kallisto Unfortunately our computer not allow the work some step was only for demonstration purpose.
RNA-seq workflow: gene-level exploratory analysis and differential expression:
Here we are some examples of working on R on Counts.
We can use R
conda create -n DESEQ
source activate DESEQ
conda install -y r-mixomics bioconductor-htsfilter bioconductor-deseq2
R
library(DESeq2)
library(RColorBrewer)
library(mixOmics)
library(HTSFilter)
directory <- "RNAseq_data/count_table_files/"
dir(directory)
*Exercise 3.1* Read the files:
rawCountTable <- read.table(paste0(directory,"count_table.tsv"), header=TRUE,
row.names=1)
sampleInfo <- read.table(paste0(directory,"pasilla_design.txt"), header=TRUE,
row.names=1)
*Exercise 3.2* Have a look at the count data:
head(rawCountTable)
nrow(rawCountTable)
*Exercise 3.3* Have a look at the sample information and order the count table
in the same way that the sample information:
sampleInfo
rawCountTable <- rawCountTable[,match(rownames(sampleInfo),
colnames(rawCountTable))]
head(rawCountTable)
*Exercise 3.4* Create the 'condition' column
sampleInfo$condition <- substr(rownames(sampleInfo), 1,
nchar(rownames(sampleInfo))-1)
sampleInfo$condition[sampleInfo$condition=="untreated"] <- "control"
sampleInfo$condition <- factor(sampleInfo$condition)
sampleInfo
*Exercise 3.5* Create a 'DESeqDataSet' object
ddsFull <- DESeqDataSetFromMatrix(as.matrix(rawCountTable), sampleInfo,
formula(~condition))
ddsFull
# 3.2. Starting from separate files
*Exercise 3.6* List all files in directory
directory <- "RNAseq_data/separate_files/"
sampleFiles <- list.files(directory)
sampleFiles
*Exercise 3.7* Create an object with file informations
keptFiles <- sampleFiles[-1]
sampleName <- sapply(keptFiles, function(afile)
substr(afile, 1, nchar(afile)-6))
condition<- sapply(keptFiles, function(afile)
substr(afile, 1, nchar(afile)-7))
fileInfo <- data.frame(sampleName = sampleName, sampleFiles = keptFiles,
condition = condition)
rownames(fileInfo) <- NULL
fileInfo
*Exercise 3.8* Construct a 'DESeqDataSet' object
ddsHTSeq <- DESeqDataSetFromHTSeqCount(fileInfo, directory, formula(~condition))
ddsHTSeq
# 3.3. Preparing the data object for the analysis of interest
*Exercise 3.9* Select the subset of paire-end samples
dds <- subset(ddsFull, select=colData(ddsFull)$type=="paired-end")
dds
colData(dds)
# 3.4 Data exploration and quality assessment
*Exercise 3.10* Extract pseudo-counts (*ie* $\log_2(K+1)$)
pseudoCounts <- log2(counts(dds)+1)
head(pseudoCounts)
*Exercise 3.11* Histogram for pseudo-counts (sample ```treated2```)
```{r histoPseudoCounts}
hist(pseudoCounts[,"treated2"])
*Exercise 3.12* Boxplot for pseudo-counts
boxplot(pseudoCounts, col="gray")
*Exercise 3.13* MA-plots between control or treated samples
par(mfrow=c(1,2))
## treated2 vs treated3
# A values
avalues <- (pseudoCounts[,1] + pseudoCounts[,2])/2
# M values
mvalues <- (pseudoCounts[,1] - pseudoCounts[,2])
plot(avalues, mvalues, xlab="A", ylab="M", pch=19, main="treated")
abline(h=0, col="red")
## untreated3 vs untreated4
# A values
avalues <- (pseudoCounts[,3] + pseudoCounts[,4])/2
# M values
mvalues <- (pseudoCounts[,3] - pseudoCounts[,4])
plot(avalues, mvalues, xlab="A", ylab="M", pch=19, main="control")
abline(h=0, col="red")
*Exercise 3.14* PCA for pseudo-counts
vsd <- varianceStabilizingTransformation(dds)
vsd
plotPCA(vsd)
*Exercise 3.15* heatmap for pseudo-counts (using ```mixOmics``` package)
sampleDists <- as.matrix(dist(t(pseudoCounts)))
sampleDists
cimColor <- colorRampPalette(rev(brewer.pal(9, "Blues")))(16)
cim(sampleDists, col=cimColor, symkey=FALSE)
# 3.5. Differential expression analysis
*Exercise 3.16* Run the DESeq2 analysis
dds <- DESeq(dds)
dds
# 3.6. Inspecting the results
*Exercise 3.17* Extract the results
res <- results(dds)
res
*Exercise 3.18* Obtain information on the meaning of the columns
mcols(res)
*Exercise 3.19* Count the number of significant genes at level 1%
sum(res$padj < 0.01, na.rm=TRUE)
*Exercise 3.20* Extract significant genes and sort them by the strongest down
regulation
sigDownReg <- res[!is.na(res$padj), ]
sigDownReg <- sigDownReg[sigDownReg$padj < 0.01, ]
sigDownReg <- sigDownReg[order(sigDownReg$log2FoldChange),]
sigDownReg
*Exercise 3.21* Extract significant genes and sort them by the strongest up
regulation
sigUpReg <- res[!is.na(res$padj), ]
sigUpReg <- sigUpReg[sigUpReg$padj < 0.01, ]
sigUpReg <- sigUpReg[order(sigUpReg$log2FoldChange, decreasing=TRUE),]
*Exercise 3.22* Create permanent storage of results
write.csv(sigDownReg, file="sigDownReg-deseq.csv")
write.csv(sigUpReg, file="sigUpReg-deseq.csv")
# 3.7 Diagnostic plot for multiple testing
*Exercise 3.23* Plot a histogram of unadjusted p-values after filtering
hist(res$pvalue, breaks=50)
# 3.8 Interpreting the DE analysis results
*Exercise 3.24* Create a MA plot showing differentially expressed genes
plotMA(res, alpha=0.01)
*Exercise 3.25* Create a Volcano plot
volcanoData <- cbind(res$log2FoldChange, -log10(res$padj))
volcanoData <- na.omit(volcanoData)
colnames(volcanoData) <- c("logFC", "negLogPval")
head(volcanoData)
plot(volcanoData, pch=19, cex=0.5)
*Exercise 3.26* Transform the normalized counts for variance stabilization
vsnd <- varianceStabilizingTransformation(dds, blind=FALSE)
vsnd
*Exercise 3.27* Extract the transformed data
head(assay(vsnd), 10)
selY <- assay(vsnd)[!is.na(res$pval), ]
selY <- selY[res$pval[!is.na(res$pval)] < 0.01,]
cimColor <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)[255:1]
cim(t(selY), col=cimColor, symkey=FALSE)
In this session we want to perform some differential expression from two conditions as example (Normal vs tumor RNA-seq). This data use for this tutorial are pubblicaly avaible.
Please download the material here:
Please follow this command
cd ~/Desktop/
tar xvf TUTORIAL_RNA_snakemake.tar.gz
cd TUTORIAL
conda env create -f snakemake_differential.yml
Please launch the analysis in this way:
snakemake -s Snakefile.Kallisto --configfile database3.yml --use-conda
Please Connect and see this tutorial on live sleuth:
[youtube](https://www.youtube.com/watch?v=KEn0CMYk6Wo)
Here antoher way to do the analysis. Please familiarize with the results
cd /home/bioinfo2/Data/kallisto-nf
sudo nextflow run cbcrg/kallisto-nf -with-docker
Please follow this tutorial [link] (http://www.nathalievilla.org/doc/html/solution_edgeR-tomato.html#where-to-start-installation-and-alike) Pratical rnaseq data using tomato data