Practical 4 Hisat2
HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
This brief tutorial will explain how you can get started using Hisat2 to quantify your RNA-seq data. Our computational power are not enought for elaborate a rela case so we use only a part of the data. We suggest to try at home with the complete stock of data. Computer avaible are not born for this purpose (oly 4G of RAM, It is not enough for the new operation systems) we need more patience
conda activate hisat2
hisat2 -h
Use the UNIX command wget to pull the data off the FTP server hosting the data we will be working with. Use the command cd [Options] [Directory] to change into your desired ~/working_directory and then download these files.
$ wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz
Extract Tutorial Files Using the UNIX command tar xvzf, we can extract the .tar.gz files into our ~/working_directory. The command option ‘x’ extracts the files, ‘v’ lists the processed files verbosely, ‘z’ filters the archive through gzip, and ‘f’ tells tar to used the archived file.
$ tar xvzf chrX_data.tar.gz
Using HISAT2, we can align our sample .fastq.gz files (without the need to unzip them) to the indexed reference genome, that has already been prepared, located in the chrX_data/indexes/ directory. Doing so will generate our SAM (Sequence Alignment Map) files we will use in later steps. First we type out hisat2 to denote the command we are using. The options entered here are ‘-p 8’ denoting the use of 8 threads, ‘–dta’ is used to generate output SAM files that can be directly read into StringTie, ‘-x’ is used to denote the indexed reference genome, ‘-1’ and ‘-2’ are used to denote our fwd and rev samples in a paired-end alignment, and ‘-S’ is used to denote that we would like our output in SAM format.
$ hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam
Here is a bash script for the above HISAT2 command called hisat2.sh that will run all the .fastq.gz files for you simultaneously.
#!/usr/bin/bash
#bash script for hisat2; align all .fastq.gz files to indexed reference genome to generate .sam files
SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"
for SAMPLE in $SAMPLES; do
hisat2 -p 11 --dta -x ~/chrX_data/indexes/chrX_tran -1 ~/chrX_data/samples/${SAMPLE}_chrX_1.fastq.gz -2 ~/chrX_data/samples/${SAMPLE}_chrX_2.fastq.gz -S ${SAMPLE}_chrX.sam
done
or you can copy this:
# map each sample using 1 threads
mkdir map
hisat2 -p 1 --rg-id=ERR188044 --rg SM:ERR188044 --rg PL:ILLUMINA --rg PU:aaaaaaa -x chrX_data/indexes/chrX_tran --dta --rna-strandness RF -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S map/ERR188044_chrX.sam
hisat2 -p 1 --rg-id=ERR188104 --rg SM:ERR188104 --rg PL:ILLUMINA --rg PU:aaaaaaa -x chrX_data/indexes/chrX_tran --dta --rna-strandness RF -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S map/ERR188104_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S map/ERR188234_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S map/ERR188245_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S map/ERR188257_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran-1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S map/ERR188273_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S map/ERR188337_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S map/ERR188383_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S map/ERR188401_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S map/ERR188428_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S map/ERR188454_chrX.sam
hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S map/ERR204916_chrX.sam
In order to generate files that we can use for StringTie and view in IGV (Integrative Genomics Viewer), we need to convert our SAM files to BAM (Binary Alignment Map) files using SAMTools. A BAM file is the binary version of a SAM file and is used to assemble aligned reads into transcripts using StringTie and is also the preferred file format for viewing in IGV. We will use the samtools command with the options: ‘sort’ to sort the alignments by the leftmost coordinates, ‘-@ 8’ to denote the usage of 8 threads, ‘-o’ to denote that we want our outputs to be BAM files in [out.bam] format, and finally we enter our [input.sam] files.
$ samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam
We also have a bash script for the above command called sort.sh that will do all our .sam files simultaneously.
#!/usr/bin/bash
#bash script for samtools; convert .sam files to .bam files
SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"
for SAMPLE in $SAMPLES; do
samtools sort -@ 11 -o ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.sam
done
or you can use
# sort all
samtools sort -@ 1 -o map/ERR188044_chrX.bam map/ERR188044_chrX.sam
samtools sort -@ 1 -o map/ERR188104_chrX.bam map/ERR188104_chrX.sam
samtools sort -@ 1 -o map/ERR188234_chrX.bam map/ERR188234_chrX.sam
samtools sort -@ 1 -o map/ERR188245_chrX.bam map/ERR188245_chrX.sam
samtools sort -@ 1 -o map/ERR188257_chrX.bam map/ERR188257_chrX.sam
samtools sort -@ 1 -o map/ERR188273_chrX.bam map/ERR188273_chrX.sam
samtools sort -@ 1 -o map/ERR188337_chrX.bam map/ERR188337_chrX.sam
samtools sort -@ 1 -o map/ERR188383_chrX.bam map/ERR188383_chrX.sam
samtools sort -@ 1 -o map/ERR188401_chrX.bam map/ERR188401_chrX.sam
samtools sort -@ 1 -o map/ERR188428_chrX.bam map/ERR188428_chrX.sam
samtools sort -@ 1 -o map/ERR188454_chrX.bam map/ERR188454_chrX.sam
samtools sort -@ 1 -o map/ERR204916_chrX.bam map/ERR204916_chrX.sam
To view our BAM files in IGV, we need to index them and for this we also use SAMTools. IGV won’t accept our .bam files without an accompanying .bam.bai file, so in order to view our .bam files in IGV, this step is essential. Using the samtools command with the ‘index’ option, we enter out [in.bam] files and receive [out.bam.bai] files. With these two files in hand, we can now view our data using IGV!
$ samtools index ERR188044_chrX.bam ERR188044_chrX.bam.bai
Again, we have a bash script called index.sh that will index all our .bam files and generate .bam.bai files simultaneously.
#!/usr/bin/bash
#bash script for samtools; index our .bam files to obtain .bam.bai files using samtools
SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"
for SAMPLE in $SAMPLES; do
samtools index ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.bam.bai
done
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments longer sequences that have been assembled from those reads.In order to identify differentially expressed genes between experiments, StringTie’s output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.). manual[https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual]
# remove SAM files
rm map/*.sam
mkdir assembly
# create assembly per sample using 8 threads
stringtie map/ERR188044_chrX.bam -l ERR188044 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188044_chrX.gtf
stringtie map/ERR188104_chrX.bam -l ERR188104 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188104_chrX.gtf
stringtie map/ERR188234_chrX.bam -l ERR188234 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188234_chrX.gtf
stringtie map/ERR188245_chrX.bam -l ERR188245 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188245_chrX.gtf
stringtie map/ERR188257_chrX.bam -l ERR188257 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188257_chrX.gtf
stringtie map/ERR188273_chrX.bam -l ERR188273 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188273_chrX.gtf
stringtie map/ERR188337_chrX.bam -l ERR188337 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188337_chrX.gtf
stringtie map/ERR188383_chrX.bam -l ERR188383 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188383_chrX.gtf
stringtie map/ERR188401_chrX.bam -l ERR188401 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188401_chrX.gtf
stringtie map/ERR188428_chrX.bam -l ERR188428 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188428_chrX.gtf
stringtie map/ERR188454_chrX.bam -l ERR188454 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188454_chrX.gtf
stringtie map/ERR204916_chrX.bam -l ERR204916 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR204916_chrX.gtf
# before merging we need to modify mergelist.txt
# this is because I created a new directory to store the results
# the modified mergelist.txt should look like this
cat chrX_data/mergelist.txt
assembly/ERR188044_chrX.gtf
assembly/ERR188104_chrX.gtf
assembly/ERR188234_chrX.gtf
assembly/ERR188245_chrX.gtf
assembly/ERR188257_chrX.gtf
assembly/ERR188273_chrX.gtf
assembly/ERR188337_chrX.gtf
assembly/ERR188383_chrX.gtf
assembly/ERR188401_chrX.gtf
assembly/ERR188428_chrX.gtf
assembly/ERR188454_chrX.gtf
assembly/ERR204916_chrX.gtf
Transcript merge mode. This is a special usage mode of StringTie, distinct from the assembly usage. In the merge mode, StringTie takes as input a list of GTF/GFF files and merges/assembles these transcripts into a non-redundant set of transcripts. This mode is used in the new differential analysis pipeline to generate a global, unified set of transcripts (isoforms) across multiple RNA-Seq samples.
# merge all transcripts from the different samples
stringtie --merge -p 1 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt
# check out the transcripts
cat stringtie_merged.gtf | head
# stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt
# StringTie version 1.3.3b
chrX StringTie transcript 322514 323718 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";
chrX StringTie exon 322514 323718 1000 . . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";
chrX StringTie transcript 319145 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027232"; gene_name "LINC00685"; ref_gene_id "NR_027232";
chrX StringTie exon 319145 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027232"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027232";
chrX StringTie transcript 319145 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027231"; gene_name "LINC00685"; ref_gene_id "NR_027231";
chrX StringTie exon 319145 319551 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027231";
chrX StringTie exon 320208 321319 1000 + . gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "2"; gene_name "LINC00685"; ref_gene_id "NR_027231";
chrX StringTie transcript 304750 318701 1000 - . gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";
# how many transcripts?
cat stringtie_merged.gtf | grep -v "^#" | awk '$3=="transcript" {print}' | wc -l
3491
Let’s compare the StringTie transcripts to known transcripts using gffcompare.
# compare the assembled transcripts to known transcripts
gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
cat merged.stats
# gffcompare v0.10.1 | Command line was:
#gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
#
#= Summary for dataset: stringtie_merged.gtf
# Query mRNAs : 3281 in 1521 loci (2651 multi-exon transcripts)
# (535 multi-transcript loci, ~2.2 transcripts per locus)
# Reference mRNAs : 2102 in 1086 loci (1856 multi-exon)
# Super-loci w/ reference transcripts: 998
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 77.6 |
Exon level: 100.0 | 85.4 |
Intron level: 99.8 | 91.0 |
Intron chain level: 99.6 | 69.7 |
Transcript level: 99.6 | 63.8 |
Locus level: 100.0 | 70.9 |
Matching intron chains: 1848
Matching transcripts: 2094
Matching loci: 1086
Missed exons: 0/8804 ( 0.0%)
Novel exons: 971/10608 ( 9.2%)
Missed introns: 14/7946 ( 0.2%)
Novel introns: 219/8714 ( 2.5%)
Missed loci: 0/1086 ( 0.0%)
Novel loci: 421/1521 ( 27.7%)
Total union super-loci across all input datasets: 1521
3281 out of 3281 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
The high sensitivity means that almost all of the StringTie transcripts match the known transcripts, i.e. low false negative. The precision is much lower indicating that many of the StringTie transcripts are not in the list of known transcripts, which are either false positives or truly de novo transcripts. The novel exons, introns, and loci indicate how many of the sites were not found in the list of known transcripts. Now that we have our assembled transcripts, we can estimate their abundances.
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf map/ERR188044_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf map/ERR188104_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf map/ERR188234_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf map/ERR188245_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf map/ERR188257_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf map/ERR188273_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf map/ERR188337_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf map/ERR188383_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf map/ERR188401_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf map/ERR188428_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf map/ERR188454_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf map/ERR204916_chrX.bam
# estimation took just over a minute and a half
# real 1m39.661s
# user 2m0.179s
# sys 0m9.223s
# check out the files
ls -1 ballgown/ERR188044
ERR188044_chrX.gtf
e2t.ctab
e_data.ctab
i2t.ctab
i_data.ctab
t_data.ctab
## Differential expression
We use ballgown: Flexible, isoform-level differential expression analysis.
Tools for statistical analysis of assembled transcriptomes, including flexible differential expression analysis, visualization of transcript structures, and matching of assembled transcripts to annotation. manual(https://www.bioconductor.org/packages/release/bioc/html/ballgown.html)
Set up the exons/splice sites
hisat2_extract_splice_sites.py annotation.gtf > splicesites.tsv hisat2_extract_exons.py annotation.gtf > exons.tsv
Build index
hisat2-build --ss ./splicesites.tsv --exon ./exons.tsv assembly.fa organism_A
hisat2 -> samtools -> stringtie
hisat2 --dta -x ./assembly -1 ./reads/R1.fastq -2 ./reads/R2.fastq | samtools view -Su | samtools sort - | stringtie -G annotation.gtf -A sample_counts.tsv
REFERENCE: