Practical Hisat2

Practical 4 Hisat2

Introduction

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).

This brief tutorial will explain how you can get started using Hisat2 to quantify your RNA-seq data. Our computational power are not enought for elaborate a rela case so we use only a part of the data. We suggest to try at home with the complete stock of data. Computer avaible are not born for this purpose (oly 4G of RAM, It is not enough for the new operation systems) we need more patience

        conda activate hisat2
        hisat2 -h

Obtain Tutorial Files

Use the UNIX command wget to pull the data off the FTP server hosting the data we will be working with. Use the command cd [Options] [Directory] to change into your desired ~/working_directory and then download these files.


 $ wget ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/chrX_data.tar.gz

Extract Tutorial Files Using the UNIX command tar xvzf, we can extract the .tar.gz files into our ~/working_directory. The command option ‘x’ extracts the files, ‘v’ lists the processed files verbosely, ‘z’ filters the archive through gzip, and ‘f’ tells tar to used the archived file.


$ tar xvzf chrX_data.tar.gz

Align Reads Using HISAT2

Using HISAT2, we can align our sample .fastq.gz files (without the need to unzip them) to the indexed reference genome, that has already been prepared, located in the chrX_data/indexes/ directory. Doing so will generate our SAM (Sequence Alignment Map) files we will use in later steps. First we type out hisat2 to denote the command we are using. The options entered here are ‘-p 8’ denoting the use of 8 threads, ‘–dta’ is used to generate output SAM files that can be directly read into StringTie, ‘-x’ is used to denote the indexed reference genome, ‘-1’ and ‘-2’ are used to denote our fwd and rev samples in a paired-end alignment, and ‘-S’ is used to denote that we would like our output in SAM format.

$ hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam

Here is a bash script for the above HISAT2 command called hisat2.sh that will run all the .fastq.gz files for you simultaneously.

#!/usr/bin/bash

#bash script for hisat2; align all .fastq.gz files to indexed reference genome to generate .sam files

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    hisat2 -p 11 --dta -x ~/chrX_data/indexes/chrX_tran -1 ~/chrX_data/samples/${SAMPLE}_chrX_1.fastq.gz -2 ~/chrX_data/samples/${SAMPLE}_chrX_2.fastq.gz -S ${SAMPLE}_chrX.sam
done

or you can copy this:

    # map each sample using 1 threads
    mkdir map
    hisat2 -p 1 --rg-id=ERR188044 --rg SM:ERR188044  --rg PL:ILLUMINA --rg PU:aaaaaaa -x chrX_data/indexes/chrX_tran --dta --rna-strandness RF -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S map/ERR188044_chrX.sam
    hisat2 -p 1 --rg-id=ERR188104 --rg SM:ERR188104  --rg PL:ILLUMINA --rg PU:aaaaaaa -x chrX_data/indexes/chrX_tran --dta --rna-strandness RF -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S map/ERR188104_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S map/ERR188234_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran   -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S map/ERR188245_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S map/ERR188257_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran-1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S map/ERR188273_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S map/ERR188337_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S map/ERR188383_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S map/ERR188401_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S map/ERR188428_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S map/ERR188454_chrX.sam
    hisat2 -p 1 -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S map/ERR204916_chrX.sam

Generate BAM Files Using SAMTools

In order to generate files that we can use for StringTie and view in IGV (Integrative Genomics Viewer), we need to convert our SAM files to BAM (Binary Alignment Map) files using SAMTools. A BAM file is the binary version of a SAM file and is used to assemble aligned reads into transcripts using StringTie and is also the preferred file format for viewing in IGV. We will use the samtools command with the options: ‘sort’ to sort the alignments by the leftmost coordinates, ‘-@ 8’ to denote the usage of 8 threads, ‘-o’ to denote that we want our outputs to be BAM files in [out.bam] format, and finally we enter our [input.sam] files.


$ samtools sort -@ 8 -o ERR188044_chrX.bam ERR188044_chrX.sam

We also have a bash script for the above command called sort.sh that will do all our .sam files simultaneously.

#!/usr/bin/bash

#bash script for samtools; convert .sam files to .bam files

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    samtools sort -@ 11 -o ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.sam
done

or you can use

    # sort all
    samtools sort -@ 1 -o map/ERR188044_chrX.bam map/ERR188044_chrX.sam
    samtools sort -@ 1 -o map/ERR188104_chrX.bam map/ERR188104_chrX.sam
    samtools sort -@ 1 -o map/ERR188234_chrX.bam map/ERR188234_chrX.sam
    samtools sort -@ 1 -o map/ERR188245_chrX.bam map/ERR188245_chrX.sam
    samtools sort -@ 1 -o map/ERR188257_chrX.bam map/ERR188257_chrX.sam
    samtools sort -@ 1 -o map/ERR188273_chrX.bam map/ERR188273_chrX.sam
    samtools sort -@ 1 -o map/ERR188337_chrX.bam map/ERR188337_chrX.sam
    samtools sort -@ 1 -o map/ERR188383_chrX.bam map/ERR188383_chrX.sam
    samtools sort -@ 1 -o map/ERR188401_chrX.bam map/ERR188401_chrX.sam
    samtools sort -@ 1 -o map/ERR188428_chrX.bam map/ERR188428_chrX.sam
    samtools sort -@ 1 -o map/ERR188454_chrX.bam map/ERR188454_chrX.sam
    samtools sort -@ 1 -o map/ERR204916_chrX.bam map/ERR204916_chrX.sam
    

Index BAM Files Using SAMTools

To view our BAM files in IGV, we need to index them and for this we also use SAMTools. IGV won’t accept our .bam files without an accompanying .bam.bai file, so in order to view our .bam files in IGV, this step is essential. Using the samtools command with the ‘index’ option, we enter out [in.bam] files and receive [out.bam.bai] files. With these two files in hand, we can now view our data using IGV!


$ samtools index ERR188044_chrX.bam ERR188044_chrX.bam.bai

Again, we have a bash script called index.sh that will index all our .bam files and generate .bam.bai files simultaneously.

#!/usr/bin/bash

#bash script for samtools; index our .bam files to obtain .bam.bai files using samtools

SAMPLES="ERR188044 ERR188104 ERR188234 ERR188245 ERR188257 ERR188273 ERR188337 ERR188383 ERR188401 ERR188428 ERR188454 ERR204916"

for SAMPLE in $SAMPLES; do
    samtools index ${SAMPLE}_chrX.bam ${SAMPLE}_chrX.bam.bai
done

Assembli the transcript

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments longer sequences that have been assembled from those reads.In order to identify differentially expressed genes between experiments, StringTie’s output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.). manual[https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual]

    # remove SAM files
    rm map/*.sam
    mkdir assembly
    
    # create assembly per sample using 8 threads
    stringtie map/ERR188044_chrX.bam -l ERR188044 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188044_chrX.gtf
    stringtie map/ERR188104_chrX.bam -l ERR188104 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188104_chrX.gtf
    stringtie map/ERR188234_chrX.bam -l ERR188234 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188234_chrX.gtf
    stringtie map/ERR188245_chrX.bam -l ERR188245 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188245_chrX.gtf
    stringtie map/ERR188257_chrX.bam -l ERR188257 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188257_chrX.gtf
    stringtie map/ERR188273_chrX.bam -l ERR188273 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188273_chrX.gtf
    stringtie map/ERR188337_chrX.bam -l ERR188337 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188337_chrX.gtf
    stringtie map/ERR188383_chrX.bam -l ERR188383 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188383_chrX.gtf
    stringtie map/ERR188401_chrX.bam -l ERR188401 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188401_chrX.gtf
    stringtie map/ERR188428_chrX.bam -l ERR188428 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188428_chrX.gtf
    stringtie map/ERR188454_chrX.bam -l ERR188454 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR188454_chrX.gtf
    stringtie map/ERR204916_chrX.bam -l ERR204916 -p 1 -G chrX_data/genes/chrX.gtf -o assembly/ERR204916_chrX.gtf
    # before merging we need to modify mergelist.txt
    # this is because I created a new directory to store the results
    # the modified mergelist.txt should look like this
    cat chrX_data/mergelist.txt
    assembly/ERR188044_chrX.gtf
    assembly/ERR188104_chrX.gtf
    assembly/ERR188234_chrX.gtf
    assembly/ERR188245_chrX.gtf
    assembly/ERR188257_chrX.gtf
    assembly/ERR188273_chrX.gtf
    assembly/ERR188337_chrX.gtf
    assembly/ERR188383_chrX.gtf
    assembly/ERR188401_chrX.gtf
    assembly/ERR188428_chrX.gtf
    assembly/ERR188454_chrX.gtf
    assembly/ERR204916_chrX.gtf

Transcript merge mode. This is a special usage mode of StringTie, distinct from the assembly usage. In the merge mode, StringTie takes as input a list of GTF/GFF files and merges/assembles these transcripts into a non-redundant set of transcripts. This mode is used in the new differential analysis pipeline to generate a global, unified set of transcripts (isoforms) across multiple RNA-Seq samples.

    # merge all transcripts from the different samples
    stringtie --merge -p 1 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt
    # check out the transcripts
    cat stringtie_merged.gtf | head
    # stringtie --merge -p 8 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf chrX_data/mergelist.txt
    # StringTie version 1.3.3b
    chrX    StringTie       transcript      322514  323718  1000    .       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
    chrX    StringTie       exon    322514  323718  1000    .       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
    chrX    StringTie       transcript      319145  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027232"; gene_name "LINC00685"; ref_gene_id "NR_027232"; 
    chrX    StringTie       exon    319145  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027232"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027232"; 
    chrX    StringTie       transcript      319145  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027231"; gene_name "LINC00685"; ref_gene_id "NR_027231"; 
    chrX    StringTie       exon    319145  319551  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "1"; gene_name "LINC00685"; ref_gene_id "NR_027231"; 
    chrX    StringTie       exon    320208  321319  1000    +       .       gene_id "MSTRG.2"; transcript_id "NR_027231"; exon_number "2"; gene_name "LINC00685"; ref_gene_id "NR_027231"; 
    chrX    StringTie       transcript      304750  318701  1000    -       .       gene_id "MSTRG.3"; transcript_id "MSTRG.3.1";
    
    # how many transcripts?
    cat stringtie_merged.gtf | grep -v "^#" | awk '$3=="transcript" {print}' | wc -l
        3491

Let’s compare the StringTie transcripts to known transcripts using gffcompare.

 # compare the assembled transcripts to known transcripts
gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
 
cat merged.stats
# gffcompare v0.10.1 | Command line was:
#gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
#
 
#= Summary for dataset: stringtie_merged.gtf 
#     Query mRNAs :    3281 in    1521 loci  (2651 multi-exon transcripts)
#            (535 multi-transcript loci, ~2.2 transcripts per locus)
# Reference mRNAs :    2102 in    1086 loci  (1856 multi-exon)
# Super-loci w/ reference transcripts:      998
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    77.6    |
        Exon level:   100.0     |    85.4    |
      Intron level:    99.8     |    91.0    |
Intron chain level:    99.6     |    69.7    |
  Transcript level:    99.6     |    63.8    |
       Locus level:   100.0     |    70.9    |
 
     Matching intron chains:    1848
       Matching transcripts:    2094
              Matching loci:    1086
 
          Missed exons:       0/8804    (  0.0%)
           Novel exons:     971/10608   (  9.2%)
        Missed introns:      14/7946    (  0.2%)
         Novel introns:     219/8714    (  2.5%)
           Missed loci:       0/1086    (  0.0%)
            Novel loci:     421/1521    ( 27.7%)
 
 Total union super-loci across all input datasets: 1521 
3281 out of 3281 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)

The high sensitivity means that almost all of the StringTie transcripts match the known transcripts, i.e. low false negative. The precision is much lower indicating that many of the StringTie transcripts are not in the list of known transcripts, which are either false positives or truly de novo transcripts. The novel exons, introns, and loci indicate how many of the sites were not found in the list of known transcripts. Now that we have our assembled transcripts, we can estimate their abundances.

 stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf map/ERR188044_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf map/ERR188104_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf map/ERR188234_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf map/ERR188245_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf map/ERR188257_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf map/ERR188273_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf map/ERR188337_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf map/ERR188383_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf map/ERR188401_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf map/ERR188428_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf map/ERR188454_chrX.bam
stringtie -e -B -p 8 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf map/ERR204916_chrX.bam
 
# estimation took just over a minute and a half
# real    1m39.661s
# user    2m0.179s
# sys     0m9.223s
 
# check out the files
ls -1 ballgown/ERR188044
ERR188044_chrX.gtf
e2t.ctab
e_data.ctab
i2t.ctab
i_data.ctab
t_data.ctab

## Differential expression

We use ballgown: Flexible, isoform-level differential expression analysis.

Tools for statistical analysis of assembled transcriptomes, including flexible differential expression analysis, visualization of transcript structures, and matching of assembled transcripts to annotation. manual(https://www.bioconductor.org/packages/release/bioc/html/ballgown.html)

TIPS

Set up the exons/splice sites

hisat2_extract_splice_sites.py annotation.gtf > splicesites.tsv hisat2_extract_exons.py annotation.gtf > exons.tsv

Build index

hisat2-build --ss ./splicesites.tsv --exon ./exons.tsv assembly.fa organism_A
hisat2 -> samtools -> stringtie

hisat2 --dta -x ./assembly -1 ./reads/R1.fastq -2 ./reads/R2.fastq | samtools view -Su | samtools sort - | stringtie -G annotation.gtf -A sample_counts.tsv

REFERENCE:

https://davetang.org/muse/2017/10/25/getting-started-hisat-stringtie-ballgown/
https://bioinformatics-core-shared-training.github.io/RNAseq_September_2018/Supplementary_Materials/S1_Getting_raw_reads_from_SRA.html

RNA-seq Analisys Course