Practical Salmon

Practical 3 Salmon tutorial

Introduction

Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data
Don't count . . . quantify

This brief tutorial will explain how you can get started using Salmon to quantify your RNA-seq data. Our computational power are not enought for elaborate a rela case so we use only a part of the data. We suggest to try at home with the complete stock of data.

Install Salmon

We use conda for install this tools. Conda allow to use reproducible version of any software. We strat to set the configuration of conda repository and we create the enviroment salmon:

    $ conda config --add channels conda-forge
    $ conda config --add channels bioconda
    $ conda create -n salmon salmon

Then we activate the enviroment:

        $ conda activate salmon
        $ salmon -h

Tutorial

We start to create directory and download all the data we need:

        $ mkdir Salmon_tutorial
        $ cd Salmon_tutorial
        mkdir reference


We prepare the reference :

    $ curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o reference/athal.fa.gz

Here, we’ve used a reference transcriptome for Arabadopsis. However, one of the benefits of performing quantification directly on the transcriptome (rather than via the host genome), is that one can easily quantify assembled transcripts as well (obtained via software such as StringTie for organisms with a reference or Trinity for de novo RNA-seq experiments).

Next, we’re going to build an index on our transcriptome. The index is a structure that salmon uses to quasi-map RNA-seq reads during quantification. The index need only be constructed once per transcriptome, and it can then be reused to quantify many experiments. We use the index command of salmon to build our index:

        salmon index -t reference/athal.fa.gz -i reference/athal_index

In addition to the index, salmon obviously requires the RNA-seq reads from the experiment to perform quantification. In this tutorial, we’ll be analyzing data from this 4-condition experiment [accession PRJDB2508]. You can use the following shell script to obtain the raw data and place the corresponding read files in the proper locations. Here, we’re simply placing all of the data in a directory called data, and the left and right reads for each sample in a sub-directory labeled with that sample’s ID (i.e. DRR016125_1.fastq.gz and DRR016125_2.fastq.gz go in a folder called data/DRR016125).

        #!/bin/bash
        mkdir data
        cd data
        for i in `seq 25 40`; 
        do 
        mkdir DRR0161${i}; 
        cd DRR0161${i}; 
        wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR016/DRR0161${i}/DRR0161${i}_1.fastq.gz; 
        wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR016/DRR0161${i}/DRR0161${i}_2.fastq.gz; 
        cd ..; 
        done
        cd .. 


Quantifying the samples

Now that we have our index built and all of our data downloaded, we’re ready to quantify our samples. Since we’ll be running the same command on each sample, the simplest way to automate this process is, again, a simple shell script (quant_tut_samples.sh):

            #!/bin/bash
            for fn in data/DRR0161{25..40};
            do
            samp=`basename ${fn}`
            echo "Processing sample ${samp}"
            salmon quant -i reference/athal_index -l A \
                    -1 ${fn}/${samp}_1.fastq.gz \
                    -2 ${fn}/${samp}_2.fastq.gz \
                    -p 8 --validateMappings -o quants/${samp}_quant
            done 

This script simply loops through each sample and invokes salmon using fairly barebone options. The -i argument tells salmon where to find the index -l A tells salmon that it should automatically determine the library type of the sequencing reads (e.g. stranded vs. unstranded etc.). The -1 and -2 arguments tell salmon where to find the left and right reads for this sample (notice, salmon will accept gzipped FASTQ files directly). Finally, the -p 8 argument tells salmon to make use of 8 threads and the -o argument specifies the directory where salmon’s quantification results sould be written. Salmon exposes many different options to the user that enable extra features or modify default behavior. However, the purpose and behavior of all of those options is beyond the scope of this introductory tutorial. You can read about salmon’s many options in the documentation.

After the salmon commands finish running, you should have a directory named quants, which will have a sub-directory for each sample. These sub-directories contain the quantification results of salmon, as well as a lot of other information salmon records about the sample and the run. The main output file (called quant.sf) is rather self-explanatory. For example, take a peek at the quantification file for sample DRR016125 in quants/DRR016125/quant.sf and you’ll see a simple TSV format file listing the name (Name) of each transcript, its length (Length), effective length (EffectiveLength) (more details on this in the documentation), and its abundance in terms of Transcripts Per Million (TPM) and estimated number of reads (NumReads) originating from this transcript. After quantification

That’s it! Quantifying your RNA-seq data with salmon is that simple (and fast). Once you have your quantification results you can use them for downstream analysis with differential expression tools like DESeq2, edgeR, limma, or sleuth. Using the tximport package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis. You can read more about how to import salmon’s results into DESeq2 by reading the tximport section of the excellent DESeq2 vignette. For instructions on importing for use with edgeR or limma, see the tximport vignette. For preparing salmon output for use with sleuth, see the wasabi package.

When you are building a salmon index, please do not build the index on the genome of the organism whose transcripts you want to quantify, this is almost certainly not want you want to do and will not provide you with meaningful results. ↩

REFERENCE