Quality and Trimming data Tutorial

Quality and Trimming data Tutorial

author: Maurizio Polano date: autosize: true

========================================================

Alignment

Locating where each generated sequence came from in the genome

Popular aligners:

  • bwa http://bio-bwa.sourceforge.net/
  • STAR https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/
  • bowtie http://bowtie-bio.sourceforge.net/index.shtml
  • novoalign http://www.novocraft.com/products/novoalign/
  • http://www.well.ox.ac.uk/project-stampy -many, many more…..

Demo to follow after this talk

======================================================== Post-processing of aligned files

  • Marking of PCR duplicates:
    • PCR amplification errors can cause some sequences to be over-represented
    • Chances of any two sequences aligning to the same position are unlikely
    • Caveat: obviously this depends on amount of the genome you are capturing
    • Such reads are marked but not usually removed from the data
    • Most downstream methods will ignore such reads
    • Typically, picard is used
  • Sorting
    • Reads can be sorted according to genomic position *samtools
  • Indexing
    • Allow efficient access
      • samtools

========================================================

Sam file flags

Represent useful QC information:

  • Read is unmapped
  • Read is paired / unpaired
  • Read failed QC
  • Read is a PCR duplicate

    https://broadinstitute.github.io/picard/explain-flags.html

========================================================

Quality Control

   #bash
    pwd
    cd 
    cd ~/Desktop/
    mkdir -p USERNAME_11feb2019/
    cd USERNAME_11feb2019/
    git clone https://github.com/bioinfo-dirty-jobs/quality.git
    ls -ltr

========================================================

    source activate NGSBASE
    cd quality
    mkdir test
    cp trimming/qa/*.fastq test/
    ls test/
    mkdir -p quality_pre
    fastqc -o quality_pre/ test/*.fastq
    firefox quality_pre/*.html

========================================================

So I want to delete the first adapter GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA

  conda install cutadapt -y
  cutadapt -m 20 -e 0.1 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA trimming/qa/sample2.fastq -o sample2--cutadapt.fastq

The following options were used:

  • -a: The sequence of the adaptor is given with the -a option
  • -m 20: throws away processed reads shorter than 20 bases.
  • -e 0.1: The level of error tolerance is adjusted by specifying a maximum 10% error rate. Allowed errors are mismatches, insertions and deletions.
  • -o: output file

========================================================

Total reads processed: 100,000 Reads with adapters: 32,792 (32.8%) Reads that were too short: 31,113 (31.1%) Reads written (passing filters): 68,887 (68.9%)

Total basepairs processed: 3,600,000 bp Total written (filtered): 2,473,750 bp (68.7%)

=== Adapter 1 ===

Sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA; Type: regular 3’; Length: 36; Trimmed: 32792 times.

No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-36 bp: 3

Bases preceding removed adapters: A: 1.2% C: 1.2% G: 1.0% T: 2.0% none/other: 94.6%

========================================================

  fastqc sample2--cutadapt.fastq

========================================================

Download some fastq

    curl -O -J -L https://osf.io/shqpv/download
    curl -O -J -L https://osf.io/9m3ch/download

Download the adapter

   curl -O -J -L https://osf.io/v24pt/download

======================================================== Now we’ll do some trimming!

Scythe uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. It considers quality information, which can make it robust in picking out 3’-end adapters, which often include poor quality bases.

The first thing we need is the adapters to trim off

           conda install -c faircloth-lab scythe  -y
        scythe -a adapters.fasta -o SRR957824_adapt_R1.fastq SRR957824_500K_R1.fastq.gz
        scythe -a adapters.fasta -o SRR957824_adapt_R2.fastq SRR957824_500K_R2.fastq.gz

========================================================

Sickle

Most modern sequencing technologies produce reads that have deteriorating quality towards the 3’-end and some towards the 5’-end as well. Incorrectly called bases in both regions negatively impact assembles, mapping, and downstream bioinformatics analyses. We will trim each read individually down to the good quality part to keep the bad part from interfering with downstream applications. To do so, we will use sickle. Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3’-end of reads and also determines when the quality is sufficiently high enough to trim the 5’-end of reads. It will also discard reads based upon a length threshold.

========================================================

          conda install sickle-trim -y
          sickle pe -f SRR957824_adapt_R1.fastq -r SRR957824_adapt_R2.fastq \
-t sanger -o SRR957824_trimmed_R1.fastq -p SRR957824_trimmed_R2.fastq \
-s /dev/null -q 25

fastqc SRR957824_trimmed_R1.fastq SRR957824_trimmed_R2.fastq

========================================================