Pratical test

Pratical test

This workshop will show you how to launch individual first steps of a DNA-Seq pipeline

We will be working on a 1000 genome sample, NA12878. You can find the whole raw data on the 1000 genome website: http://www.internationalgenome.org/data

NA12878 is the child of the trio while NA12891 and NA12892 are her parents.

Here we start with one sample and then you need to try with the other samples

Mother Father Child
NA12892 NA12891 NA12878

We’re going to focus on the reads extracted from a 300 kbp stretch of chromosome 1

Chromosome Start End
chr1 17704860 18004860

Pratical

Install enviroment

So we start with the enviroment using conda

   git clone https://github.com/bioinfo-dirty-jobs/test_1000.git
   conda create -n Test_5 -y
   source activate Test_5
   conda install samtools bwa picard bedtools -y

Data files

The initial structure of your folders should look like this:

.
├── bam
│   ├── NA12878.mapped.bam
│   ├── NA12891.mapped.bam
│   └── NA12892.mapped.bam
└── RAW
    ├── NA12878_R1.fq.gz
    ├── NA12878_R2.fq.gz
    ├── NA12891_R1.fq.gz
    ├── NA12891_R2.fq.gz
    ├── NA12892_R1.fq.gz
    └── NA12892_R2.fq.gz

2 directories, 9 files

Cheat sheets

First data glance

So you’ve just received an email saying that your data is ready for download from the sequencing center of your choice.

What should you do ?




Fastq files

Let’s first explore the fastq file.

Try these commands

  zless RAW/NA12878_R1.fq.gz

These are fastq file.

Could you descride the fastq format ?




zcat RAW/NA12878_R1.fq.gz | head -n4
zcat RAW/NA12878_R2.fq.gz | head -n4

What was special about the output and why was it like that?




You could also count the reads

zgrep -c "^@SRR098401" RAW/NA12878_R1.fq.gz

We found 736 reads

Why shouldn’t you just do ?

zgrep -c "^@" RAW/NA12878_R1.fq.gz



\pagebreak

Alignment

The raw reads are now cleaned up of artefacts we can align the read to the reference.

In case you have multiple readsets or library you should align them separatly !

Why should this be done separatly ?




mkdir -p alignment/NA12878/

bwa mem -M -t 2 \
  -R '@RG\tID:NA12878\tSM:NA12878\tLB:NA12878\tPU:runNA12878_1\tCN:Broad Institute\tPL:ILLUMINA' \
   hg19.fa \
  RAW/NA12878_R1.fq.gz
   RAW/NA12878_R1.fq.gz > pipp.sam

Why is it important to set Read Group information ?




The details of the fields can be found in the SAM/BAM specifications Here For most cases, only the sample name, platform unit and library one are important.

Why did we pipe the output of one to the other? Could we have done it differently ?




- What are the ways to detect them ?




\pagebreak