This workshop will show you how to launch individual first steps of a DNA-Seq pipeline
We will be working on a 1000 genome sample, NA12878. You can find the whole raw data on the 1000 genome website: http://www.internationalgenome.org/data
NA12878 is the child of the trio while NA12891 and NA12892 are her parents.
Here we start with one sample and then you need to try with the other samples
Mother | Father | Child |
---|---|---|
NA12892 | NA12891 | NA12878 |
We’re going to focus on the reads extracted from a 300 kbp stretch of chromosome 1
Chromosome | Start | End |
---|---|---|
chr1 | 17704860 | 18004860 |
So we start with the enviroment using conda
git clone https://github.com/bioinfo-dirty-jobs/test_1000.git
conda create -n Test_5 -y
source activate Test_5
conda install samtools bwa picard bedtools -y
The initial structure of your folders should look like this:
.
├── bam
│ ├── NA12878.mapped.bam
│ ├── NA12891.mapped.bam
│ └── NA12892.mapped.bam
└── RAW
├── NA12878_R1.fq.gz
├── NA12878_R2.fq.gz
├── NA12891_R1.fq.gz
├── NA12891_R2.fq.gz
├── NA12892_R1.fq.gz
└── NA12892_R2.fq.gz
2 directories, 9 files
So you’ve just received an email saying that your data is ready for download from the sequencing center of your choice.
What should you do ?
Let’s first explore the fastq file.
Try these commands
zless RAW/NA12878_R1.fq.gz
These are fastq file.
Could you descride the fastq format ?
zcat RAW/NA12878_R1.fq.gz | head -n4
zcat RAW/NA12878_R2.fq.gz | head -n4
What was special about the output and why was it like that?
You could also count the reads
zgrep -c "^@SRR098401" RAW/NA12878_R1.fq.gz
We found 736 reads
Why shouldn’t you just do ?
zgrep -c "^@" RAW/NA12878_R1.fq.gz
\pagebreak
The raw reads are now cleaned up of artefacts we can align the read to the reference.
In case you have multiple readsets or library you should align them separatly !
Why should this be done separatly ?
mkdir -p alignment/NA12878/
bwa mem -M -t 2 \
-R '@RG\tID:NA12878\tSM:NA12878\tLB:NA12878\tPU:runNA12878_1\tCN:Broad Institute\tPL:ILLUMINA' \
hg19.fa \
RAW/NA12878_R1.fq.gz
RAW/NA12878_R1.fq.gz > pipp.sam
Why is it important to set Read Group information ?
The details of the fields can be found in the SAM/BAM specifications Here For most cases, only the sample name, platform unit and library one are important.
Why did we pipe the output of one to the other? Could we have done it differently ?
- What are the ways to detect them ?
\pagebreak