Quality control & Adapter trimming.
FastQC Cutadapt fastq-format-wikipedia miRBase: a searchable database of published miRNA
In this tutorial yu learn some important steps quality and trimming steps
FastQ. See:
Create an empty directory to work in the exercise and copy or download the raw data to it:
mkdir -p ~/Desktop/quality_control_data/
cd ~/Desktop/quality_control_data/
git clone https://github.com/bioinfo-dirty-jobs/Data.git
cd Data
cp -r quality_control/ ~/Desktop/quality_control/
cd ~/Desktop/quality_control_data/
The file f010_raw_mirna.fastq.gz contains reads form a microRNA sequencing experiment.
Use the command head
to have a view of the first lines of the file:
zcat f010_raw_mirna.fastq.gz|head
gunzip f010_raw_mirna.fastq.gz
Use the command wc
to count how many reads are there in the file (remember you have to divide by 4)
wc -l f010_raw_mirna.fastq
First create a directory to store the results of the fastqc analysis:
mkdir f020_res_fastqc
Then execute fastqc
storing the results in the created directory (option -o
):
fastqc -o f020_res_fastqc f010_raw_mirna.fastq
Find the results in the fastqc_report.html file and discus them.
There are many Overrepresented sequences. Explore whether some of them correspond to miRNAs using the miRBase search By sequence utility.
There are 2 known adapters used in this experiment:
CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA
CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC
Use the command grep to see whether they are still present in your data:
grep "CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA" f010_raw_mirna.fastq
grep "CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC" f010_raw_mirna.fastq
Do the sequences appear systematically at the beginning or at the end of the reads?
But the adapters could also appear in the reverse, complementary or reverse complementary mode.
Compute the reverse, complementary and the reverse complementary sequences of the two adapters, and find out which of them appear in your data.
To compute those sequences you can use some online resources as the one in: SERVER
# use R-Bioconductor to compute their reverse, complementary and reverse complementary.
library (Biostrings)
myseq <- DNAString ("CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC")
reverse (myseq)
complement (myseq)
reverseComplement (myseq)
Use grep form Linux shell to find out which of the versions of the adapter is in your data.
grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l ## present in the sample (at the beginning of the reads)
grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l
grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l ## present in the sample (at the end of the read) ... but not so numerous
grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l
But sometimes the adapter does not appear complete. It may be there just the first part:
grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGA f010_raw_mirna.fastq | wc -l
grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCT f010_raw_mirna.fastq | wc -l
grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATT f010_raw_mirna.fastq | wc -l
grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAA f010_raw_mirna.fastq | wc -l
or the end part of it:
grep AATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l
grep TTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l
grep TCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l
grep AGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l
NOTE: in the code above I did cut just the 35 first or last nucleotides of the primer in its different arrangements, but this is an arbitrary length. We are just trying to discover which of the arrangements are present in our sample and whether there are allocated in the 5’ or in the 6’ end.
grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l ## present in the sample (at the end of the read) ... but not so numerous
grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l
grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l ## present in the sample (at the beginning of the reads)
grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l
As before, sometimes the adapter does not appear complete. It may be there just the first part:
grep CTTTTTTTCGTCCTTTCCACAAGATATATA f010_raw_mirna.fastq | wc -l
grep GAAAAAAAGCAGGAAAGGTGTTCTATATAT f010_raw_mirna.fastq | wc -l
grep GCTTACCGTAACTTGAAAGTATTTCGATTT f010_raw_mirna.fastq | wc -l
grep CGAATGGCATTGAACTTTCATAAAGCTAAA f010_raw_mirna.fastq | wc -l
or the end part of it:
grep AAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l
grep TTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l
grep CTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l
grep GAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l
Check the options:
you can find the help of the program typing cutadapt -h
in the shell.
To get read of the the adapters found in our data we run [cutadapt] several times: cutadapt -g CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA -o f030_mirna_trim1.fastq f010_raw_mirna.fastq
cutadapt -a TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG -o f030_mirna_trim2.fastq f030_mirna_trim1.fastq
cutadapt -g GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG -o f030_mirna_trim3.fastq f030_mirna_trim2.fastq
cutadapt -a CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC -o f030_mirna_trim4.fastq f030_mirna_trim3.fastq
Now you can grep
again searching for the adapters
grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f030_mirna_trim4.fastq | wc -l
#grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f030_mirna_trim4.fastq | wc -l
grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f030_mirna_trim4.fastq | wc -l
#grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f030_mirna_trim4.fastq | wc -l
grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f030_mirna_trim4.fastq | wc -l
#grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f030_mirna_trim4.fastq | wc -l
grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f030_mirna_trim4.fastq | wc -l
#grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f030_mirna_trim4.fastq | wc -l
Check the data quality again using fastqc:
mkdir f040_res_fastqc_trimmed
fastqc -o f040_res_fastqc_trimmed f030_mirna_trim4.fastq
Some of the reads seems to be too short and some others may not have enough quality.
Check the options:
you can find the help of the program typing cutadapt -h
in the shell.
Run cutadapt for length and quality purge of the reads.
cutadapt -m 17 -M 30 -q 10 -o f040_mirna_cut.fastq f030_mirna_trim4.fastq
Check the data quality again using fastqc:
mkdir f050_res_fastqc_trimmed_purged
fastqc -o f050_res_fastqc_trimmed_purged f040_mirna_cut.fastq
Explore again the Overrepresented sequences in [mirbase] (Search -> By sequence).
Count how many reads are left for the analysis (divide by 4)
wc -l f010_raw_mirna.fastq
wc -l f040_mirna_cut.fastq
Collapse seqeunce for evaluated the complexity of the samples. More differents sequence you have more your experiment are well performed.
fastx_collapse f010_raw_mirna.fastq >f010_raw_mirna.fastq.collpse
Please familiarize with this TOOL KRAKEN
How can detect the adapter contamination:
minion search-adapter -i QUALITY/quality_control/f010_raw_mirna.fastq.gz