Jaffa configuration

Jaffa

JAFFA !

JAFFA is a multi-step pipeline that takes either raw RNA-Seq reads, or pre-assembled transcripts, then searches for gene fusions. It will output the names and locations of candidate gene fusions along with the cDNA sequence of their breakpoints.

JAFFA is based on the idea of comparing a transcriptome (e.g. in a cancer sample) against a reference transcriptome. In this way, it is a transcript-centric approach rather than a genome-centric approach like other fusion finders. In validation studies, JAFFA performed well over a range of read lengths - from 50bp to full-length transcripts and on single and paired-end reads. For more information please see one of the wiki pages below

mkdir REF_JAFFA
cd REF_JAFFA
wget -r -nd http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz

tar xvf hg38.chromFa.tar.gz
cat chroms/chr*.fa > hg38.fa
rm -rf chroms/
wget -r -nd -np https://drive.google.com/open?id=0B1FwZagazjpcd2RDVmtiazVWVk0
tar xvfz JAFFA_REFERENCE_FILES_HG38_GENCODE22.tar.gz



By default JAFFA will be expecting this file to be in the root of the JAFFA code directory. You can either copy it there, create a symbolic link to it (ln -s ), provide the path to your hg38.fa file in the pipeline file JAFFA_stages.groovy or pass it when you run the JAFFA command. Note that JAFFA expects the UCSC version of the genome.

Run JAFFA

Input file

The input to JAFFA should be either reads which have been gzipped. i.e. with an ending like “.fastq.gz” or a fasta file of contigs with an ending like “.fasta”. JAFFA assumes there is one files (single-end) or pair of files (paired-end) per sample.

Running

Create and change into the directory where you intend the output files of JAFFA to be placed. You then have a choice of three JAFFA running modes: Assembly, Hybrid and Direct. Which mode to use will depend on your read length. When to use which mode?

  • For reads shorter than 70bp use the Assembly mode. The reads are too short to be used as contigs directly.
  • For longer reads ( 70 and 95 bp), the hybrid mode is the most sensitive. However, because it involves assembly, it requires a lot of memory and CPU time. If computational resources are a constraint, we recommend using the direct method.
  • For long reads (e.g. 100 bp up), there is no advantage in assembling, therefore we recommend the direct mode.

Running Mode

Assembly

JAFFA will call Velvet and Oases to assemble the reads. It will then search for fusions from amongst the assembled contigs.

Bpipe will giving you information about each stage in the pipeline as it run it. Let go through these in some detail.

Stage run_check- This checks that all the required software is installed. If bpipe fails here, you may need to double check the installation. Next bpipe with create two branches, one for each sample set. In the case of this demo, there will be one for MCF-7 and one for BTF-7. These branches will be run in parallel. You can control the number of branches running in parallel with the bpipe -n option. For each branch, these stages are run: Stage make_dir_using_fastq_names. Create a directory for each sample. ** Stage prepare_reads ** Now you will see that bpipe is running Trimmomatic. With JAFFA’s default setting, no trimming is actually performed, but the fastq files will be unzipped. It is possible to control the amount of trimming. See JAFFA_stages.groovy. During this step you may also see that JAFFA is mapping reads. This is used to filter out any pairs which map to intronic or intergenic regions, or chrM. Stage run_assembly. Next the reads will be assembled. The assembly involved running velveth, velvetg then oases, 6 times: once for each kmer 19, 23, 27,31 and 35, and these once to merge. If something goes wrong with the pipeline, it is often during this stage. For example you will need to ensure that your machine has enough RAM (especially if you are running samples in parallel). The assembly can be adjusted in the JAFFA script assemble.sh. In this demo the assembly will be fast, but in practice this stage can take a long time. Stage align_transcripts_to_annotation. The assembled contigs will then be aligned to the reference transcriptome (GENCODE by default). BLAT is used for this. Stage filter_transcripts. Preliminary filtering will be performed (this uses an R script). Stage extract_fusion_sequences. The fasta sequences of the preliminary candidate will be extracted Stage map_reads. Reads are mapped back to the candidate fusion sequences Stage get_spanning_reads. JAFFA will count the number of spanning read and spanning pairs over the break point of the fusions Stage align_transcripts_to_genome. The candidates will be aligned to the human genome using BLAT Stage get_final_list. A second filtering of the candidates is done using the genomic alignment and read coverage data. A file ending in .summary will contain all the candidates fusions identified by JAFFA Stage compile_all_results. The summary information from all samples is merged. A fasta file of candidate fusions from all samples is created.

If everything runs smoothly, you should see two new files created in your working directory: jaffa_results.csv and jaffa_results.fasta. See OutputDescription for a description of the content of these files.

Direct

JAFFA will map reads to the known reference transcriptome and extract reads which do not map. It will then search for fusions from amongst the unmapped reads.

You’ll see a few different stages in this pipeline, compared to the previous:

Stage cat_reads . Concatenate the paired-end reads into a single file. Stage remove_dup . Remove duplicate reads. Stage get_unmapped. Map the reads to the reference transcriptome and extract the reads that do not map.

Hybrid

This is a combination of the previous two modes. First JAFFA will call Velvet and Oases to assemble the reads. It will then search for fusions from amongst the assembled contigs. Next it will map reads to both the known reference transcriptome and the assembled transcriptome. It will then search for fusions from amongst the unmapped reads.