FusionCatcher

Agenda, resources and other useful information

Table of Contents

  1. Introduction
  2. Hardware requirements and dependencies
  3. FusionCatcher in scientific articles
  4. Installation and usage examples
  5. Quick installation
  6. Usage
  7. Aligners
  8. Command line options
  9. Methods
  10. Comparisons to other tools
  11. License
  12. Citing
  13. Reporting bugs

1 - INTRODUCTION

FusionCatcher searches for novel/known somatic fusion genes, translocations, and/or chimeras in RNA-seq data (stranded/unstranded paired-end/single-end reads in FASTQ files produced by Illumina next-generation sequencing platforms such as Solexa/HiSeq/NextSeq/MiSeq/MiniSeq) from diseased samples. Unfortunately, we don’t have the resources and time to discuss this amazing program. It was the first program for somatic fusion detection with biological annotation and one of the major contributors to the development of STAR-Fusion.

The aims of FusionCatcher are:

  • very good detection rate for finding candidate somatic fusion genes (see somatic mutations; using a matched normal sample is optional; several databases of known fusion genes found in healthy samples are used as a list of known false positives; biological knowledge is used, like for example gene fusion between a gene and its pseudogene is filtered out),
  • very good RT-PCR validation rate of found candidate somatic fusion genes (this is very important for us),
  • very easy to use (i.e. no a priori knowledge of bioinformatic databases and bioinformatics is needed in order to run FusionCatcher BUT Linux/Unix knowledge is needed; it allows a very high level of control for expert users),
  • to be as automatic as possible (i.e. the FusionCatcher will choose automatically the best parameters in order to find candidate somatic fusion genes, e.g. finding automatically the adapters, quality trimming of reads, building the exon-exon junctions automatically based on the length of the reads given as input, etc. while giving also full control to expert users) while providing the best possible detection rate for finding somatic fusion genes (with a very low rate of false positives but a very good sensitivity).

6.2 - Output data

FusionCatcher produces a list of candidate fusion genes using the given input data. It is recommended that this list of candidate fusion genes is further validated in the wet lab using, for example, PCR/FISH experiments.

The output files are:

  • final-list_candidate_fusion_genes.txt - final list with the newly found candidate fusion genes (it contains the fusion genes with their junction sequence and points); starting with version 0.99.3c the coordinates of fusion genes are given here for the human genome using only assembly hg38/GRCh38; see Table 1 for the column descriptions;
  • final-list_candidate_fusion_genes.GRCh37.txt - final list with the newly found candidate fusion genes (it contains the fusion genes with their junction sequence and points); starting with version 0.99.3d the coordinates of fusion genes are given here for the human genome using assembly hg19/GRCh37; see Table 1 for the column descriptions;
  • summary_candidate_fusions.txt - contains an executive summary (meant to be read directly by medical doctors or biologists) of the candidate fusion genes found;
  • final-list_candidate_fusion_genes.caption.md.txt - explains in detail the labels found in column Fusion_description of files final-list_candidate_fusion_genes.txt and final-list_candidate_fusion_genes.GRCh37.txt;
  • supporting-reads_gene-fusions_BOWTIE.zip - sequences of short reads supporting the newly found candidate fusion genes found using only and exclusively the Bowtie aligner;
  • supporting-reads_gene-fusions_BLAT.zip - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and Blat aligners;
  • supporting-reads_gene-fusions_STAR.zip - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and STAR aligners;
  • supporting-reads_gene-fusions_BOWTIE2.zip - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and Bowtie2 aligners;
  • supporting-reads_gene-fusions_BWA.zip - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and BWA aligners;
  • viruses_bacteria_phages.txt - (non-zero) read counts for each virus/bacteria/phage from the NCBI database ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/;
  • info.txt - information regarding genome version, Ensembl database version, versions of tools used, read counts, etc.;
  • fusioncatcher.log - log of the entire run (e.g. all commands/programs which have been run, command line arguments used, running time for each command, etc.).
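
As a quick sanity check after a run, one may verify that the main output files listed above are present in the output directory. The following is a minimal Python sketch; the helper name missing_outputs is hypothetical, and which files are actually produced may depend on the options and aligners used, so a missing file is not necessarily an error.

```python
from pathlib import Path

# File names taken from the list above; which of them exist depends on the run.
EXPECTED_OUTPUTS = [
    "final-list_candidate_fusion_genes.txt",
    "summary_candidate_fusions.txt",
    "info.txt",
    "fusioncatcher.log",
]


def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are not present in output_dir."""
    out = Path(output_dir)
    return [name for name in expected if not (out / name).is_file()]
```

For example, missing_outputs("/some/output/directory/") returns an empty list when all four files exist.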

FusionCatcher reports:

  • multiple times (up to four times) exactly the same candidate fusion gene, which has exactly the same fusion points/junction (i.e. FusionCatcher will output separately the fusions found for each of its four aligners/methods such that it is easy to see what method was used to find a fusion gene)
  • reciprocal fusion genes if they are found (e.g. geneA-geneB and also geneB-geneA)
  • every alternative splicing event found for each fusion gene (i.e. alternative fusion isoforms of the same fusion gene)
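
Because the same junction may be listed once per aligner, it can be convenient to collapse such rows and to spot reciprocal fusions. The sketch below assumes the rows were already loaded as dictionaries keyed by the column names of Table 1 (for instance with csv.DictReader and a tab delimiter, which is an assumption about the file layout); the function names are hypothetical, and the column names should be checked against the header of one's own file.

```python
from collections import defaultdict

# Column names as described in Table 1 (verify against the file header).
G1 = "Gene_1_symbol(5end_fusion_partner)"
G2 = "Gene_2_symbol(3end_fusion_partner)"
P1 = "Fusion_point_for_gene_1(5end_fusion_partner)"
P2 = "Fusion_point_for_gene_2(3end_fusion_partner)"
METHOD = "Fusion_finding_method"


def collapse_by_junction(rows):
    """Group rows describing the same junction and list the methods that found it."""
    methods = defaultdict(set)
    for row in rows:
        key = (row[G1], row[G2], row[P1], row[P2])
        methods[key].add(row[METHOD])
    return {key: sorted(found) for key, found in methods.items()}


def reciprocal_pairs(rows):
    """Return gene pairs (sorted alphabetically) reported in both orientations."""
    seen = {(row[G1], row[G2]) for row in rows}
    return sorted(
        {tuple(sorted(pair)) for pair in seen
         if pair[0] != pair[1] and (pair[1], pair[0]) in seen}
    )
```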

Table 1 - Column descriptions for file final-list_candidate_fusion_genes.txt

Column Description
Gene_1_symbol(5end_fusion_partner) Gene symbol of the 5’ end fusion partner
Gene_2_symbol(3end_fusion_partner) Gene symbol of the 3’ end fusion partner
Gene_1_id(5end_fusion_partner) Ensembl gene id of the 5’ end fusion partner
Gene_2_id(3end_fusion_partner) Ensembl gene id of the 3’ end fusion partner
Exon_1_id(5end_fusion_partner) Ensembl exon id of the 5’ end fusion exon-exon junction
Exon_2_id(3end_fusion_partner) Ensembl exon id of the 3’ end fusion exon-exon junction
Fusion_point_for_gene_1(5end_fusion_partner) Chromosomal position of the 5’ end of fusion junction (chromosome:position:strand); 1-based coordinate
Fusion_point_for_gene_2(3end_fusion_partner) Chromosomal position of the 3’ end of fusion junction (chromosome:position:strand); 1-based coordinate
Spanning_pairs Count of pairs of reads supporting the fusion (including also the multimapping reads)
Spanning_unique_reads Count of unique reads (i.e. unique mapping positions) mapping on the fusion junction. In short, this counts all the reads which map on the fusion junction minus the PCR-duplicated reads.
Longest_anchor_found Longest anchor (hangover) found among the unique reads mapping on the fusion junction
Fusion_finding_method Aligning method used for mapping the reads and finding the fusion genes. Five methods are used: (i) BOWTIE = only the Bowtie aligner is used for mapping the reads on the genome and exon-exon fusion junctions, (ii) BOWTIE+BLAT = Bowtie is used for mapping reads on the genome and BLAT is used for mapping reads for finding the fusion junction, (iii) BOWTIE+STAR = Bowtie is used for mapping reads on the genome and STAR is used for mapping reads for finding the fusion junction, (iv) BOWTIE+BOWTIE2 = Bowtie is used for mapping reads on the genome and Bowtie2 is used for mapping reads for finding the fusion junction, and (v) BOWTIE+BWA = Bowtie is used for mapping reads on the genome and BWA is used for mapping reads for finding the fusion junction.
Fusion_sequence The inferred fusion junction (the asterisk sign marks the junction point)
Fusion_description Type of the fusion gene (see the Table 2)
Counts_of_common_mapping_reads Count of reads mapping simultaneously on both genes which form the fusion gene. This is an indication of how similar the DNA/RNA sequences of the genes forming the fusion gene are (i.e. what their homology is, because highly homologous genes tend to show up as candidate fusion genes). If the sequences of the genes involved in forming a fusion gene are completely different, the value is expected to be zero.
Predicted_effect Predicted effect of the candidate fusion gene using the annotation from Ensembl database. This is shown in format effect_gene_1/effect_gene_2, where the possible values for effect_gene_1 or effect_gene_2 are: intergenic, intronic, exonic(no-known-CDS), UTR, CDS(not-reliable-start-or-end), CDS(truncated), or CDS(complete). In case that the fusion junction for both genes is within their CDS (coding sequence) then only the values in-frame or out-of-frame will be shown.
Predicted_fused_transcripts All possible known fused transcripts in format ENSEMBL-TRANSCRIPT-1:POSITION-1/ENSEMBL-TRANSCRIPT-2:POSITION-2, where the sequence 1:POSITION-1 of transcript ENSEMBL-TRANSCRIPT-1 is fused with the sequence POSITION-2:END of transcript ENSEMBL-TRANSCRIPT-2
Predicted_fused_proteins Predicted amino acid sequences of all possible fused proteins (separated by “;”).
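
Two of the columns in Table 1 have a small internal structure (chromosome:position:strand and effect_gene_1/effect_gene_2) that is easy to take apart programmatically. The following Python sketch is only an illustration: the helper names are hypothetical and the example values in the usage note are made up.

```python
from typing import NamedTuple


class FusionPoint(NamedTuple):
    chromosome: str
    position: int  # 1-based coordinate, as stated in Table 1
    strand: str


def parse_fusion_point(text):
    """Parse a 'chromosome:position:strand' value into a FusionPoint."""
    chromosome, position, strand = text.strip().rsplit(":", 2)
    return FusionPoint(chromosome, int(position), strand)


def parse_predicted_effect(text):
    """Split a Predicted_effect value into (effect_gene_1, effect_gene_2).

    When both junctions lie within a CDS only 'in-frame' or 'out-of-frame'
    is reported; that single value is returned for both genes.
    """
    text = text.strip()
    if "/" in text:
        first, second = text.split("/", 1)
        return first, second
    return text, text
```

For instance, parse_fusion_point("2:42264951:+").position gives the integer coordinate, and parse_predicted_effect("CDS(truncated)/UTR") gives one value per gene.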

Table 2 - Labels used to describe the found fusion genes (column Fusion_description from file final-list_candidate_fusion_genes.txt)

Fusion_description Description
1000genomes fusion gene has been seen in a healthy sample. It has been found in RNA-seq data from some samples from 1000 genomes project. A candidate fusion gene having this label has a very high probability of being a false positive.
18cancers fusion gene found in an RNA-seq dataset of 18 types of cancers from 600 tumor samples published here.
adjacent both genes forming the fusion are adjacent on the genome (i.e. same strand and there is no other genes situated between them on the same strand)
antisense one or both genes is a gene coding for antisense RNA
banned fusion gene is on a list of known false positive fusion genes. These were found with very strong supporting data in healthy samples (i.e. it showed up in file final-list_candidate_fusion_genes.txt). A candidate fusion gene having this label has a very high probability of being a false positive.
bodymap2 fusion gene is on a list of known false positive fusion genes. It has been found in healthy human samples collected from 16 organs from Illumina BodyMap2 RNA-seq database. A candidate fusion gene having this label has a very high probability of being a false positive.
cacg known conjoined genes (that is fusion genes found in samples from healthy patients) from the CACG database (please see CACG database for more information). A candidate fusion gene having this label has a very high probability of being a false positive in case that one looks for fusion genes specific to a disease.
cell_lines known fusion gene from paper: C. Klijn et al., A comprehensive transcriptional portrait of human cancer cell lines, Nature Biotechnology, Dec. 2014, DOI:10.1038/nbt.3080
cgp known fusion gene from the CGP database
chimerdb2 known fusion gene from the ChimerDB 2 database
chimerdb3kb known fusion gene from the ChimerDB 3 KB (literature curation) database
chimerdb3pub known fusion gene from the ChimerDB 3 PUB (PubMed articles) database
chimerdb3seq known fusion gene from the ChimerDB 3 SEQ (TCGA) database
conjoing known conjoined genes (that is fusion genes found in samples from healthy patients) from the ConjoinG database (please use ConjoinG database for more information regarding the fusion gene). A candidate fusion gene having this label has a very high probability of being a false positive in case that one looks for fusion genes specific to a disease.
cosmic known fusion gene from the COSMIC database (please use COSMIC database for more information regarding the fusion gene)
cta one gene or both genes is CTA gene (that is that the gene name starts with CTA-). A candidate fusion gene having this label has a very high probability of being a false positive.
ctb one gene or both genes is CTB gene (that is that the gene name starts with CTB-). A candidate fusion gene having this label has a very high probability of being a false positive.
ctc one gene or both genes is CTC gene (that is that the gene name starts with CTC-). A candidate fusion gene having this label has a very high probability of being a false positive.
ctd one gene or both genes is CTD gene (that is that the gene name starts with CTD-). A candidate fusion gene having this label has a very high probability of being a false positive.
distance1000bp both genes are on the same strand and they are less than 1 000 bp apart. A candidate fusion gene having this label has a very high probability of being a false positive.
distance100kbp both genes are on the same strand and they are less than 100 000 bp apart. A candidate fusion gene having this label has a higher probability than expected of being a false positive.
distance10kbp both genes are on the same strand and they are less than 10 000 bp apart. A candidate fusion gene having this label has a higher probability than expected of being a false positive.
duplicates both genes involved in the fusion gene are paralogs of each other. For more information see the Duplicated Genes Database (DGD). A candidate fusion gene having this label has a higher probability than expected of being a false positive.
exon-exon the fusion junction point is exactly at the known exon’s borders of both genes forming the candidate fusion
ensembl_fully_overlapping the genes forming the fusion gene are fully overlapping according to Ensembl database. A candidate fusion gene having this label has a very high probability of being a false positive.
ensembl_partially_overlapping the genes forming the fusion gene are partially overlapping (on same strand or on different strands) according to the Ensembl database. A candidate fusion gene having this label has a good probability of being a false positive.
ensembl_same_strand_overlapping the genes forming the fusion gene are fully/partially overlapping and are both on the same strand according to Ensembl database. A candidate fusion gene having this label has a very high probability of being a false positive (this is most likely an alternative splicing event).
fragments the genes forming the fusion are supported by only one fragment of RNA. A candidate fusion gene having this label has a medium probability of being a false positive.
gliomas fusion gene found in an RNA-seq dataset of 272 glioblastomas published here.
gtex fusion gene has been seen in a healthy sample. It has been found in GTEx database of healthy tissues (thru FusionAnnotator). A candidate fusion gene having this label has a very high probability of being a false positive.
healthy fusion gene has been seen in a healthy sample. These have been found in healthy samples but the support for them is less strong (i.e. paired reads were found to map on both genes but no fusion junction was found) than in the case of the banned label (i.e. it showed up in the file with the preliminary list of candidate fusion genes). Also genes which have some degree of sequence similarity may show up marked like this. A candidate fusion gene having this label has a small probability of being a false positive in case that one looks for fusion genes specific to a disease.
hpa fusion gene has been seen in a healthy sample. It has been found in RNA-seq database of 27 healthy tissues. A candidate fusion gene having this label has a very high probability of being a false positive.
known fusion gene which has been previously reported or published in scientific articles/reports/books/abstracts/databases indexed by Google, Google Scholar, PubMed, etc. This label has only the role to answer with YES or NO the question “has ever before a given (candidate) fusion gene been published or reported?”. This label does not have in anyway the role to provide the original references to the original scientific articles/reports/books/abstracts/databases for a given fusion gene.
lincrna one or both genes is a lincRNA
matched-normal candidate fusion gene (which is supported by paired reads mapping on both genes and also by reads mapping on the junction point) was found also in the matched normal sample given as input to the command line option ‘--normal’
metazoa one or both genes is a metazoa_srp gene (Metazoa_srp)
mirna one or both genes is a miRNA
mt one or both genes are situated on mitochondrion. A candidate fusion gene having this label has a very high probability of being a false positive.
mX (where X is a number) count of pairs of reads supporting the fusion (excluding the multimapping reads).
non_cancer_tissues fusion gene which has been previously reported/found in non-cancer tissues and cell lines in Babiceanu et al, Recurrent chimeric fusion RNAs in non-cancer tissues and cells, Nucl. Acids Res. 2016. These are considered as non-somatic mutation and therefore they may be skipped and not reported.
non_tumor_cells fusion gene which has been previously reported/found in non-tumor cell lines, like for example HEK293. These are considered as non-somatic mutation and therefore may be skipped and not reported.
no_protein one or both genes have no known protein product
oesophagus fusion gene found in oesophageal tumors from TCGA samples, which are published here.
oncogene one gene or both genes are a known oncogene according to ONGENE database
cancer one gene or both genes are cancer associated according to Cancer Gene database
tumor one gene or both genes are proto-oncogenes or tumor suppressor genes according to UniProt database
pair_pseudo_genes one gene is the other’s pseudogene. A candidate fusion gene having this label has a very high probability of being a false positive.
pancreases known fusion gene found in pancreatic tumors from article: P. Bailey et al., Genomic analyses identify molecular subtypes of pancreatic cancer, Nature, Feb. 2016, http://dx.doi.org/10.1038/nature16965
paralogs both genes involved in the fusion gene are paralogs of each other (most likely this is a false positive fusion gene). A candidate fusion gene having this label has a very high probability of being a false positive.
multi one of the genes or both have multi-mapping reads (i.e. reads which map simultaneously also on other gene/genes)
partial-matched-normal candidate fusion gene (which is supported by paired reads mapping on both genes but no reads were found which map on the junction point) was found also in the matched normal sample given as input to the command line option ‘--normal’. This is much weaker than matched-normal.
prostates known fusion gene found in 150 prostate tumors RNAs from paper: D. Robinson et al., Integrative Clinical Genomics of Advanced Prostate Cancer, Cell, Vol. 161, May 2015, http://dx.doi.org/10.1016/j.cell.2015.05.001
pseudogene one or both of the genes is a pseudogene
readthrough the fusion gene is a readthrough event (that is both genes forming the fusion are on the same strand and there is no known gene situated in between); Please notice, that many of readthrough fusion genes might be false positive fusion genes due to errors in Ensembl database annotation (for example, one gene is annotated in Ensembl database as two separate genes). A candidate fusion gene having this label has a high probability of being a false positive.
refseq_fully_overlapping the genes forming the fusion gene are fully overlapping according to RefSeq NCBI database. A candidate fusion gene having this label has a very high probability of being a false positive.
refseq_partially_overlapping the genes forming the fusion gene are partially overlapping (on same strand or on different strands) according to the RefSeq NCBI database. A candidate fusion gene having this label has a good probability of being a false positive.
refseq_same_strand_overlapping the genes forming the fusion gene are fully/partially overlapping and are both on the same strand according to RefSeq NCBI database. A candidate fusion gene having this label has a very high probability of being a false positive (this is most likely an alternative splicing event).
ribosomal one or both genes is a gene encoding a ribosomal protein
rp11 one gene or both genes is RP11 gene (that is that the gene name starts with RP11-). A candidate fusion gene having this label has a very high probability of being a false positive.
rp one gene or both genes is RP?? gene (that is that the gene name starts with RP??-) where ? is a digit. A candidate fusion gene having this label has a very high probability of being a false positive.
rrna one or both genes is a rRNA. A candidate fusion gene having this label has a very high probability of being a false positive.
short_distance both genes are on the same strand and they are less than X bp apart, where X is set using the option ‘--dist-fusion’ and by default it is 200 000 bp. A candidate fusion gene having this label has a higher probability than expected of being a false positive.
similar_reads both genes have the same reads which map simultaneously on both of them (this is an indicator of how similar are the sequences of both genes; ideally this should be zero or as close to zero as possible for a real fusion). A candidate fusion gene having this label has a very high probability of being a false positive.
similar_symbols both genes have the same or very similar gene names (for example: RP11ADF.1 and RP11ADF.2). A candidate fusion gene having this label has a very high probability of being a false positive.
snorna one or both genes is a snoRNA
snrna one or both genes is a snRNA
tcga known fusion gene from the TCGA database (please use Google for more information regarding the fusion gene)
ticdb known fusion gene from the TICdb database (please use TICdb database for more information regarding the fusion gene)
trna one or both genes is a tRNA
ucsc_fully_overlapping the genes forming the fusion gene are fully overlapping according to UCSC database. A candidate fusion gene having this label has a very high probability of being a false positive.
ucsc_partially_overlapping the genes forming the fusion gene are partially overlapping (on same strand or on different strands) according to the UCSC database. A candidate fusion gene having this label has a good probability of being a false positive.
ucsc_same_strand_overlapping the genes forming the fusion gene are fully/partially overlapping and are both on the same strand according to UCSC database. A candidate fusion gene having this label has a very high probability of being a false positive (this is most likely an alternative splicing event).
yrna one or both genes is a Y RNA
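
The labels of Table 2 lend themselves to a simple triage of candidates. The following Python sketch shows one possible way to do this; it assumes that the labels in column Fusion_description are separated by commas (an assumption, check your own output), the function names are hypothetical, and LIKELY_FALSE_POSITIVE is only an excerpt of the labels that Table 2 describes as having a very high probability of marking a false positive.

```python
import re

# Excerpt of labels described in Table 2 as "very high probability of being
# a false positive"; extend it as needed.
LIKELY_FALSE_POSITIVE = {
    "1000genomes", "banned", "bodymap2", "gtex", "hpa", "mt",
    "pair_pseudo_genes", "paralogs", "rrna", "similar_reads", "similar_symbols",
}


def split_labels(description, sep=","):
    """Split a Fusion_description cell into labels (the separator is an assumption)."""
    return [label.strip() for label in description.split(sep) if label.strip()]


def false_positive_labels(description):
    """Return the labels of this candidate that are in LIKELY_FALSE_POSITIVE."""
    return sorted(set(split_labels(description)) & LIKELY_FALSE_POSITIVE)


def non_multimapping_pairs(description):
    """Return X from an 'mX' label (pairs excluding multimapping reads), or None."""
    for label in split_labels(description):
        match = re.fullmatch(r"m(\d+)", label)
        if match:
            return int(match.group(1))
    return None
```

A candidate for which false_positive_labels returns a non-empty list deserves extra scrutiny before wet-lab validation.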

6.3 - Visualization

FusionCatcher also outputs zipped FASTA files containing the reads which support the found candidate fusion genes. The files are:

  • supporting-reads_gene-fusions_BOWTIE.zip,
  • supporting-reads_gene-fusions_BLAT.zip,
  • supporting-reads_gene-fusions_STAR.zip,
  • supporting-reads_gene-fusions_BOWTIE2.zip, and
  • supporting-reads_gene-fusions_BWA.zip.

The reads which support the:

  • junction of the candidate fusion have their name ending with _supports_fusion_junction, and
  • candidate fusion (i.e. one read maps on one gene and its paired read maps on the other fusion gene) have their name ending with _supports_fusion_pair.
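
The two suffixes make it straightforward to count how many supporting reads of each kind a FASTA file contains. Below is a minimal Python sketch; the function names are hypothetical, and it assumes that the suffix is the very end of the FASTA header line, as described above.

```python
from collections import Counter

JUNCTION_SUFFIX = "_supports_fusion_junction"
PAIR_SUFFIX = "_supports_fusion_pair"


def classify_read_name(name):
    """Return 'junction', 'pair' or 'other' depending on the read name suffix."""
    name = name.strip().lstrip(">@")
    if name.endswith(JUNCTION_SUFFIX):
        return "junction"
    if name.endswith(PAIR_SUFFIX):
        return "pair"
    return "other"


def count_support(fasta_lines):
    """Count FASTA records of each kind, looking only at header lines ('>')."""
    return Counter(
        classify_read_name(line) for line in fasta_lines if line.startswith(">")
    )
```

The function count_support accepts any iterable of lines, for example an open file handle of a FASTA file extracted from one of the ZIP archives.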

These supporting reads (given as FASTA and FASTQ files) may be used for further visualization purposes. For example, one may align these supporting reads oneself using one’s favourite:

  • aligner (e.g. Bowtie/Bowtie2/TopHat/STAR/GSNAP/etc.),
  • version/assembly of genome,
  • mapping format output (e.g. SAM/BAM), and
  • NGS visualizer (e.g. IGV/UCSC Genome Browser/etc.)

6.3.1 - UCSC Genome Browser

For example, the sequences of supporting reads for a given candidate fusion gene may be visualized using the UCSC Genome Browser by aligning them with the UCSC Genome Browser’s BLAT aligner (i.e. copy and paste the reads into the BLAT tool of the UCSC Genome Browser -> click the button Submit -> navigate in the UCSC Genome Browser to the genes that form the fusion gene). Zooming out several times also gives a better view.

6.3.2 - PSL format

If one uses the --visualization-psl command line option of the FusionCatcher then the BLAT alignment of the supporting reads will be done automatically by the FusionCatcher and the results are saved in PSL format files with names that are ending with _reads.psl in the:

  • supporting-reads_gene-fusions_BOWTIE.zip,
  • supporting-reads_gene-fusions_BLAT.zip,
  • supporting-reads_gene-fusions_STAR.zip,
  • supporting-reads_gene-fusions_BOWTIE2.zip, and
  • supporting-reads_gene-fusions_BWA.zip.

The files with names ending in _reads.psl may be used further for visualization of the candidate fusion genes using UCSC Genome Browser, IGV (Integrative Genome Viewer) or any other viewer/browser which supports the PSL format.

Note: If one generated the build files using fusioncatcher-build.py, the command line option --visualization-psl should work just fine. If one downloaded the build files, then the command line option --visualization-psl will not work and it needs to be enabled by first manually creating the file fusioncatcher/data/current/genome.2bit for FusionCatcher, something like this (here the assumption is that the build files for one’s organism of interest are in fusioncatcher/data/current/):

# re-build the genome index using BLAT where the genome is given FASTA file genome.fa
fusioncatcher/tools/bowtie/bowtie-inspect fusioncatcher/data/current/genome_index/ > fusioncatcher/data/current/genome.fa
fusioncatcher/tools/blat/faToTwoBit fusioncatcher/data/current/genome.fa fusioncatcher/data/current/genome.2bit -noMask
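
The _reads.psl files described in this section can also be inspected programmatically. PSL is a tab-separated format with 21 columns per alignment (matches in column 1, strand in column 9, query name in column 10, target name in column 14). The Python sketch below, with hypothetical function names, summarizes on which target sequences the supporting reads were aligned; it skips any header lines BLAT may have written.

```python
from collections import Counter


def iter_psl_records(lines):
    """Yield (query_name, target_name, strand, matches) for each PSL alignment line.

    Header and blank lines are skipped by keeping only lines that have
    21 tab-separated fields and start with a number.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 21 or not fields[0].isdigit():
            continue
        yield fields[9], fields[13], fields[8], int(fields[0])


def reads_per_target(lines):
    """Count, for each target sequence (e.g. chromosome), the distinct reads aligned to it."""
    seen = {(query, target) for query, target, _, _ in iter_psl_records(lines)}
    return Counter(target for _, target in seen)
```

For a true fusion one would expect the supporting reads to be spread over the loci of both partner genes.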

6.3.3 - SAM format

6.3.3.1 - Automatic method

If one uses the --visualization-sam command line option of the FusionCatcher then the BOWTIE2 alignment of the supporting reads will be done automatically by the FusionCatcher and the results are saved as SAM files with names that are ending with _reads.sam in the:

  • supporting-reads_gene-fusions_BOWTIE.zip,
  • supporting-reads_gene-fusions_BLAT.zip,
  • supporting-reads_gene-fusions_STAR.zip,
  • supporting-reads_gene-fusions_BOWTIE2.zip, and
  • supporting-reads_gene-fusions_BWA.zip.

The files with names ending in _reads.sam (please note that they still need to be converted to BAM, coordinate-sorted, and indexed first) may be used further for visualization of the candidate fusion genes using UCSC Genome Browser, IGV (Integrative Genome Viewer) or any other viewer/browser which supports the SAM format.

6.3.3.2 - Manual method

Here is a rough example of manually aligning the supporting reads (named supporting_reads.fq in the example below; the FASTQ files needed here are the files ending in _reads.fq from the ZIP archives supporting-reads_gene-fusions_*.zip produced by FusionCatcher) using different aligners.

  • Bowtie2 aligner (where your_choice_of_genome_bowtie2_index is a Bowtie2 index of the genome of your choice, e.g. human)
      • alignment done ignoring the paired-end information (i.e. like single reads):

```
bowtie2 \
--local \
-k 10 \
-x your_choice_of_genome_bowtie2_index \
-U supporting_reads.fq \
-S fusion_genes.sam

# samtools >= 1.3 syntax; older releases expect: samtools sort - fusion_genes.sorted
samtools view -bS fusion_genes.sam | samtools sort -o fusion_genes.sorted.bam -

samtools index fusion_genes.sorted.bam
```

      • alignment done taking into account the paired-end information:

```
# split the interleaved reads into two files (8 lines per read pair)
cat supporting_reads.fq | \
paste - - - - - - - - | \
awk -F'\t' '{print $1"\n"$2"\n"$3"\n"$4 > "r1.fq"; print $5"\n"$6"\n"$7"\n"$8 > "r2.fq"}'

bowtie2 \
--local \
-k 10 \
-x your_choice_of_genome_bowtie2_index \
-1 r1.fq \
-2 r2.fq \
-S fusion_genes.sam

samtools view -bS fusion_genes.sam | samtools sort -o fusion_genes.sorted.bam -

samtools index fusion_genes.sorted.bam
```

  • STAR aligner (where your_choice_of_genome_star_index should be built according to the STAR Manual)
      • alignment done ignoring the paired-end information (i.e. like single reads):

```
# STAR writes its alignments to Aligned.out.sam under the given prefix
STAR \
--genomeDir your_choice_of_genome_star_index \
--alignSJoverhangMin 9 \
--chimSegmentMin 17 \
--readFilesIn supporting_reads.fq \
--outFileNamePrefix ./

samtools view -bS Aligned.out.sam | samtools sort -o fusion_genes.sorted.bam -

samtools index fusion_genes.sorted.bam
```

      • alignment done taking into account the paired-end information:

```
# split the interleaved reads into two files (8 lines per read pair)
cat supporting_reads.fq | \
paste - - - - - - - - | \
awk -F'\t' '{print $1"\n"$2"\n"$3"\n"$4 > "r1.fq"; print $5"\n"$6"\n"$7"\n"$8 > "r2.fq"}'

STAR \
--genomeDir /your_choice_of_genome_star_index/ \
--alignSJoverhangMin 9 \
--chimSegmentMin 17 \
--readFilesIn r1.fq r2.fq \
--outFileNamePrefix ./

samtools view -bS Aligned.out.sam | samtools sort -o fusion_genes.sorted.bam -

samtools index fusion_genes.sorted.bam
```

  • BLAT aligner (where your_choice_of_genome_blat_index should be built according to BLAT’s examples)

```
# build the genome index using BLAT where the genome is given as the FASTA file genome.fa
faToTwoBit genome.fa genome.2bit -noMask

# align the supporting reads given by FusionCatcher (the FASTA
# file for your fusion of interest can be found in ZIP files
# generated as output by FusionCatcher,
# e.g. EML4--ALK__42264951--29223528_reads.fa) using BLAT aligner
blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 genome.2bit supporting_reads.fa supporting_reads_mapped.psl

# visualize the PSL file supporting_reads_mapped.psl in IGV or run psl2sam.pl to convert it into SAM format
psl2sam.pl supporting_reads_mapped.psl > supporting_reads_mapped.sam
```

Further, the files fusion_genes.sorted.bam and fusion_genes.sorted.bam.bai may be used with your favourite NGS visualizer!
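
The paste/awk one-liner used in the paired-end examples above splits an interleaved FASTQ file (four lines of the first mate followed by four lines of the second mate) into r1.fq and r2.fq. For readers who prefer Python, the same step can be sketched as follows; the function names are hypothetical, and the interleaved layout is an assumption inferred from that one-liner.

```python
def split_interleaved_fastq(lines):
    """Split interleaved FASTQ lines (8 lines per read pair) into two lists of lines.

    Lines 1-4 of every group of 8 go to the first mate and lines 5-8 to the
    second mate, mirroring the paste/awk one-liner.
    """
    lines = [line.rstrip("\n") for line in lines]
    while lines and not lines[-1]:
        lines.pop()  # ignore trailing empty lines
    if len(lines) % 8 != 0:
        raise ValueError("expected a multiple of 8 lines (interleaved paired reads)")
    mate1, mate2 = [], []
    for start in range(0, len(lines), 8):
        mate1.extend(lines[start:start + 4])
        mate2.extend(lines[start + 4:start + 8])
    return mate1, mate2


def write_split(in_path, r1_path="r1.fq", r2_path="r2.fq"):
    """Read in_path and write the two mates to r1_path and r2_path."""
    with open(in_path) as handle:
        mate1, mate2 = split_interleaved_fastq(handle)
    for path, records in ((r1_path, mate1), (r2_path, mate2)):
        with open(path, "w") as out:
            out.write("\n".join(records) + ("\n" if records else ""))
```

Calling write_split("supporting_reads.fq") produces the r1.fq and r2.fq files expected by the Bowtie2 and STAR commands.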

6.3.4 - Chimera R/BioConductor package

For visualization of fusion genes found by FusionCatcher one may use also the R/BioConductor package Chimera, which supports FusionCatcher.

6.4 - Examples

6.4.1 - Example 1

Here is an example of how FusionCatcher can be used to search for fusion genes in a human RNA-seq sample where:

  1. any distance at chromosomal level between the candidate fusion genes is acceptable, and
  2. the candidate fusion genes are allowed to be readthroughs (i.e. the genes forming a fusion gene may be adjacent on the chromosome)
  3. the candidate fusion genes are not allowed to be less than 1000 bp apart on the same strand
  4. use several methods to find the fusion genes (i.e. use the BOWTIE, BLAT, and STAR aligners, which are enabled by default, for mapping the reads; this allows finding the fusion genes even when the annotation from the Ensembl database is not entirely correct, for example finding a fusion junction even if it is in the middle of an exon or intron)
    fusioncatcher \
    -d /some/human/data/directory/ \
    -i /some/input/directory/containing/fastq/files/ \
    -o /some/output/directory/
    

6.4.2 - Example 2

Here is an example of how FusionCatcher can be used to search for fusion genes in a human RNA-seq sample where:

  1. any distance at chromosomal level between the candidate fusion genes is acceptable, and
  2. the candidate fusion genes are not allowed to be readthroughs (i.e. there is still at least one known gene situated on the same strand in between the genes which form the candidate fusion gene)
  3. the candidate fusion genes are not allowed to be less than 1000 bp apart on the same strand
  4. use only one method to find the fusion genes (i.e. use only BOWTIE aligner for mapping the reads and this allows to find the fusion genes only in the case that the annotation from Ensembl database is correct, like for example find a fusion junction only if it matches perfectly the known exon borders)
    fusioncatcher \
    -d /some/human/data/directory/ \
    -i /some/input/directory/containing/fastq/files/ \
    -o /some/output/directory/ \
    --skip-readthroughs \
    --skip-blat
    

7 - ALIGNERS

7.1 - Bowtie

By default, FusionCatcher uses the Bowtie aligner for finding candidate fusion genes. This approach relies heavily on how good the annotation data is for the given organism in the Ensembl database. If, for example, a gene is not annotated well and has several exons which are not annotated in the Ensembl database, and if one of these exons is the one involved in the fusion point, then this fusion gene will not be found by using only the Bowtie aligner. In order to find also the fusion genes where the junction point is in the middle of exons or introns, FusionCatcher uses by default the BLAT and STAR aligners in addition to the Bowtie aligner. The command line options ‘--skip-blat’, ‘--skip-star’, ‘--skip-bowtie2’, or ‘--skip-bwa’ should be used in order to specify which aligners should not be used. The command line option ‘--aligners’ specifies which aligners should be used. For example, ‘--aligners=blat,star,bowtie2,bwa’ forces FusionCatcher to use all aligners for finding fusion genes.
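
When FusionCatcher is launched from a script, the skip options named above can be assembled programmatically. The following is a minimal Python sketch, not part of FusionCatcher itself: build_command and SKIP_FLAGS are hypothetical names, and only the options mentioned in this section are used.

```python
# Mapping from aligner name to the skip option described in this section.
SKIP_FLAGS = {
    "blat": "--skip-blat",
    "star": "--skip-star",
    "bowtie2": "--skip-bowtie2",
    "bwa": "--skip-bwa",
}


def build_command(data_dir, input_dir, output_dir, skip=()):
    """Return the argument list for a fusioncatcher run that skips the given aligners."""
    unknown = sorted(set(skip) - set(SKIP_FLAGS))
    if unknown:
        raise ValueError("no skip option known for: " + ", ".join(unknown))
    command = ["fusioncatcher", "-d", data_dir, "-i", input_dir, "-o", output_dir]
    command += [flag for name, flag in SKIP_FLAGS.items() if name in skip]
    return command
```

The returned list can be passed to subprocess.run, or joined with spaces to print the equivalent shell command.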

7.2 - Bowtie and Blat

The use of Bowtie and Blat aligners is the default approach of FusionCatcher for finding fusion genes.

In order not to use this approach, add the command line option ‘--skip-blat’ (or remove the string blat from the aligners line in the file fusioncatcher/etc/configuration.cfg), as follows:

fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-blat

Please read the license of the Blat aligner before using this approach to check whether you are allowed to use Blat, because FusionCatcher runs the Blat aligner whenever this approach is enabled.

7.3 - Bowtie and STAR

The use of Bowtie and STAR aligners is the default approach of FusionCatcher for finding fusion genes.

In order not to use this approach, add the command line option ‘--skip-star’, as follows:

fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-star

7.4 - Bowtie and Bowtie2

The use of Bowtie and Bowtie2 aligners is not the default approach of FusionCatcher for finding fusion genes.

In order not to use this approach, add the command line option ‘--skip-bowtie2’, as follows:

fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-bowtie2

In order to use this approach, the command line option ‘--aligners’ should contain the string ‘bowtie2’, for example:

fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--aligners blat,star,bowtie2

7.5 - Bowtie and BWA

The use of Bowtie and BWA aligners is not the default approach of FusionCatcher for finding fusion genes.

In order not to use this approach, add the command line option ‘--skip-bwa’, as follows:

fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-bwa

In order to use this approach, the command line option ‘--aligners’ should contain the string ‘bwa’, for example:

fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--aligners blat,star,bwa

8 - COMMAND LINE OPTIONS

fusioncatcher

It searches for fusion genes and/or translocations in RNA-seq data (paired-end reads FASTQ files produced by Illumina next-generation sequencing platforms like Illumina Solexa and Illumina HiSeq) in diseased samples. Its command line is:

fusioncatcher [options]

and the command line options are:

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i INPUT_FILENAME, --input=INPUT_FILENAME
                        The input file(s) or directory. The files should be in
                        FASTQ or SRA format and may or may not be compressed
                        using gzip or zip. A list of files can be specified by
                        giving the filenames separated by commas. If a
                        directory is given then all the files found in it with
                        the following extensions are analyzed: .sra, .fastq,
                        .fastq.zip, .fastq.gz, .fastq.bz2, .fastq.xz, .fq,
                        .fq.zip, .fq.gz, .fq.bz2, .fq.xz, .txt, .txt.zip,
                        .txt.gz, .txt.bz2.
  --batch               If this is used then batch mode is used and the input
                        specified using '--input' or '-i' is: (i) a
                        tab-separated text file with one sample per line,
                        where the first column contains the full
                        pathnames/URLs of the sample's FASTQ files, separated
                        by commas, and an optional second column contains the
                        name of the sample, or (ii) an input directory which
                        contains several subdirectories such that each
                        subdirectory corresponds to only one sample and
                        contains all the FASTQ files of that sample. This is
                        useful when several samples need to be analyzed.
  --single-end          If this is used then it is assumed that all the input
                        reads are single-end reads which must be longer than
                        130 bp. By default it is assumed that all input reads
                        are paired-end reads.
  -I NORMAL_MATCHED_FILENAME, --normal=NORMAL_MATCHED_FILENAME
                        The input file(s) or directory containing the healthy
                        normal-matched data. They should be given in the same
                        format as for '--input'. In case that this option is
                        used then the files/directory given to '--input' is
                        considered to be from the sample of a patient with
                        disease. This is optional.
  -o OUTPUT_DIRECTORY, --output=OUTPUT_DIRECTORY
                        The output directory where all the output files
                        containing information about the found candidate
                        fusion genes are written. Default is 'none'.
  -d DATA_DIRECTORY, --data=DATA_DIRECTORY
                        The data directory where all the annotations files
                        from Ensembl database are placed, e.g. 'data/'. This
                        directory should be built using 'fusioncatcher-build'.
                        If it is not used then it is read from configuration
                        file specified with '--config' from 'data = ...' line.
  -T TMP_DIRECTORY, --tmp=TMP_DIRECTORY
                        The temporary directory where all the output files
                        and directories will be written. Default is directory
                        'tmp' in the output directory specified with
                        '--output'.
  -p PROCESSES, --threads=PROCESSES
                        Number of processes/threads to be used for running
                        SORT, Bowtie, BLAT, STAR, BOWTIE2 and other
                        tools/programs. If it is 0 (as it is by default) then
                        the number of processes/threads will be read first
                        from 'fusioncatcher/etc/configuration.cfg' file. If
                        it is still set to 0 even there then
                        'min(number-of-CPUs-found,16)' processes will be used.
                        Setting the number of threads in
                        'fusioncatcher/etc/configuration.cfg' might be useful
                        in situations where one server is shared between
                        several users, in order to prevent FusionCatcher from
                        using all the CPUs/resources. Default is '0'.
  --config=CONFIGURATION_FILENAME
                        Configuration file containing the paths to external
                        tools (e.g. Bowtie, Blat, fastq-dump) in case that
                        they are not specified in the PATH variable. Default
                        is
                        '/apps/fusioncatcher/etc/configuration.cfg,/apps/fusioncatcher/bin/configuration.cfg'.
  -z, --skip-update-check
                        Skips the automatic routine that contacts the
                        FusionCatcher server to check for a more recent
                        version. Default is 'False'.
  -V, --keep-viruses-alignments
                        If it is set then the SAM alignments files of reads
                        mapping on viruses genomes are saved in the output
                        directory for later inspection by the user. Default is
                        'False'.
  -U, --keep-unmapped-reads
                        If it is set then the FASTQ files, containing the
                        unmapped reads (i.e. reads which do not map on genome
                        and transcriptome), are saved in the output directory
                        for later inspection by the user. Default is 'False'.
  --aligners=ALIGNERS   The aligners to be used in addition to the Bowtie
                        aligner. The BOWTIE aligner is always used by default
                        and it cannot be disabled. The choices are:
                        ['blat','star','bowtie2','bwa']. Any combination of
                        these is accepted if the aligners' names are comma
                        separated. For example, if one wants to use all four
                        aligners then 'blat,star,bowtie2,bwa' should be given.
                        The command line options '--skip-blat', '--skip-star',
                        and '--skip-bowtie2' have priority over this option.
                        If the first element in the list is the configuration
                        file (that is '.cfg' file) of FusionCatcher then the
                        aligners specified in the configuration file will be
                        used (and the rest of the aligners specified here will
                        be ignored). In case that the configuration file is
                        not found then the remaining aligners from the list
                        will be used. Default is
                        '/apps/fusioncatcher/etc/configuration.cfg,blat,star'.
  --skip-blat           If it is set then the pipeline will NOT use the BLAT
                        aligner and all options and methods which make use of
                        BLAT will be disabled. BLAT aligner is used by
                        default. Please, note that BLAT license does not allow
                        BLAT to be used for commercial activities. For more
                        information regarding BLAT please see its license:
                        <http://users.soe.ucsc.edu/~kent/src/>. Default is
                        'False'.
  --skip-star           If it is set then the pipeline will NOT use the STAR
                        aligner and all options and methods which make use of
                        STAR will be disabled. STAR aligner is used by
                        default. Default is 'False'.
  --sort-buffer-size=SORT_BUFFER_SIZE
                        It specifies the buffer size for command SORT. Default
                        is '80%' if less than 32 GB of RAM is installed,
                        otherwise it is set to 26 GB.
  --start=START_STEP    It re-starts executing the workflow/pipeline from the
                        given step number. This can be used when the pipeline
                        has crashed/stopped and one wants to re-run it from
                        the step where it stopped without re-running the
                        entire pipeline from the beginning. 0 is for
                        restarting automatically and 1 is the first step.
                        Default is '0'.
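
As an illustration of the '--batch' and '--threads' options listed above, the following shell sketch prepares a batch run without launching it. The file name samples.tsv, the FASTQ paths, and the sample names are made up for the example, and default_threads is a hypothetical helper that only mirrors the documented 'min(number-of-CPUs-found,16)' fallback.

```shell
#!/bin/sh
# Batch file layout as described for '--batch': one sample per line,
# column 1 = comma-separated FASTQ paths, optional column 2 = sample name.
# All paths and sample names below are placeholders.
batch_file="samples.tsv"
printf '%s\t%s\n' \
    "/some/input/s1_R1.fastq.gz,/some/input/s1_R2.fastq.gz" "sample1" \
    "/some/input/s2_R1.fastq.gz,/some/input/s2_R2.fastq.gz" "sample2" \
    > "$batch_file"

# Hypothetical helper mirroring the documented fallback used when
# '--threads' is 0 everywhere: min(number of CPUs found, 16).
default_threads() {
    cpus="$1"
    if [ "$cpus" -lt 16 ]; then
        printf '%s\n' "$cpus"
    else
        printf '%s\n' 16
    fi
}

threads="$(default_threads "$(getconf _NPROCESSORS_ONLN)")"

# Print the command instead of running it (dry run).
printf 'fusioncatcher --batch -i %s -d %s -o %s -p %s\n' \
    "$batch_file" /some/human/data/directory/ /some/output/directory/ "$threads"
```

Passing '-p' explicitly is optional here, since leaving it at 0 should lead to the same value according to the description above. It is shown only to make the fallback visible.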


fusioncatcher-build

It downloads the necessary data for a given organism from the Ensembl database and builds the files/indexes which are needed to run FusionCatcher. Its command line is:

fusioncatcher-build [options]

and the command line options are:

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -o OUTPUT_DIRECTORY, --output=OUTPUT_DIRECTORY
                        The output directory where all the output files and
                        directories will be written.
  -c CONFIGURATION_FILENAME, --config=CONFIGURATION_FILENAME
                        Configuration file containing the paths to external
                        tools (e.g. Bowtie, etc.) in case that they are not in
                        PATH! Default is
                        '/apps/fusioncatcher/bin/../etc/configuration.cfg,/apps/fusioncatcher/bin/configuration.cfg'.
  -g ORGANISM, --organism=ORGANISM
                        Organism for which the data is downloaded from Ensembl
                        database and built, for example: 'homo_sapiens',
                        'mus_musculus', 'rattus_norvegicus',
                        'canis_familiaris', etc. Default is 'homo_sapiens'.
  -w WEB_ENSEMBL, --web=WEB_ENSEMBL
                        Ensembl database web site from where the data is
                        downloaded.  e.g. 'www.ensembl.org',
                        'uswest.ensembl.org', 'useast.ensembl.org',
                        'asia.ensembl.org', etc. Default is 'www.ensembl.org'.
  -e FTP_ENSEMBL, --ftp-ensembl=FTP_ENSEMBL
                        Ensembl database FTP site from where the data is
                        downloaded. Default is 'ftp.ensembl.org'.
  --ftp-ensembl-path=FTP_ENSEMBL_PATH
                        The path for Ensembl database FTP site from where the
                        data is downloaded.
  -x FTP_UCSC, --ftp-ucsc=FTP_UCSC
                        UCSC database FTP site from where the data is
                        downloaded. Default is 'hgdownload.cse.ucsc.edu'.
  -n FTP_NCBI, --ftp-ncbi=FTP_NCBI
                        NCBI database FTP site from where the data is
                        downloaded. Default is 'ftp.ncbi.nlm.nih.gov'.
  --skip-blat           If it is set then the pipeline will NOT use the BLAT
                        aligner and all options and methods which make use of
                        BLAT will be disabled. BLAT aligner is used by
                        default. Please, note that BLAT license does not allow
                        BLAT to be used for commercial activities. For more
                        information regarding BLAT please see its license:
                        <http://users.soe.ucsc.edu/~kent/src/>. Default is
                        'False'.
  --enlarge-genes       If it is set then the genes are enlarged (i.e. their
                        introns are also included in the transcriptome).
                        Default is 'False'.
  -p PROCESSES, --threads=PROCESSES
                        Number of processes/threads to be used. Default is
                        '0'.
  --skip-database=SKIP_DATABASE
                        If it is set then the pipeline will skip the specified
                        database(s). The choices are
                        ['cosmic','conjoing','chimerdb2','ticdb','cgp','cacg'].
                        If several databases should be skipped, then their
                        names shall be separated by comma. Default is ''.
  -s START_STEP, --start=START_STEP
                        It starts executing the workflow from the given step
                        number. This can be used when the pipeline has
                        crashed/stopped and one wants to re-run it from the
                        step where it stopped without re-running the entire
                        pipeline from the beginning. 0 is for restarting
                        automatically and 1 is the first step. This is
                        intended to be used for debugging. Default is '0'.
  -l HASH, --hash=HASH  Hash to be used for computing checksum. The choices
                        are ['no','crc32','md5','adler32','sha512','sha256'].
                        If it is set up to 'no' then no checksum is used and
                        the entire pipeline is executed as a normal shell
                        script. For more information see 'hash_library' in
                        'workflow.py'. This is intended to be used for
                        debugging. Default is 'no'.
  -k, --keep            Preserve intermediate files produced during the run.
                        By default, they are deleted upon exit. This is
                        intended to be used for debugging. Default value is
                        'False'.
  -u CHECKSUMS_FILENAME, --checksums=CHECKSUMS_FILENAME
                        The name of the checksums file. This is intended to be
                        used for debugging. Default value is 'checksums.txt'.
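
A typical invocation combines only a few of the options above. The shell sketch below assembles such a command line and prints it rather than executing it. The helper name build_command and the directory paths are hypothetical, and only the '-g', '-o', and '--skip-database' options documented above are used.

```shell
#!/bin/sh
# Hypothetical helper: assembles a fusioncatcher-build command line from the
# options documented above and prints it, without running anything.
build_command() {
    organism="$1"    # value for -g/--organism, e.g. homo_sapiens
    out_dir="$2"     # value for -o/--output
    skip_dbs="$3"    # optional comma-separated value for --skip-database
    cmd="fusioncatcher-build -g $organism -o $out_dir"
    if [ -n "$skip_dbs" ]; then
        cmd="$cmd --skip-database=$skip_dbs"
    fi
    printf '%s\n' "$cmd"
}

# Human data, skipping two of the optional databases:
build_command homo_sapiens /some/human/data/directory/ cosmic,ticdb
# Mouse data with no database skipped (third argument omitted):
build_command mus_musculus /some/mouse/data/directory/
```

The directory given to '-o' here is the one later passed to 'fusioncatcher' with '-d'.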


9 - METHODS

The main goal of FusionCatcher is to find somatic (and/or pathogenic) fusion genes in RNA-seq data.

FusionCatcher does its own quality filtering/trimming of reads. This is needed because a very important factor for finding fusion genes in an RNA-seq experiment is the length of the RNA fragments. Ideally, the RNA fragment size for finding fusion genes should be over 300 bp. Most RNA-seq experiments are designed for differential expression analyses and not for finding fusion genes, so the RNA fragment size is often less than 300 bp, and the trimming and quality filtering should be done in such a way that it does not reduce the RNA fragment size even further.

FusionCatcher is able to find fusion genes even in cases where the fusion junction is within a known exon or within a known intron (for example in the middle of an intron) due to the use of the BLAT aligner. The minimum condition for FusionCatcher to find a fusion gene is that both genes involved in the fusion are annotated in the Ensembl database (even if their gene structure is “wrong”).

FusionCatcher spends most of the computational analysis on the most promising candidate fusion genes and tries to filter out as early as possible the candidate fusion genes which do not look promising, for example:

  • candidate fusion gene is composed of a gene and its pseudogene, or
  • candidate fusion gene is composed of a gene and its paralog gene, or
  • candidate fusion gene is composed of a gene and a miRNA gene (but a gene which contains miRNA genes is not skipped), or
  • candidate fusion gene is composed of two genes which have a very high sequence similarity (i.e. FusionCatcher computes their homology score), or
  • candidate fusion gene is known to be found in samples from healthy persons (using the 16 organs RNA-seq data from the Illumina BodyMap2), or
  • candidate fusion gene is in one of the known databases of fusion genes found in healthy persons, i.e. ChimerDB2, CACG, and ConjoinG.
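
The early filtering listed above can be pictured as a simple label-based filter. The shell sketch below is only an illustration under stated assumptions: the three-column table, the gene names, and the label names (pseudogene, paralog, mirna, similar_sequence, healthy) are invented for the example and are not FusionCatcher's real intermediate format or labels.

```shell
#!/bin/sh
# Hypothetical candidate table (tab-separated): gene 1, gene 2, and a
# comma-separated list of labels. Gene names and label names are invented
# for this sketch and are not FusionCatcher's real output format.
candidates="$(printf '%s\t%s\t%s\n' \
    GENE_A GENE_B   "" \
    GENE_C GENE_CP1 "pseudogene" \
    GENE_D GENE_E   "paralog" \
    GENE_F GENE_G   "mirna" \
    GENE_H GENE_I   "similar_sequence" \
    GENE_J GENE_K   "healthy,known_database")"

# Drop every candidate that carries at least one disqualifying label and
# print the surviving gene pairs. The label field is wrapped in commas so
# that each label is matched as a whole word.
kept="$(printf '%s\n' "$candidates" | awk -F '\t' '
    ("," $3 ",") ~ /,(pseudogene|paralog|mirna|similar_sequence|healthy),/ { next }
    { print $1 "--" $2 }')"

printf '%s\n' "$kept"
# prints: GENE_A--GENE_B
```

Filtering on cheap labels before any expensive alignment step is what lets most of the computation go to the remaining candidates.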

FusionCatcher uses by default three aligners for mapping the reads: Bowtie, BLAT, and STAR. STAR is used here solely for “splitting” the reads while aligning them.