Agenda, resources and other useful information
FusionCatcher searches for somatic novel/known fusion genes, translocations and/or chimeras in RNA-seq data (stranded/unstranded paired-end/single-end reads FASTQ files produced by Illumina next-generation sequencing platforms like Illumina Solexa/HiSeq/NextSeq/MiSeq/MiniSeq) from diseased samples. Unfortunately, we do not have the resources and time to discuss this amazing program. It was the first program for somatic fusion detection with biological annotation, and it was one of the major contributions to the development of STAR-Fusion.
The aims of FusionCatcher are:
FusionCatcher produces a list of candidate fusion genes using the given input data. It is recommended that this list of candidate fusion genes is further validated in the wet lab using, for example, PCR/FISH experiments.
The output files are:
- `final-list_candidate_fusion_genes.txt` - final list with the newly found candidate fusion genes (it contains the fusion genes with their junction sequence and points); starting with version 0.99.3c the coordinates of fusion genes are given here for the human genome using only assembly hg38/GRCh38; see Table 1 for the columns' descriptions;
- `final-list_candidate_fusion_genes.GRCh37.txt` - final list with the newly found candidate fusion genes (it contains the fusion genes with their junction sequence and points); starting with version 0.99.3d the coordinates of fusion genes are given here for the human genome using assembly hg19/GRCh37; see Table 1 for the columns' descriptions;
- `summary_candidate_fusions.txt` - contains an executive summary (meant to be read directly by medical doctors or biologists) of the candidate fusion genes found;
- `final-list_candidate_fusion_genes.caption.md.txt` - explains in detail the labels found in column `Fusion_description` of files `final-list_candidate_fusion_genes.txt` and `final-list_candidate_fusion_genes.GRCh37.txt`;
- `supporting-reads_gene-fusions_BOWTIE.zip` - sequences of short reads supporting the newly found candidate fusion genes found using only the Bowtie aligner;
- `supporting-reads_gene-fusions_BLAT.zip` - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and Blat aligners;
- `supporting-reads_gene-fusions_STAR.zip` - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and STAR aligners;
- `supporting-reads_gene-fusions_BOWTIE2.zip` - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and Bowtie2 aligners;
- `supporting-reads_gene-fusions_BWA.zip` - sequences of short reads supporting the newly found candidate fusion genes found using Bowtie and BWA aligners;
- `viruses_bacteria_phages.txt` - (non-zero) read counts for each virus/bacteria/phage from NCBI database ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/;
- `info.txt` - information regarding genome version, Ensembl database version, versions of tools used, read counts, etc.;
- `fusioncatcher.log` - log of the entire run (e.g. all commands/programs which have been run, command line arguments used, running time for each command, etc.).

FusionCatcher reports:
Table 1 - Columns description for file final-list_candidate-fusion-genes.txt
Column | Description |
---|---|
Gene_1_symbol(5end_fusion_partner) | Gene symbol of the 5’ end fusion partner |
Gene_2_symbol(3end_fusion_partner) | Gene symbol of the 3’ end fusion partner |
Gene_1_id(5end_fusion_partner) | Ensembl gene id of the 5’ end fusion partner |
Gene_2_id(3end_fusion_partner) | Ensembl gene id of the 3’ end fusion partner |
Exon_1_id(5end_fusion_partner) | Ensembl exon id of the 5’ end fusion exon-exon junction |
Exon_2_id(3end_fusion_partner) | Ensembl exon id of the 3’ end fusion exon-exon junction |
Fusion_point_for_gene_1(5end_fusion_partner) | Chromosomal position of the 5’ end of fusion junction (chromosome:position:strand); 1-based coordinate |
Fusion_point_for_gene_2(3end_fusion_partner) | Chromosomal position of the 3’ end of fusion junction (chromosome:position:strand); 1-based coordinate |
Spanning_pairs | Count of pairs of reads supporting the fusion (including also the multimapping reads) |
Spanning_unique_reads | Count of unique reads (i.e. unique mapping positions) mapping on the fusion junction. In short, all the reads which map on the fusion junction are counted here, minus the PCR-duplicated reads. |
Longest_anchor_found | Longest anchor (hangover) found among the unique reads mapping on the fusion junction |
Fusion_finding_method | Aligning method used for mapping the reads and finding the fusion genes. Five methods are used: (i) BOWTIE = only the Bowtie aligner is used for mapping the reads on the genome and exon-exon fusion junctions, (ii) BOWTIE+BLAT = Bowtie aligner is used for mapping reads on the genome and BLAT is used for mapping reads for finding the fusion junction, (iii) BOWTIE+STAR = Bowtie aligner is used for mapping reads on the genome and STAR is used for mapping reads for finding the fusion junction, (iv) BOWTIE+BOWTIE2 = Bowtie aligner is used for mapping reads on the genome and Bowtie2 is used for mapping reads for finding the fusion junction, and (v) BOWTIE+BWA = Bowtie aligner is used for mapping reads on the genome and BWA is used for mapping reads for finding the fusion junction. |
Fusion_sequence | The inferred fusion junction (the asterisk sign marks the junction point) |
Fusion_description | Type of the fusion gene (see the Table 2) |
Counts_of_common_mapping_reads | Count of reads mapping simultaneously on both genes which form the fusion gene. This is an indication of how similar the DNA/RNA sequences of the genes forming the fusion gene are (i.e. what their homology is, because highly homologous genes tend to show up as candidate fusion genes). If the sequences of the genes forming the fusion gene are completely different, the value here is expected to be zero. |
Predicted_effect | Predicted effect of the candidate fusion gene using the annotation from Ensembl database. This is shown in format effect_gene_1/effect_gene_2, where the possible values for effect_gene_1 or effect_gene_2 are: intergenic, intronic, exonic(no-known-CDS), UTR, CDS(not-reliable-start-or-end), CDS(truncated), or CDS(complete). In case that the fusion junction for both genes is within their CDS (coding sequence) then only the values in-frame or out-of-frame will be shown. |
Predicted_fused_transcripts | All possible known fused transcripts in format ENSEMBL-TRANSCRIPT-1:POSITION-1/ENSEMBL-TRANSCRIPT-2:POSITION-2, where the sequence 1:POSITION-1 of transcript ENSEMBL-TRANSCRIPT-1 is fused with the sequence POSITION-2:END of transcript ENSEMBL-TRANSCRIPT-2 |
Predicted_fused_proteins | Predicted amino acid sequences of all possible fused proteins (separated by “;”). |
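The columns of Table 1 can be read programmatically. The following is a minimal sketch, not part of FusionCatcher itself. It assumes that the report is a tab-separated text file whose first row holds the column names from Table 1. The helper names (`read_candidates`, `parse_fusion_point`, `split_fusion_sequence`) and the sample row are hypothetical and only serve as illustration.

```python
import csv
import io
from typing import Dict, List, NamedTuple, TextIO, Tuple


class FusionPoint(NamedTuple):
    chromosome: str
    position: int  # 1-based coordinate, as stated in Table 1
    strand: str


def parse_fusion_point(value: str) -> FusionPoint:
    """Split a 'chromosome:position:strand' value into its three parts."""
    chromosome, position, strand = value.rsplit(":", 2)
    return FusionPoint(chromosome, int(position), strand)


def split_fusion_sequence(value: str) -> Tuple[str, str]:
    """Split the junction sequence at the asterisk marking the junction point."""
    left, _, right = value.partition("*")
    return left, right


def read_candidates(handle: TextIO) -> List[Dict[str, str]]:
    """Read the report; ASSUMES a tab-separated file with a header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Invented two-column example; a real report has all columns of Table 1.
    sample = io.StringIO(
        "Fusion_point_for_gene_1(5end_fusion_partner)\tFusion_sequence\n"
        "17:12345:-\tACGT*TTGA\n"
    )
    for row in read_candidates(sample):
        print(parse_fusion_point(row["Fusion_point_for_gene_1(5end_fusion_partner)"]))
        print(split_fusion_sequence(row["Fusion_sequence"]))
```

In practice one would pass an opened report file (for example the file described in Table 1) to `read_candidates` instead of the in-memory sample.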
Table 2 - Labels used to describe the found fusion genes (column Fusion_description from file final-list_candidate-fusion-genes.txt)
Fusion_description | Description |
---|---|
1000genomes | fusion gene has been seen in a healthy sample. It has been found in RNA-seq data from some samples from 1000 genomes project. A candidate fusion gene having this label has a very high probability of being a false positive. |
18cancers | fusion gene found in an RNA-seq dataset of 18 types of cancers from 600 tumor samples published here. |
adjacent | both genes forming the fusion are adjacent on the genome (i.e. same strand and there is no other genes situated between them on the same strand) |
antisense | one or both genes is a gene coding for antisense RNA |
banned | fusion gene is on a list of known false positive fusion genes. These were found with very strong supporting data in healthy samples (i.e. it showed up in file final-list_candidate_fusion_genes.txt). A candidate fusion gene having this label has a very high probability of being a false positive. |
bodymap2 | fusion gene is on a list of known false positive fusion genes. It has been found in healthy human samples collected from 16 organs from Illumina BodyMap2 RNA-seq database. A candidate fusion gene having this label has a very high probability of being a false positive. |
cacg | known conjoined genes (that is fusion genes found in samples from healthy patients) from the CACG database (please see CACG database for more information). A candidate fusion gene having this label has a very high probability of being a false positive in case that one looks for fusion genes specific to a disease. |
cell_lines | known fusion gene from paper: C. Klijn et al., A comprehensive transcriptional portrait of human cancer cell lines, Nature Biotechnology, Dec. 2014, DOI:10.1038/nbt.3080 |
cgp | known fusion gene from the CGP database |
chimerdb2 | known fusion gene from the ChimerDB 2 database |
chimerdb3kb | known fusion gene from the ChimerDB 3 KB (literature curation) database |
chimerdb3pub | known fusion gene from the ChimerDB 3 PUB (PubMed articles) database |
chimerdb3seq | known fusion gene from the ChimerDB 3 SEQ (TCGA) database |
conjoing | known conjoined genes (that is fusion genes found in samples from healthy patients) from the ConjoinG database (please use ConjoinG database for more information regarding the fusion gene). A candidate fusion gene having this label has a very high probability of being a false positive in case that one looks for fusion genes specific to a disease. |
cosmic | known fusion gene from the COSMIC database (please use COSMIC database for more information regarding the fusion gene) |
cta | one gene or both genes is CTA gene (that is that the gene name starts with CTA-). A candidate fusion gene having this label has a very high probability of being a false positive. |
ctb | one gene or both genes is CTB gene (that is that the gene name starts with CTB-). A candidate fusion gene having this label has a very high probability of being a false positive. |
ctc | one gene or both genes is CTC gene (that is that the gene name starts with CTC-). A candidate fusion gene having this label has a very high probability of being a false positive. |
ctd | one gene or both genes is CTD gene (that is that the gene name starts with CTD-). A candidate fusion gene having this label has a very high probability of being a false positive. |
distance1000bp | both genes are on the same strand and they are less than 1 000 bp apart. A candidate fusion gene having this label has a very high probability of being a false positive. |
distance100kbp | both genes are on the same strand and they are less than 100 000 bp apart. A candidate fusion gene having this label has a higher probability than expected of being a false positive. |
distance10kbp | both genes are on the same strand and they are less than 10 000 bp apart. A candidate fusion gene having this label has a higher probability than expected of being a false positive. |
duplicates | both genes involved in the fusion gene are paralogs of each other. For more, see the Duplicated Genes Database (DGD). A candidate fusion gene having this label has a higher probability than expected of being a false positive. |
exon-exon | the fusion junction point is exactly at the known exon’s borders of both genes forming the candidate fusion |
ensembl_fully_overlapping | the genes forming the fusion gene are fully overlapping according to Ensembl database. A candidate fusion gene having this label has a very high probability of being a false positive. |
ensembl_partially_overlapping | the genes forming the fusion gene are partially overlapping (on same strand or on different strands) according to the Ensembl database. A candidate fusion gene having this label has a good probability of being a false positive. |
ensembl_same_strand_overlapping | the genes forming the fusion gene are fully/partially overlapping and are both on the same strand according to Ensembl database. A candidate fusion gene having this label has a very high probability of being a false positive (this is most likely an alternative splicing event). |
fragments | the genes forming the fusion are supported by only one fragment of RNA. A candidate fusion gene having this label has a medium probability of being a false positive. |
gliomas | fusion gene found in an RNA-seq dataset of 272 glioblastomas published here. |
gtex | fusion gene has been seen in a healthy sample. It has been found in GTEx database of healthy tissues (thru FusionAnnotator). A candidate fusion gene having this label has a very high probability of being a false positive. |
healthy | fusion gene has been seen in a healthy sample. These have been found in healthy samples but the support for them is less strong (i.e. paired reads were found to map on both genes but no fusion junction was found) than in the case of the banned label (i.e. it showed up in file preliminary list of candidate fusion genes). Also, genes which have some degree of sequence similarity may show up marked like this. A candidate fusion gene having this label has a small probability of being a false positive in case that one looks for fusion genes specific to a disease. |
hpa | fusion gene has been seen in a healthy sample. It has been found in RNA-seq database of 27 healthy tissues. A candidate fusion gene having this label has a very high probability of being a false positive. |
known | fusion gene which has been previously reported or published in scientific articles/reports/books/abstracts/databases indexed by Google, Google Scholar, PubMed, etc. This label has only the role to answer with YES or NO the question “has ever before a given (candidate) fusion gene been published or reported?”. This label does not have in anyway the role to provide the original references to the original scientific articles/reports/books/abstracts/databases for a given fusion gene. |
lincrna | one or both genes is a lincRNA |
matched-normal | candidate fusion gene (which is supported by paired reads mapping on both genes and also by reads mapping on the junction point) was found also in the matched normal sample given as input to the command line option ‘–normal’ |
metazoa | one or both genes is a metazoa_srp gene (Metazoa_srp) |
mirna | one or both genes is a miRNA |
mt | one or both genes are situated on mitochondrion. A candidate fusion gene having this label has a very high probability of being a false positive. |
mX (where X is a number) | count of pairs of reads supporting the fusion (excluding the multimapping reads). |
non_cancer_tissues | fusion gene which has been previously reported/found in non-cancer tissues and cell lines in Babiceanu et al, Recurrent chimeric fusion RNAs in non-cancer tissues and cells, Nucl. Acids Res. 2016. These are considered as non-somatic mutation and therefore they may be skipped and not reported. |
non_tumor_cells | fusion gene which has been previously reported/found in non-tumor cell lines, like for example HEK293. These are considered as non-somatic mutation and therefore may be skipped and not reported. |
no_protein | one or both genes have no known protein product |
oesophagus | fusion gene found in oesophageal tumors from TCGA samples, which are published here. |
oncogene | one gene or both genes are a known oncogene according to ONGENE database |
cancer | one gene or both genes are cancer associated according to Cancer Gene database |
tumor | one gene or both genes are proto-oncogene or tumor suppressor gene according to UniProt database |
pair_pseudo_genes | one gene is the other’s pseudogene. A candidate fusion gene having this label has a very high probability of being a false positive. |
pancreases | known fusion gene found in pancreatic tumors from article: P. Bailey et al., Genomic analyses identify molecular subtypes of pancreatic cancer, Nature, Feb. 2016, http://dx.doi.org/10.1038/nature16965 |
paralogs | both genes involved in the fusion gene are paralogs of each other (most likely this is a false positive fusion gene). A candidate fusion gene having this label has a very high probability of being a false positive. |
multi | one of the genes or both have multi-mapping reads (i.e. reads which also map simultaneously on other gene/genes) |
partial-matched-normal | candidate fusion gene (which is supported by paired reads mapping on both genes but no reads were found which map on the junction point) was found also in the matched normal sample given as input to the command line option ‘–normal’. This is much weaker than matched-normal. |
prostates | known fusion gene found in 150 prostate tumors RNAs from paper: D. Robinson et al., Integrative Clinical Genomics of Advanced Prostate Cancer, Cell, Vol. 161, May 2015, http://dx.doi.org/10.1016/j.cell.2015.05.001 |
pseudogene | one or both of the genes is a pseudogene |
readthrough | the fusion gene is a readthrough event (that is both genes forming the fusion are on the same strand and there is no known gene situated in between); Please notice, that many of readthrough fusion genes might be false positive fusion genes due to errors in Ensembl database annotation (for example, one gene is annotated in Ensembl database as two separate genes). A candidate fusion gene having this label has a high probability of being a false positive. |
refseq_fully_overlapping | the genes forming the fusion gene are fully overlapping according to RefSeq NCBI database. A candidate fusion gene having this label has a very high probability of being a false positive. |
refseq_partially_overlapping | the genes forming the fusion gene are partially overlapping (on same strand or on different strands) according to the RefSeq NCBI database. A candidate fusion gene having this label has a good probability of being a false positive. |
refseq_same_strand_overlapping | the genes forming the fusion gene are fully/partially overlapping and are both on the same strand according to RefSeq NCBI database. A candidate fusion gene having this label has a very high probability of being a false positive (this is most likely an alternative splicing event). |
ribosomal | one or both gene is a gene encoding for ribosomal protein |
rp11 | one gene or both genes is RP11 gene (that is that the gene name starts with RP11-). A candidate fusion gene having this label has a very high probability of being a false positive. |
rp | one gene or both genes is RP?? gene (that is that the gene name starts with RP??-) where ? is a digit. A candidate fusion gene having this label has a very high probability of being a false positive. |
rrna | one or both genes is a rRNA. A candidate fusion gene having this label has a very high probability of being a false positive. |
short_distance | both genes are on the same strand and they are less than X bp apart, where X is set using the option ‘–dist-fusion’ and by default it is 200 000 bp. A candidate fusion gene having this label has a higher probability than expected of being a false positive. |
similar_reads | both genes have the same reads which map simultaneously on both of them (this is an indicator of how similar are the sequences of both genes; ideally this should be zero or as close to zero as possible for a real fusion). A candidate fusion gene having this label has a very high probability of being a false positive. |
similar_symbols | both genes have the same or very similar gene names (for example: RP11ADF.1 and RP11ADF.2). A candidate fusion gene having this label has a very high probability of being a false positive. |
snorna | one or both genes is a snoRNA |
snrna | one or both genes is a snRNA |
tcga | known fusion gene from the TCGA database (please use Google for more information regarding the fusion gene) |
ticdb | known fusion gene from the TICdb database (please use TICdb database for more information regarding the fusion gene) |
trna | one or both genes is a tRNA |
ucsc_fully_overlapping | the genes forming the fusion gene are fully overlapping according to UCSC database. A candidate fusion gene having this label has a very high probability of being a false positive. |
ucsc_partially_overlapping | the genes forming the fusion gene are partially overlapping (on same strand or on different strands) according to the UCSC database. A candidate fusion gene having this label has a good probability of being a false positive. |
ucsc_same_strand_overlapping | the genes forming the fusion gene are fully/partially overlapping and are both on the same strand according to UCSC database. A candidate fusion gene having this label has a very high probability of being a false positive (this is most likely an alternative splicing event). |
yrna | one or both genes is a Y RNA |
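The labels of Table 2 lend themselves to a simple triage of the candidate list. The sketch below is only an illustration and not part of FusionCatcher. It assumes that the `Fusion_description` value of one candidate is a comma-separated list of labels. The set `VERY_HIGH_FP_LABELS` is a hand-picked subset of the labels that Table 2 describes as having a very high probability of being a false positive, and the function names are hypothetical.

```python
from typing import Set

# Subset of Table 2 labels described as "very high probability of being a
# false positive". The selection is an assumption made for this sketch.
VERY_HIGH_FP_LABELS = frozenset({
    "1000genomes", "banned", "bodymap2", "gtex", "hpa",
    "cta", "ctb", "ctc", "ctd", "rp11", "rp",
    "distance1000bp", "mt", "rrna", "pair_pseudo_genes", "paralogs",
    "similar_reads", "similar_symbols",
    "ensembl_fully_overlapping", "ensembl_same_strand_overlapping",
    "refseq_fully_overlapping", "refseq_same_strand_overlapping",
    "ucsc_fully_overlapping", "ucsc_same_strand_overlapping",
})


def labels_of(description: str) -> Set[str]:
    """Split a Fusion_description value; ASSUMES comma-separated labels."""
    return {label.strip() for label in description.split(",") if label.strip()}


def is_likely_false_positive(description: str) -> bool:
    """True if at least one label belongs to VERY_HIGH_FP_LABELS."""
    return bool(labels_of(description) & VERY_HIGH_FP_LABELS)


if __name__ == "__main__":
    # Invented label combinations, for illustration only.
    for text in ("known,oncogene,exon-exon", "readthrough,banned"):
        print(text, "->", is_likely_false_positive(text))
```

Such a filter only prioritizes candidates for inspection. It does not replace the wet-lab validation of the candidate fusion genes.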
FusionCatcher also outputs zipped FASTA files containing the reads which support the found candidate fusion genes. The files are:
- `supporting-reads_gene-fusions_BOWTIE.zip`,
- `supporting-reads_gene-fusions_BLAT.zip`,
- `supporting-reads_gene-fusions_STAR.zip`,
- `supporting-reads_gene-fusions_BOWTIE2.zip`, and
- `supporting-reads_gene-fusions_BWA.zip`.

The supporting reads are marked with:
- `_supports_fusion_junction`, and
- `_supports_fusion_pair`.

These supporting reads (given as FASTA and FASTQ files) may be used for further visualization purposes. For example, one may align these supporting reads oneself using one's favourite aligner (Bowtie/Bowtie2/TopHat/STAR/GSNAP/etc.).

For example, the sequences of supporting reads for a given candidate fusion gene may be visualized using the UCSC Genome Browser by aligning them with the UCSC Genome Browser’s BLAT aligner (i.e. copy and paste the reads into the BLAT tool of UCSC Genome Browser –> click the button Submit –> navigate in the UCSC Genome Browser to the genes that form the fusion gene). Zooming out several times also gives a better view here.
If one uses the `--visualization-psl` command line option of FusionCatcher, then the BLAT alignment of the supporting reads is done automatically by FusionCatcher and the results are saved in PSL format files with names ending in `_reads.psl` in the:
- `supporting-reads_gene-fusions_BOWTIE.zip`,
- `supporting-reads_gene-fusions_BLAT.zip`,
- `supporting-reads_gene-fusions_STAR.zip`,
- `supporting-reads_gene-fusions_BOWTIE2.zip`, and
- `supporting-reads_gene-fusions_BWA.zip`.

The files with names ending in `_reads.psl` may be used further for visualization of the candidate fusion genes using UCSC Genome Browser, IGV (Integrative Genomics Viewer) or any other viewer/browser which supports the PSL format.
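Finding the alignment files inside the ZIP archives can be scripted. The following is a minimal sketch using only the Python standard library. It assumes nothing about the archive layout beyond the file name suffix mentioned in the text, and the function name `list_members` and the member names in the demonstration are hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, List, Union


def list_members(archive: Union[str, BinaryIO], suffix: str = "_reads.psl") -> List[str]:
    """Return the sorted names of archive members whose name ends with suffix."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(name for name in zf.namelist() if name.endswith(suffix))


if __name__ == "__main__":
    # Build a small in-memory archive with invented member names.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("GENE_A--GENE_B_reads.psl", "")
        zf.writestr("GENE_A--GENE_B_reads.fq", "")
    print(list_members(buffer))
```

With a real run, the path of one of the `supporting-reads_gene-fusions_*.zip` archives would be passed instead of the in-memory buffer, and `suffix` can be changed to select other members of the archive.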
Note: If one generated the build files using `fusioncatcher-build.py`, the command line option `--visualization-psl` should work just fine. If one downloaded the build files, then the command line option `--visualization-psl` will not work and it needs to be enabled by first manually creating the file `fusioncatcher/data/current/genome.2bit` for FusionCatcher, something like this (here the assumption is that the build files for one’s organism of interest are in `fusioncatcher/data/current/`):
# re-build the genome index using BLAT where the genome is given FASTA file genome.fa
fusioncatcher/tools/bowtie/bowtie-inspect fusioncatcher/data/current/genome_index/ > fusioncatcher/data/current/genome.fa
fusioncatcher/tools/blat/faToTwoBit fusioncatcher/data/current/genome.fa fusioncatcher/data/current/genome.2bit -noMask
If one uses the `--visualization-sam` command line option of FusionCatcher, then the BOWTIE2 alignment of the supporting reads is done automatically by FusionCatcher and the results are saved as SAM files with names ending in `_reads.sam` in the:
- `supporting-reads_gene-fusions_BOWTIE.zip`,
- `supporting-reads_gene-fusions_BLAT.zip`,
- `supporting-reads_gene-fusions_STAR.zip`,
- `supporting-reads_gene-fusions_BOWTIE2.zip`, and
- `supporting-reads_gene-fusions_BWA.zip`.

The files with names ending in `_reads.sam` (please note that they still need to be converted to BAM, coordinate sorted and indexed first) may be used further for visualization of the candidate fusion genes using UCSC Genome Browser, IGV (Integrative Genomics Viewer) or any other viewer/browser which supports the SAM format.
Here is a rough example of manually aligning the supporting reads (named `supporting_reads.fq` in the example below; the FASTQ files needed here are the files ending in `_reads.fq` from the ZIP archives `supporting-reads_gene-fusions_*.zip` produced by FusionCatcher) using different aligners.
- Bowtie2 aligner (where `your_choice_of_genome_bowtie2_index` may be, for human, for example this)
```
bowtie2 \
  --local \
  -k 10 \
  -x your_choice_of_genome_bowtie2_index \
  -1 r1.fq \
  -2 r2.fq \
  -S fusion_genes.sam
# convert to BAM, sort by coordinate and index
# (samtools 1.3 or newer; older releases used: samtools sort - fusion_genes.sorted)
samtools view -bS fusion_genes.sam | samtools sort -o fusion_genes.sorted.bam -
samtools index fusion_genes.sorted.bam
```
- STAR aligner (where `your_choice_of_genome_star_index` should be built according to the STAR Manual)
```
# the prefix ./ makes STAR write Aligned.out.sam into the current directory
STAR \
  --genomeDir /your_choice_of_genome_star_index/ \
  --alignSJoverhangMin 9 \
  --chimSegmentMin 17 \
  --readFilesIn r1.fq r2.fq \
  --outFileNamePrefix ./
samtools view -bS Aligned.out.sam | samtools sort -o fusion_genes.sorted.bam -
samtools index fusion_genes.sorted.bam
```
- BLAT aligner (where `your_choice_of_genome_blat_index` should be built according to BLAT’s examples)
```
faToTwoBit genome.fa genome.2bit -noMask
blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 genome.2bit supporting_reads.fa supporting_reads_mapped.psl
psl2sam.pl supporting_reads_mapped.psl > supporting_reads_mapped.sam
```
Further, the files `fusion_genes.sorted.bam` and `fusion_genes.sorted.bam.bai` may be used with your favourite NGS visualizer!
For visualization of fusion genes found by FusionCatcher one may also use the R/BioConductor package Chimera, which supports FusionCatcher.
Here is an example of how FusionCatcher can be used to search for fusion genes in a human RNA-seq sample:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/
Here is an example of how FusionCatcher can be used to search for fusion genes in a human RNA-seq sample, this time with the options `--skip-readthroughs` and `--skip-blat`:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-readthroughs \
--skip-blat
By default, FusionCatcher uses the Bowtie aligner for finding candidate fusion genes. This approach relies heavily on how good the annotation data for the given organism is in the Ensembl database. If, for example, a gene is not annotated well and has several exons which are not annotated in the Ensembl database, and if one of these exons is the one involved in the fusion point, then this fusion gene will not be found by using only the Bowtie aligner. In order to also find the fusion genes where the junction point is in the middle of exons or introns, *FusionCatcher* uses by default the BLAT and STAR aligners in addition to the Bowtie aligner. The command line options `--skip-blat`, `--skip-star`, `--skip-bowtie2`, or `--skip-bwa` should be used in order to specify which aligners should not be used. The command line option `--aligners` specifies which aligners should be used. For example, `--aligners=blat,star,bowtie2,bwa` forces FusionCatcher to use all aligners for finding fusion genes.
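When FusionCatcher is started from a script, the aligner selection can be checked before the run. The following is a minimal sketch that only assembles the argument list and does not start the program. It uses the options `-d`, `-i`, `-o` and `--aligners` described in this manual, while the function name `build_command` and the paths are placeholders chosen for illustration.

```python
import shlex
from typing import List, Sequence

# Aligner names accepted by '--aligners' according to the option description.
ALLOWED_ALIGNERS = ("blat", "star", "bowtie2", "bwa")


def build_command(data_dir: str, input_path: str, output_dir: str,
                  aligners: Sequence[str] = ()) -> List[str]:
    """Assemble a fusioncatcher argument list; Bowtie is always used by the tool."""
    unknown = [name for name in aligners if name not in ALLOWED_ALIGNERS]
    if unknown:
        raise ValueError("unknown aligner(s): " + ", ".join(unknown))
    command = ["fusioncatcher", "-d", data_dir, "-i", input_path, "-o", output_dir]
    if aligners:
        # Comma-separated list, as in '--aligners=blat,star,bowtie2,bwa'
        command.append("--aligners=" + ",".join(aligners))
    return command


if __name__ == "__main__":
    cmd = build_command("/some/human/data/directory/",
                        "/some/input/directory/containing/fastq/files/",
                        "/some/output/directory/",
                        aligners=["blat", "star"])
    print(shlex.join(cmd))  # the list could be handed to subprocess.run(cmd)
```

Returning a list rather than a single string avoids shell quoting problems if a directory name contains spaces.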
The use of Bowtie and Blat aligners is the default approach of FusionCatcher for finding fusion genes. In order not to use this approach the command line option `--skip-blat` should be added (or the string `blat` should be removed from line `aligners` in file `fusioncatcher/etc/configuration.cfg`), as follows:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-blat
Please, read the license of Blat aligner before using this approach in order to see if you may use Blat! FusionCatcher will use Blat aligner when using this approach!
The use of Bowtie and STAR aligners is the default approach of FusionCatcher for finding fusion genes. In order not to use this approach the command line option `--skip-star` should be added, as follows:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-star
The use of Bowtie and Bowtie2 aligners is not the default approach of FusionCatcher for finding fusion genes. In order not to use this approach the command line option `--skip-bowtie2` should be added, as follows:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-bowtie2
In order to use this approach the command line option `--aligners` should contain the string `bowtie2`, for example:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--aligners blat,star,bowtie2
The use of Bowtie and BWA aligners is not the default approach of FusionCatcher for finding fusion genes. In order not to use this approach the command line option `--skip-bwa` should be added, as follows:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--skip-bwa
In order to use this approach the command line option `--aligners` should contain the string `bwa`, for example:
fusioncatcher \
-d /some/human/data/directory/ \
-i /some/input/directory/containing/fastq/files/ \
-o /some/output/directory/ \
--aligners blat,star,bwa
FusionCatcher searches for fusion genes and/or translocations in RNA-seq data (paired-end reads FASTQ files produced by Illumina next-generation sequencing platforms like Illumina Solexa and Illumina HiSeq) in diseased samples. Its command line is:
fusioncatcher [options]
and the command line options are:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILENAME, --input=INPUT_FILENAME
The input file(s) or directory. The files should be in
FASTQ or SRA format and may be or not compressed using
gzip or zip. A list of files can be specified by given
the filenames separated by comma. If a directory is
given then it will analyze all the files found with
the following extensions: .sra, .fastq, .fastq.zip,
.fastq.gz, .fastq.bz2, fastq.xz, .fq, .fq.zip, .fq.gz,
.fq.bz2, fz.xz, .txt, .txt.zip, .txt.gz, .txt.bz2 .
--batch If this is used then batch mode is used and the input
specified using '--input' or '-i' is: (i) a tab-
separated text file containing a each line such that
there is one sample per line and first column are the
FASTQ files' full pathnames/URLs, separated by commas,
corresponding to the sample and an optional second
column containing the name for the sample, or (ii) a
input directory which contains a several
subdirectories such that each subdirectory corresponds
to only one sample and it contains all the FASTQ files
corresponding to that sample. This is useful when
several samples needs to be analyzed.
--single-end If this is used then it is assumed that all the input
reads are single-end reads which must be longer than
130 bp. Be default it is assumed that all input reads
come from a paired-end reads.
-I NORMAL_MATCHED_FILENAME, --normal=NORMAL_MATCHED_FILENAME
The input file(s) or directory containing the healthy
normal-matched data. They should be given in the same
format as for '--input'. In case that this option is
used then the files/directory given to '--input' is
considered to be from the sample of a patient with
disease. This is optional.
-o OUTPUT_DIRECTORY, --output=OUTPUT_DIRECTORY
The output directory where all the output files
containing information about the found candidate
fusiongenes are written. Default is 'none'.
-d DATA_DIRECTORY, --data=DATA_DIRECTORY
The data directory where all the annotations files
from Ensembl database are placed, e.g. 'data/'. This
directory should be built using 'fusioncatcher-build'.
If it is not used then it is read from configuration
file specified with '--config' from 'data = ...' line.
-T TMP_DIRECTORY, --tmp=TMP_DIRECTORY
The temporary directory where all the output files
and directories will be written. Default is directory
'tmp' in the output directory specified with
'--output'.
-p PROCESSES, --threads=PROCESSES
Number of processes/threads to be used for running
SORT, Bowtie, BLAT, STAR, BOWTIE2 and other
tools/programs. If it is 0 (as it is by default) then
the number of processes/threads is first read from
the 'fusioncatcher/etc/configuration.cfg' file. If it
is set to 0 there as well then
'min(number-of-CPUs-found,16)' processes are used.
Setting the number of threads in
'fusioncatcher/etc/configuration.cfg' might be useful
when one server is shared between several users, in
order to prevent FusionCatcher from using all the
CPUs/resources. Default is '0'.
--config=CONFIGURATION_FILENAME
Configuration file containing the paths to external
tools (e.g. Bowtie, Blat, fastq-dump) in case that
they are not specified in PATH variable! Default is
'/apps/fusioncatcher/etc/configuration.cfg,/apps/fusioncatcher/bin/configuration.cfg'.
-z, --skip-update-check
Skips the automatic routine that contacts the
FusionCatcher server to check for a more recent
version. Default is 'False'.
-V, --keep-viruses-alignments
If it is set then the SAM alignment files of reads
mapping on virus genomes are saved in the output
directory for later inspection by the user. Default is
'False'.
-U, --keep-unmapped-reads
If it is set then the FASTQ files, containing the
unmapped reads (i.e. reads which do not map on genome
and transcriptome), are saved in the output directory
for later inspection by the user. Default is 'False'.
--aligners=ALIGNERS The aligners to be used in addition to the Bowtie
aligner. The BOWTIE aligner is always used by default
and it cannot be disabled. The choices are:
['blat','star','bowtie2','bwa']. Any combination of
these is accepted if the aligners' names are comma
separated. For example, if one wants to use all four
aligners then 'blat,star,bowtie2,bwa' should be given.
The command line options '--skip-blat', '--skip-star',
and '--skip-bowtie2' have priority over this option.
If the first element in the list is the configuration
file (that is the '.cfg' file) of FusionCatcher then
the aligners listed in the configuration file are used
(and the rest of the aligners specified here are
ignored). In case that the configuration file is not
found then the aligners which follow it in the list
are used. Default is
'/apps/fusioncatcher/etc/configuration.cfg,blat,star'.
--skip-blat If it is set then the pipeline will NOT use the BLAT
aligner and all options and methods which make use of
BLAT will be disabled. BLAT aligner is used by
default. Please, note that BLAT license does not allow
BLAT to be used for commercial activities. For more
information regarding BLAT please see its license:
<http://users.soe.ucsc.edu/~kent/src/>. Default is
'False'.
--skip-star If it is set then the pipeline will NOT use the STAR
aligner and all options and methods which make use of
STAR will be disabled. STAR aligner is used by
default. Default is 'False'.
--sort-buffer-size=SORT_BUFFER_SIZE
It specifies the buffer size for the SORT command.
Default is '80%' if less than 32 GB of RAM is
installed, otherwise it is set to 26 GB.
--start=START_STEP It re-starts executing the workflow/pipeline from the
given step number. This can be used when the pipeline
has crashed/stopped and one wants to re-run it from
the step where it stopped without re-running the
entire pipeline from the beginning. 0 is for restarting
automatically and 1 is the first step. Default is '0'.
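
The fallback rules described above for '--threads' and '--aligners' can be sketched in Python as follows. This is only an illustration of the documented behaviour and not FusionCatcher's actual code: the function names, the parameters config_threads and config_aligners (standing in for the values read from the configuration file) and the parameter skipped (standing in for the '--skip-blat', '--skip-star' and '--skip-bowtie2' options) are assumptions made for this example.

```python
from typing import Iterable, List, Optional


def resolve_threads(cli_threads: int, config_threads: int, cpus_found: int) -> int:
    """Pick the number of threads following the documented fallback order."""
    if cli_threads > 0:
        # A non-zero '--threads' value is used as given.
        return cli_threads
    if config_threads > 0:
        # Otherwise the value from the configuration file is used.
        return config_threads
    # If both are 0, fall back to min(number-of-CPUs-found, 16).
    return min(cpus_found, 16)


def resolve_aligners(option: str,
                     config_aligners: Optional[List[str]] = None,
                     skipped: Iterable[str] = ()) -> List[str]:
    """Return the aligners to run in addition to Bowtie.

    option          -- the comma-separated value given to '--aligners'
    config_aligners -- aligners listed in the configuration file, or None
                       if that file was not found (hypothetical parameter)
    skipped         -- aligners disabled through the '--skip-...' options
    """
    names = [name.strip() for name in option.split(",") if name.strip()]
    if names and names[0].endswith(".cfg"):
        if config_aligners is not None:
            # Configuration file found: its list replaces the rest.
            names = list(config_aligners)
        else:
            # Configuration file not found: use the aligners that follow it.
            names = names[1:]
    # The '--skip-...' options have priority over '--aligners'.
    skip = set(skipped)
    return [name for name in names if name not in skip]
```

For example, with the default '--aligners' value and a configuration file that cannot be found, resolve_aligners gives 'blat' and 'star', and passing skipped=['blat'] leaves only 'star'. In real use, cpus_found could be taken from os.cpu_count().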
fusioncatcher-build downloads the necessary data for a given organism from the Ensembl database and builds the files/indexes which are needed to run FusionCatcher. Its command line is:
fusioncatcher-build [options]
and the command line options are:
--version show program's version number and exit
-h, --help show this help message and exit
-o OUTPUT_DIRECTORY, --output=OUTPUT_DIRECTORY
The output directory where all the output files and
directories will be written.
-c CONFIGURATION_FILENAME, --config=CONFIGURATION_FILENAME
Configuration file containing the paths to external
tools (e.g. Bowtie, etc.) in case that they are not in
PATH! Default is
'/apps/fusioncatcher/bin/../etc/configuration.cfg,/apps/fusioncatcher/bin/configuration.cfg'.
-g ORGANISM, --organism=ORGANISM
Organism for which the data is downloaded from Ensembl
database and built, for example: 'homo_sapiens',
'mus_musculus', 'rattus_norvegicus',
'canis_familiaris', etc. Default is 'homo_sapiens'.
-w WEB_ENSEMBL, --web=WEB_ENSEMBL
Ensembl database web site from where the data is
downloaded. e.g. 'www.ensembl.org',
'uswest.ensembl.org', 'useast.ensembl.org',
'asia.ensembl.org', etc. Default is 'www.ensembl.org'.
-e FTP_ENSEMBL, --ftp-ensembl=FTP_ENSEMBL
Ensembl database FTP site from where the data is
downloaded. Default is 'ftp.ensembl.org'.
--ftp-ensembl-path=FTP_ENSEMBL_PATH
The path for Ensembl database FTP site from where the
data is downloaded.
-x FTP_UCSC, --ftp-ucsc=FTP_UCSC
UCSC database FTP site from where the data is
downloaded. Default is 'hgdownload.cse.ucsc.edu'.
-n FTP_NCBI, --ftp-ncbi=FTP_NCBI
NCBI database FTP site from where the data is
downloaded. Default is 'ftp.ncbi.nlm.nih.gov'.
--skip-blat If it is set then the pipeline will NOT use the BLAT
aligner and all options and methods which make use of
BLAT will be disabled. BLAT aligner is used by
default. Please, note that BLAT license does not allow
BLAT to be used for commercial activities. For more
information regarding BLAT please see its license:
<http://users.soe.ucsc.edu/~kent/src/>. Default is
'False'.
--enlarge-genes If it is set then the genes are enlarged (i.e. their
introns are also included in the transcriptome). Default is
'False'.
-p PROCESSES, --threads=PROCESSES
Number of processes/threads to be used. Default is
'0'.
--skip-database=SKIP_DATABASE
If it is set then the pipeline will skip the specified
database(s). The choices are
['cosmic','conjoing','chimerdb2','ticdb','cgp','cacg'].
If several databases should be skipped, then their
names shall be separated by commas. Default is ''.
-s START_STEP, --start=START_STEP
It starts executing the workflow from the given step
number. This can be used when the pipeline has
crashed/stopped and one wants to re-run it from
the step where it stopped without re-running the
entire pipeline from the beginning. 0 is for restarting
automatically and 1 is the first step. This is
intended to be used for debugging. Default is '0'.
-l HASH, --hash=HASH Hash to be used for computing checksum. The choices
are ['no','crc32','md5','adler32','sha512','sha256'].
If it is set to 'no' then no checksum is used and
the entire pipeline is executed as a normal shell
script. For more information see 'hash_library' in
'workflow.py'. This is intended to be used for
debugging. Default is 'no'.
-k, --keep Preserve intermediate files produced during the run.
By default, they are deleted upon exit. This is
intended to be used for debugging. Default value is
'False'.
-u CHECKSUMS_FILENAME, --checksums=CHECKSUMS_FILENAME
The name of the checksums file. This is intended to be
used for debugging. Default value is 'checksums.txt'.
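
As an illustration of how the options above fit together, the following Python sketch assembles a 'fusioncatcher-build' command line from a few of them. It is a minimal sketch and not part of FusionCatcher: the function build_command and the constant SKIPPABLE_DATABASES are hypothetical names introduced here, and only the options '-g', '-o', '-p' and '--skip-database' described above are used.

```python
from typing import List, Sequence

# Database names accepted by '--skip-database', copied from the help text.
SKIPPABLE_DATABASES = ("cosmic", "conjoing", "chimerdb2", "ticdb", "cgp", "cacg")


def build_command(output_dir: str,
                  organism: str = "homo_sapiens",
                  skip_databases: Sequence[str] = (),
                  threads: int = 0) -> List[str]:
    """Return the argument list for one 'fusioncatcher-build' run."""
    # Reject names which are not among the documented choices.
    unknown = [name for name in skip_databases if name not in SKIPPABLE_DATABASES]
    if unknown:
        raise ValueError("unknown database(s): " + ",".join(unknown))

    command = ["fusioncatcher-build", "-g", organism, "-o", output_dir]
    if skip_databases:
        # Several databases are given as one comma-separated value.
        command.append("--skip-database=" + ",".join(skip_databases))
    if threads > 0:
        # 0 is the documented default, so '-p' is added only when needed.
        command.extend(["-p", str(threads)])
    return command


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(build_command("data/", skip_databases=["cosmic"], threads=4)))
```

The returned list can be passed to subprocess.run() once 'fusioncatcher-build' is available in PATH. Building the command as a list rather than as one string avoids quoting problems with directory names which contain spaces.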
The main goal of FusionCatcher is to find somatic (and/or pathogenic) fusion genes in RNA-seq data.
FusionCatcher does its own quality filtering/trimming of reads. This is needed because a very important factor for finding fusion genes in an RNA-seq experiment is the length of the RNA fragments. Ideally, the RNA fragment size for finding fusion genes should be over 300 bp. Most RNA-seq experiments are designed for differential expression analyses and not for finding fusion genes, so the RNA fragment size is often less than 300 bp, and the trimming and quality filtering should be done in such a way that it does not reduce the RNA fragment size even further.
FusionCatcher is able to find fusion genes even in cases where the fusion junction is within a known exon or within a known intron (for example in the middle of an intron), due to the use of the BLAT aligner. The minimum condition for FusionCatcher to find a fusion gene is that both genes involved in the fusion are annotated in the Ensembl database (even if their gene structure is “wrong”).
FusionCatcher spends most of its computational analysis on the most promising candidate fusion genes and tries to filter out, as early as possible, the candidate fusion genes which do not look promising.
FusionCatcher uses three aligners by default for mapping the reads: Bowtie, BLAT, and STAR. STAR is used here only for “splitting” the reads while aligning them.