Data repositories

To avoid duplicating genome, annotation, additional annotation and genome index files, Eoulsan supports data repositories. They are especially useful for the genome indexes used in the mapping step, whose computation takes a long time for large genomes. The genome index repository stores the result of an index computation so that later analyses using the same genome can reuse it.

Genome, annotation and additional annotation repositories

These repositories are configured in much the same way. You must define the root path of each repository by setting the following global parameters (in the configuration file or in the globals section of the workflow file):

Parameter                                Type    Description
main.genome.storage.path                 string  Path to the genomes repository
main.gff.storage.path                    string  Path to the GFF annotations repository
main.gtf.storage.path                    string  Path to the GTF annotations repository
main.additional.annotation.storage.path  string  Path to the additional annotations repository

The repository paths can be URLs (e.g. on a web server or an FTP server).

The following example shows the content of a genome repository. Symbolic links make it possible to define several aliases for the same genome.

-rw-r--r-- 1 nobody nobody   4123941 2010-02-15 15:45 mouse-37.fasta.bz2
lrwxrwxrwx 1 nobody nobody        16 2011-12-25 17:42 mouse -> mouse-37.fasta.bz2
-rw-r--r-- 1 nobody nobody   4123327 2010-02-15 15:45 mouse-36.fasta.bz2
-rw-r--r-- 1 nobody nobody 513422555 2012-01-09 17:04 hg19.fasta.bz2
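
As an illustration of the alias mechanism, the following minimal Python sketch creates a symbolic link like the `mouse` entry in the listing above. The helper name `add_alias` and the temporary directory standing in for the repository are assumptions made for the example; they are not part of Eoulsan.

```python
import os
import tempfile


def add_alias(repository_dir, target_name, alias_name):
    """Create alias_name in repository_dir as a symbolic link to target_name."""
    link_path = os.path.join(repository_dir, alias_name)
    # A relative target keeps the alias valid if the repository is moved.
    os.symlink(target_name, link_path)
    return link_path


# Throwaway directory standing in for a genome repository (assumption).
demo_repository = tempfile.mkdtemp()
with open(os.path.join(demo_repository, "mouse-37.fasta.bz2"), "wb"):
    pass  # empty placeholder for the genome file

alias_path = add_alias(demo_repository, "mouse-37.fasta.bz2", "mouse")
print(alias_path, "->", os.readlink(alias_path))
```

On a real repository the same result is usually obtained with `ln -s` from the shell.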

To access the repositories from the design file, use the dedicated protocols:

Repository type        Protocol              Protocol usage
genome                 genome                genome://<genome name> (e.g. genome://mouse-37)
GFF annotation         gff                   gff://<annotation name> (e.g. gff://mouse-37)
GTF annotation         gtf                   gtf://<annotation name> (e.g. gtf://mouse-37)
additional annotation  additionalannotation  additionalannotation://<annotation name> (e.g. additionalannotation://mouse-37)

File extensions (e.g. .fasta, .gff) and compression extensions must be omitted from genome and annotation URLs. Eoulsan automatically adds the file extension and checks whether a compressed file exists in the repository.
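
The lookup described above can be pictured with the following Python sketch. It is not Eoulsan's implementation: the function name `resolve_repository_url`, the extension table and the list of compression suffixes are assumptions chosen for illustration, and only local repository directories are handled. The additional annotation protocol is left out because its file extension is not stated here.

```python
import os

# Assumed file extension per protocol; the lists Eoulsan really uses may differ.
EXTENSIONS = {
    "genome": ".fasta",
    "gff": ".gff",
    "gtf": ".gtf",
}

# Assumed compression suffixes; "" means an uncompressed file.
COMPRESSIONS = ["", ".gz", ".bz2"]


def resolve_repository_url(url, repository_roots):
    """Map e.g. 'genome://mouse-37' to a file under the matching repository root.

    repository_roots maps a protocol name to a local directory, i.e. the
    value of the corresponding *.storage.path parameter.
    """
    protocol, separator, name = url.partition("://")
    if not separator or protocol not in EXTENSIONS:
        raise ValueError("unsupported repository URL: " + url)

    root = repository_roots[protocol]
    # Add the file extension, then try each compression suffix in turn.
    for compression in COMPRESSIONS:
        candidate = os.path.join(root, name + EXTENSIONS[protocol] + compression)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError("no file found in repository for " + url)
```

With a repository containing `mouse-37.fasta.bz2`, calling `resolve_repository_url("genome://mouse-37", {"genome": "/path/to/genomes"})` would return the path of the compressed file; the directory name is a placeholder.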

Genome index repository

Unlike the previous repositories, the genome index repository has no dedicated protocol. The only user of this repository is the genome index creation step. When a genome index must be computed, this step checks whether an index has already been computed for this genome and mapper. If so, the previously computed index is reused; otherwise, the genome index is computed and then stored for later use.
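
The check-then-compute behaviour described above follows a common caching pattern, sketched below in Python. Everything specific in it is hypothetical: the file naming scheme, the `build_index` callable and the single-file index are assumptions for the example and do not reflect how Eoulsan actually names or stores its indexes.

```python
import os


def get_genome_index(genome_name, mapper_name, repository_dir, build_index):
    """Return (index_path, reused) for the given genome and mapper.

    build_index is a hypothetical callable taking (genome_name, mapper_name)
    and returning the index content as bytes.
    """
    # Hypothetical naming scheme: one file per (mapper, genome) pair.
    index_path = os.path.join(
        repository_dir, mapper_name + "-" + genome_name + ".index"
    )

    if os.path.exists(index_path):
        # An index was already computed for this genome and mapper: reuse it.
        return index_path, True

    # No stored index yet: compute it, then store it for later analyses.
    index_bytes = build_index(genome_name, mapper_name)
    with open(index_path, "wb") as output:
        output.write(index_bytes)
    return index_path, False
```

The second call with the same genome and mapper returns the stored file without invoking `build_index` again, which is where the time saving for large genomes comes from.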

To use the genome index repository, you only need to define the following global parameter (in the configuration file or in the globals section of the workflow file):

Parameter                              Type    Description
main.genome.mapper.index.storage.path  string  Path to the genome indexes repository

Note: The path to the genome indexes repository cannot be a URL. The path must be writable by the user so that Eoulsan can store genome indexes.
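
Before launching an analysis, the two constraints in this note can be verified with a small script. The Python sketch below is only an illustration; the function name `check_writable_local_path` is an assumption, and the URL test (looking for `://`) is a simplification.

```python
import os
import tempfile


def check_writable_local_path(path):
    """Return the absolute path if it is a local, writable directory."""
    if "://" in path:
        # Simplified URL detection: anything with a scheme separator is rejected.
        raise ValueError("repository path must not be a URL: " + path)
    if not os.path.isdir(path):
        raise NotADirectoryError("repository directory not found: " + path)
    if not os.access(path, os.W_OK):
        raise PermissionError("repository directory is not writable: " + path)
    return os.path.abspath(path)


# A freshly created temporary directory passes the check.
checked_path = check_writable_local_path(tempfile.mkdtemp())
print(checked_path)
```

The same check applies to any repository that Eoulsan must write to.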

Genome description repository

The genome description file contains basic information about the genome, such as the chromosome lengths. This file is useful for creating valid SAM/BAM files and for using the genome index repository. It is created from the genome file at each new analysis. However, creating it takes a long time for large genomes (such as the mouse or human genome) when the genome file is compressed. The genome description repository avoids parsing the genome sequence again once it has been parsed in a previous analysis.
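
To give an idea of the work that the repository saves, the Python sketch below computes the sequence lengths of a FASTA file (optionally compressed) and stores them so that a later run can skip the parsing. It is a sketch under stated assumptions: the JSON storage format, the file naming and the function names are invented for the example and are not the format Eoulsan uses for genome descriptions.

```python
import bz2
import gzip
import json
import os


def open_fasta(path):
    """Open a FASTA file as text, decompressing it if needed."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def sequence_lengths(fasta_path):
    """Return a dict mapping each sequence name to its length."""
    lengths = {}
    name = None
    with open_fasta(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # The sequence name is the first word of the header line.
                words = line[1:].split()
                name = words[0] if words else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def load_or_create_description(fasta_path, repository_dir):
    """Reuse a stored description if present, else parse the genome and store it."""
    # Hypothetical naming scheme and JSON format, for illustration only.
    description_path = os.path.join(
        repository_dir, os.path.basename(fasta_path) + ".desc.json"
    )
    if os.path.exists(description_path):
        with open(description_path, "r") as stored:
            return json.load(stored)

    lengths = sequence_lengths(fasta_path)
    with open(description_path, "w") as output:
        json.dump(lengths, output)
    return lengths
```

The whole genome file must be read, and decompressed if needed, to obtain these lengths, which is why reusing a stored description is worthwhile for large genomes.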

To use the genome description repository, you only need to define the following global parameter (in the configuration file or in the globals section of the workflow file):

Parameter                      Type    Description
main.genome.desc.storage.path  string  Path to the genome descriptions repository

Note: The path to the genome descriptions repository cannot be a URL. The path must be writable by the user so that Eoulsan can store genome descriptions.