Download Sources

javild edited this page Apr 12, 2016 · 11 revisions

(Deprecated: currently under construction)

This tutorial first guides you through downloading a set of raw files from several data sources. These raw files contain the core data that will populate the CellBase knowledgebase. The tutorial then shows how to build the JSON documents that are loaded into the CellBase knowledgebase. We have already processed all of these data, and the resulting JSON documents are available from our FTP server. If you want to skip both steps, download the JSON documents directly from ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/ and jump to the Load Data Models tutorial. For example, the human data models for gene, genome sequence, variation, conservation and protein function prediction scores can be downloaded from:

ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/gene.json.gz
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/genome_sequence.json.gz
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/variation_chr*.json.gz
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/conservation_*.json
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/prot_func_pred_chr_*.json.gz
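The list above can be generated rather than typed by hand. The following is a minimal sketch: the base URL and file patterns are copied from this page, while the function name `cellbase_model_urls` is a hypothetical helper and not part of CellBase. It only prints URLs. Whether the server still serves these paths, and whether your download tool expands the `*` patterns over FTP, are assumptions you should check.

```sh
#!/bin/sh
# Sketch: print the URLs of the pre-built data models listed on this page.
# cellbase_model_urls is a hypothetical helper name, not a CellBase command.

BASE_URL="ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase"

cellbase_model_urls() (
    version="$1"   # e.g. v3
    species="$2"   # e.g. homo_sapiens
    # File patterns exactly as listed above; '*' is left for the
    # download tool to expand (patterns are quoted so the shell does not).
    for pattern in \
        'gene.json.gz' \
        'genome_sequence.json.gz' \
        'variation_chr*.json.gz' \
        'conservation_*.json' \
        'prot_func_pred_chr_*.json.gz'
    do
        printf '%s/%s/%s/%s\n' "$BASE_URL" "$version" "$species" "$pattern"
    done
)

# Possible usage (not run here; assumes a wget build with FTP globbing):
#   cellbase_model_urls v3 homo_sapiens |
#       while read -r url; do wget -P /tmp/cellbase_json "$url"; done
```

Keeping the version and species as parameters makes it easy to point the same loop at another species directory, provided that directory exists on the server.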

Users who want to build the CellBase knowledgebase from scratch should follow the sections below.

Download data sources

Downloads are run through the CellBase CLI. To list the available options:

prompt$ /tmp/cellbase/cellbase-app/build/bin/cellbase.sh download --help

Three main datasets will be downloaded for the human genome: genome sequence, gene annotation and variant annotation. Using the download command of the cellbase.sh script, a full command line could look like this:

prompt$ /tmp/cellbase/cellbase-app/build/bin/cellbase.sh download -o /tmp/downloadTest --sequence --variation --gene --species "Homo sapiens"
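If you only need some of the datasets, the command above can be wrapped so that the flags are chosen per run. This is a rough sketch, not part of CellBase: `run_cellbase_download` and the `CELLBASE_SH` variable are hypothetical names, and the wrapper accepts only the three dataset flags that appear in the example command (`--sequence`, `--variation`, `--gene`). Run `cellbase.sh download --help` to confirm what your build supports.

```sh
#!/bin/sh
# Sketch: run the download step for a chosen subset of datasets.
# CELLBASE_SH defaults to the path used on this page; override it for your build.
# run_cellbase_download is a hypothetical wrapper, not a CellBase command.

CELLBASE_SH="${CELLBASE_SH:-/tmp/cellbase/cellbase-app/build/bin/cellbase.sh}"

run_cellbase_download() (
    if [ "$#" -lt 3 ]; then
        echo "usage: run_cellbase_download OUTDIR SPECIES DATASET..." >&2
        exit 2
    fi
    outdir="$1"
    species="$2"
    shift 2
    flags=""
    for dataset in "$@"; do
        case "$dataset" in
            # Only the flags shown in the example command are accepted.
            sequence|variation|gene) flags="$flags --$dataset" ;;
            *) echo "unknown dataset: $dataset" >&2; exit 2 ;;
        esac
    done
    # $flags is deliberately unquoted so it splits into separate arguments;
    # the species name stays quoted because it contains a space.
    "$CELLBASE_SH" download -o "$outdir" $flags --species "$species"
)

# Possible usage (not run here):
#   run_cellbase_download /tmp/downloadTest "Homo sapiens" sequence gene
```

Rejecting unknown dataset names before calling the script avoids starting a long download with a mistyped flag.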

The files are large, so the download may take anywhere from a few minutes to about an hour. The downloaded data should look like this:

/tmp/downloadTest/
└── homo_sapiens
    ├── gene
    │   ├── gene_extra_info_cellbase.log
    │   └── protein_function_prediction_matrices.log
    ├── sequence
    │   ├── genome_info.log
    │   ├── Homo_sapiens.GRCh37.p13.fa.gz
    │   └── Homo_sapiens.GRCh37.p13.fa.gz.log
    └── variation
        ├── allele_code.txt.gz
        ├── allele_code.txt.gz.log
        ├── allele.txt.gz
        ├── allele.txt.gz.log
        ├── attrib.txt.gz
        ├── attrib.txt.gz.log
        ├── attrib_type.txt.gz
        ├── attrib_type.txt.gz.log
        ├── genotype_code.txt.gz
        ├── genotype_code.txt.gz.log
        ├── motif_feature_variation.txt.gz
        ├── motif_feature_variation.txt.gz.log
        ├── phenotype_feature_attrib.txt.gz
        ├── phenotype_feature_attrib.txt.gz.log
        ├── phenotype_feature.txt.gz
        ├── phenotype_feature.txt.gz.log
        ├── phenotype.txt.gz
        ├── phenotype.txt.gz.log
        ├── population_genotype.txt.gz
        ├── population_genotype.txt.gz.log
        ├── population.txt.gz
        ├── population.txt.gz.log
        ├── seq_region.txt.gz
        ├── seq_region.txt.gz.log
        ├── source.txt.gz
        ├── source.txt.gz.log
        ├── structural_variation_feature.txt.gz
        ├── structural_variation_feature.txt.gz.log
        ├── study.txt.gz
        ├── study.txt.gz.log
        ├── transcript_variation.txt.gz
        ├── transcript_variation.txt.gz.log
        ├── variation_feature.txt.gz
        ├── variation_feature.txt.gz.log
        ├── variation_synonym.txt.gz
        ├── variation_synonym.txt.gz.log
        ├── variation.txt.gz
        └── variation.txt.gz.log
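A quick sanity check on the result can catch an interrupted download before the build step. The sketch below assumes, based only on the listing above, that the species directory contains `gene`, `sequence` and `variation` subdirectories and that every `.gz` file has a companion `.log` file. It also runs `gzip -t` on each archive to detect truncated files. `check_download_dir` is a hypothetical helper name, and the check says nothing about what the `.log` files contain.

```sh
#!/bin/sh
# Sketch: verify that a download directory matches the layout shown above.
# check_download_dir is a hypothetical helper, not a CellBase command.

check_download_dir() (
    species_dir="$1"   # e.g. /tmp/downloadTest/homo_sapiens
    status=0
    for sub in gene sequence variation; do
        [ -d "$species_dir/$sub" ] || { echo "missing directory: $sub"; status=1; }
    done
    for f in "$species_dir"/*/*.gz; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Assumption from the listing: each archive has a matching .log file.
        [ -f "$f.log" ] || { echo "missing log: $f.log"; status=1; }
        # gzip -t reads the whole archive and fails on truncated or corrupt data.
        gzip -t "$f" 2>/dev/null || { echo "corrupt or truncated: $f"; status=1; }
    done
    exit "$status"
)

# Possible usage (not run here):
#   check_download_dir /tmp/downloadTest/homo_sapiens && echo "download looks complete"
```

Testing every archive with `gzip -t` is slow on large files, but it is far cheaper than discovering a truncated file partway through the build.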

If the download was successful, you can proceed to build the JSON objects that will be loaded into the corresponding database: Build & Load Data.