Download Sources
This tutorial first guides you through downloading a set of raw files from several data sources. These raw files contain the core data that populates the CellBase knowledgebase. The tutorial then shows how to build the JSON documents that are loaded into the CellBase knowledgebase. We have already processed all of this data, and the resulting JSON documents are available from our FTP server. If you want to skip both sections below, you can download the JSON documents directly from ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/ and jump to the Load Data Models tutorial.
For example, the human data models for gene, genome sequence, variation, conservation and protein change prediction scores can be downloaded from:
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/gene.json.gz
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/genome_sequence.json.gz
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/variation_chr*.json.gz
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/conservation_*.json
ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens/prot_func_pred_chr_*.json.gz
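The files listed above could be fetched in one go with a small script. The following is only a sketch: it assumes GNU `wget` 1.x is installed (it expands the `*` wildcards itself for FTP URLs), that the FTP paths above are still served, and it uses a hypothetical destination directory, `/tmp/cellbase_json`. By default it only prints the URLs; set `DRY_RUN=0` to actually download.

```shell
#!/bin/sh
# Sketch: fetch the prebuilt human JSON data models listed above.
set -eu

BASE_URL="ftp://ftp.ebi.ac.uk/pub/databases/eva/opencb/cellbase/v3/homo_sapiens"
OUT_DIR="${OUT_DIR:-/tmp/cellbase_json}"   # hypothetical destination, change as needed

# Print one URL per file pattern. The patterns are quoted so the local shell
# does not expand the wildcards; wget expands them against the FTP listing.
list_urls() {
    for pattern in \
        'gene.json.gz' \
        'genome_sequence.json.gz' \
        'variation_chr*.json.gz' \
        'conservation_*.json' \
        'prot_func_pred_chr_*.json.gz'
    do
        printf '%s/%s\n' "$BASE_URL" "$pattern"
    done
}

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show what would be downloaded.
    list_urls
else
    mkdir -p "$OUT_DIR"
    list_urls | while IFS= read -r url; do
        # --continue resumes partially downloaded files on a re-run.
        wget --continue --directory-prefix="$OUT_DIR" "$url"
    done
fi
```

Running it as `DRY_RUN=0 OUT_DIR=/data/cellbase sh fetch_models.sh` (the script name is arbitrary) would place the files under `/data/cellbase`.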
For users who want to build the CellBase knowledgebase from scratch, please follow the sections below.
Downloads are run through the CellBase CLI:
prompt$ /tmp/cellbase/cellbase-app/build/bin/cellbase.sh download --help
Three main datasets are downloaded for the human genome: genome sequence, gene annotation and variant annotation. Using the download command of the cellbase.sh script, a full command line could be:
prompt$ /tmp/cellbase/cellbase-app/build/bin/cellbase.sh download -o /tmp/downloadTest --sequence --variation --gene --species "Homo sapiens"
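For repeated runs it can help to wrap the command above in a small script. The sketch below uses only the options shown in the command line above; the variable names (`CELLBASE_SH`, `OUT_DIR`, `SPECIES`) are our own, and creating the output directory beforehand is a precaution, not a documented requirement of the CLI.

```shell
#!/bin/sh
# Sketch: parameterised wrapper around the download command shown above.
set -eu

CELLBASE_SH="${CELLBASE_SH:-/tmp/cellbase/cellbase-app/build/bin/cellbase.sh}"
OUT_DIR="${OUT_DIR:-/tmp/downloadTest}"
SPECIES="${SPECIES:-Homo sapiens}"

# Show the command line that would be executed.
print_command() {
    printf '%s download -o %s --sequence --variation --gene --species "%s"\n' \
        "$CELLBASE_SH" "$OUT_DIR" "$SPECIES"
}

# Run the download with the same options as the example command.
run_download() {
    mkdir -p "$OUT_DIR"   # precaution; the CLI may create it itself
    "$CELLBASE_SH" download -o "$OUT_DIR" --sequence --variation --gene --species "$SPECIES"
}

if [ -x "$CELLBASE_SH" ]; then
    run_download
else
    # CellBase is not built at the expected path: report it and show the command.
    echo "cellbase.sh not found or not executable at: $CELLBASE_SH" >&2
    print_command
fi
```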
The files are large, so the download may take anywhere from a few minutes to about an hour. The downloaded data should look like this:
/tmp/downloadTest/
└── homo_sapiens
    ├── gene
    │   ├── gene_extra_info_cellbase.log
    │   └── protein_function_prediction_matrices.log
    ├── sequence
    │   ├── genome_info.log
    │   ├── Homo_sapiens.GRCh37.p13.fa.gz
    │   └── Homo_sapiens.GRCh37.p13.fa.gz.log
    └── variation
        ├── allele_code.txt.gz
        ├── allele_code.txt.gz.log
        ├── allele.txt.gz
        ├── allele.txt.gz.log
        ├── attrib.txt.gz
        ├── attrib.txt.gz.log
        ├── attrib_type.txt.gz
        ├── attrib_type.txt.gz.log
        ├── genotype_code.txt.gz
        ├── genotype_code.txt.gz.log
        ├── motif_feature_variation.txt.gz
        ├── motif_feature_variation.txt.gz.log
        ├── phenotype_feature_attrib.txt.gz
        ├── phenotype_feature_attrib.txt.gz.log
        ├── phenotype_feature.txt.gz
        ├── phenotype_feature.txt.gz.log
        ├── phenotype.txt.gz
        ├── phenotype.txt.gz.log
        ├── population_genotype.txt.gz
        ├── population_genotype.txt.gz.log
        ├── population.txt.gz
        ├── population.txt.gz.log
        ├── seq_region.txt.gz
        ├── seq_region.txt.gz.log
        ├── source.txt.gz
        ├── source.txt.gz.log
        ├── structural_variation_feature.txt.gz
        ├── structural_variation_feature.txt.gz.log
        ├── study.txt.gz
        ├── study.txt.gz.log
        ├── transcript_variation.txt.gz
        ├── transcript_variation.txt.gz.log
        ├── variation_feature.txt.gz
        ├── variation_feature.txt.gz.log
        ├── variation_synonym.txt.gz
        ├── variation_synonym.txt.gz.log
        ├── variation.txt.gz
        └── variation.txt.gz.log
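Before moving on, it may be worth checking that the download is complete. The sketch below is not part of CellBase: `check_download` is a hypothetical helper that checks for the three subdirectories shown in the tree above and runs `gzip -t` on every `.gz` file to detect truncated archives. It does not compare against the full file list.

```shell
#!/bin/sh
# Sketch: sanity-check a download directory (hypothetical helper, not a CellBase tool).
set -eu

# check_download ROOT
# Returns 0 if ROOT/homo_sapiens contains gene, sequence and variation
# and every *.gz file below it passes the gzip integrity test.
check_download() {
    root="$1/homo_sapiens"
    status=0
    for sub in gene sequence variation; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing directory: $root/$sub" >&2
            status=1
        fi
    done
    if [ -d "$root" ]; then
        # gzip -t tests each archive without extracting it; find reports
        # a non-zero status if any gzip invocation fails.
        if ! find "$root" -type f -name '*.gz' -exec gzip -t {} +; then
            echo "at least one .gz file is corrupt or incomplete" >&2
            status=1
        fi
    fi
    return "$status"
}

# Run the check only when a directory is supplied through DOWNLOAD_DIR.
if [ -n "${DOWNLOAD_DIR:-}" ]; then
    check_download "$DOWNLOAD_DIR"
fi
```

For the example above, it could be run as `DOWNLOAD_DIR=/tmp/downloadTest sh check_download.sh` (the script name is arbitrary); a non-zero exit status indicates a missing directory or a damaged archive.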
If the download was successful, you can proceed to build the JSON objects that will be loaded into the corresponding database: Build & Load Data.