Manifest-based Data Submission

Q. How do I submit a pre-existing backlog of big data, e.g. 1 petabyte?

A. At minimum, you will need a file object manifest. Continue reading for how.

Q. Do we need to physically move pre-existing data buckets into the Gen3 instance?

A. No. Your existing data can be kept as-is, where they are. This includes data volumes on HPC. You simply index them with a manifest. Continue reading for how.

Context

  • In Gen3, there are a couple of ways to submit data.
  • Typical end-user submission of a smaller dataset (a couple of files) from your desktop/workstation follows the normal data submission procedure.
  • However, this is not always the case: there may be a backlog of data stored in cloud storage buckets or on data volumes of an HPC cluster.
  • You can bring all such data that sit elsewhere into Gen3 indexing and the catalogue metadata graph model.
  • This kind of data submission in Gen3 goes by a couple of names:
    • manifest-based data submission
    • out-of-band data ingestion
    • DIIRM indexing (DIIRM - Data Ingestion, Integration, and Release Management)
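As a rough illustration, a manifest is typically a tab-separated file with one row per object. The exact column names and order below (guid, md5, size, acl, url) are an assumption based on common Gen3 indexing layouts; check the manifest format expected by your Gen3 instance. The bucket path and ACL values are placeholders.

```tsv
guid	md5	size	acl	url
255e396f-f1f8-11e9-9a07-0a80fada099c	473d83400bc1bc9dc635e334faddf33c	363455714	[*]	s3://my-existing-bucket/path/to/file_one.bam
256e396f-f1f8-11e9-9a07-0a80fada098c	473d83400bc1bc9dc635e334fadd433c	543434443	[phs0001,phs0002]	s3://my-existing-bucket/path/to/file_two.bam
```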

Notes

Each sub-directory contains a demo example of manifest-based data submission, using mock data or a public dataset for exploration; topics include Consent & Data Access (ACL), interoperability (DRS & htsget), external buckets, and so on.

REF

For more technical details:

Bucket manifest:

Note that this essentially entails how you generate a manifest file for existing big data. The process can run outside Gen3, e.g. as a batch job in the cloud (AWS/GCP) or on HPC.
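A minimal sketch of such an out-of-band batch job, in Python: walk the existing files, compute each object's MD5 checksum and size, and emit one manifest row per object. The column names, the placeholder ACL, and the bucket URL are assumptions for illustration; a real job would list objects via the cloud provider's API (or an HPC filesystem scan) rather than local paths, and would use the checksums the object store already records where available.

```python
import csv
import hashlib
import uuid
from pathlib import Path


def make_manifest_row(path: Path, base_url: str) -> dict:
    """Build one manifest row for a file that already lives in storage.

    Column layout (guid, md5, size, acl, url) is a common Gen3-style
    layout, assumed here for illustration; adjust to your instance.
    """
    md5 = hashlib.md5(path.read_bytes()).hexdigest()
    return {
        "guid": str(uuid.uuid4()),          # new unique ID for indexing
        "md5": md5,
        "size": path.stat().st_size,
        "acl": "[*]",                        # placeholder; set per project
        "url": f"{base_url}/{path.name}",    # where the object already lives
    }


def write_manifest(files, base_url: str, out_path) -> None:
    """Write a tab-separated manifest covering the given files."""
    fieldnames = ["guid", "md5", "size", "acl", "url"]
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        for f in files:
            writer.writerow(make_manifest_row(Path(f), base_url))


if __name__ == "__main__":
    # Demo with a mock local file standing in for an existing bucket object.
    import tempfile
    tmp = Path(tempfile.mkdtemp())
    (tmp / "sample.bam").write_bytes(b"mock data")
    write_manifest([tmp / "sample.bam"], "s3://my-existing-bucket", tmp / "manifest.tsv")
    print((tmp / "manifest.tsv").read_text())
```

For petabyte-scale backlogs, the same per-object loop would be fanned out across many batch workers, each handling a shard of the object listing, with the partial manifests concatenated at the end.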

Alternatively, you could utilise the Gen3 EKS Kubernetes cluster. If so, technical pointers are as follows.