We need to manage a data lake that stores blockchain data and serves queries over it.
The data is divided into chunks. Each chunk corresponds to a particular blockchain and block range, and stores the block-related data.
There is a set of workers. Each worker downloads an assigned subset of data chunks from a large persistent storage and is responsible for serving queries over its assigned data.
This is the data management component of a worker, with the following interface:
use std::collections::HashMap;
use std::ops::Range;
use std::path::{Path, PathBuf};
pub type DatasetId = [u8; 32];
pub type ChunkId = [u8; 32];
/// Data chunk description
pub struct DataChunk {
    id: ChunkId,
    /// Dataset (blockchain) id
    dataset_id: DatasetId,
    /// Block range this chunk is responsible for (around 100 - 10000 blocks)
    block_range: Range<u64>,
    /// Data chunk files.
    /// A mapping between file names and HTTP URLs to download files from.
    /// Usually contains 1 - 10 files of various sizes.
    /// The total size of all files in the chunk is about 200 MB.
    files: HashMap<String, String>,
}
pub trait DataManager: Send + Sync {
    /// Create a new `DataManager` instance that will use `data_dir` to store the data.
    ///
    /// When `data_dir` is not empty, this method should build a list of fully downloaded chunks
    /// and use it as the initial state.
    fn new(data_dir: PathBuf) -> Self;

    /// Schedule a `chunk` download in the background.
    fn download_chunk(&self, chunk: DataChunk);

    /// List chunks that are currently available.
    fn list_chunks(&self) -> Vec<ChunkId>;

    /// Find a chunk from the given dataset that is responsible for `block_number`.
    fn find_chunk(&self, dataset_id: DatasetId, block_number: u64) -> Option<impl DataChunkRef>;

    /// Schedule a data chunk for deletion in the background.
    fn delete_chunk(&self, chunk_id: ChunkId);
}
/// A data chunk must remain available and untouched until this reference is dropped.
pub trait DataChunkRef: Send + Sync + Clone {
    /// Data chunk directory.
    fn path(&self) -> &Path;
}
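One way to satisfy the "available until dropped" contract is a reference-counted guard. The sketch below is a minimal illustration, not part of the required interface: the names `ChunkEntry`, `ChunkGuard`, and `can_delete` are hypothetical. The manager keeps one `Arc` per chunk, `find_chunk` hands out clones, and a strong count above 1 tells the manager that a query still holds the chunk.

```rust
use std::path::{Path, PathBuf};
use std::sync::Arc;

// Hypothetical internal entry owned by the manager; `dir` is the chunk's
// on-disk directory.
struct ChunkEntry {
    dir: PathBuf,
}

// Hypothetical guard handed out by `find_chunk`. Cloning is cheap (an Arc
// clone), and the chunk directory stays pinned while any clone is alive.
#[derive(Clone)]
struct ChunkGuard {
    inner: Arc<ChunkEntry>,
}

impl ChunkGuard {
    // Would back `DataChunkRef::path` in a real implementation.
    fn path(&self) -> &Path {
        &self.inner.dir
    }
}

// The manager keeps its own Arc per chunk; a strong count above 1 means some
// query still holds a guard, so deletion must be deferred.
fn can_delete(managers_copy: &Arc<ChunkEntry>) -> bool {
    Arc::strong_count(managers_copy) == 1
}
```

With this layout, `delete_chunk` can mark the chunk for deletion immediately and let the background task remove the directory once the count drops back to 1.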
DataManager requirements:
- `DataManager` functions must not block on IO or hold mutexes for a long time, and should be cheap in general.
- `DataManager` must reach the desired state formed by the download-delete call sequence in a fast and smooth manner.
- `DataManager` must make no assumptions about the download-delete call sequence, except the one below: we assume that at any point in time there is enough disk space to hold the currently desired state.
- Only fully downloaded chunks must be available for queries.
- Each chunk has an approximate size of 200 MB, and each worker can store up to 1 TB of data on disk.
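These requirements suggest separating the cheap, in-memory bookkeeping from the actual IO. The sketch below is one plausible core under assumed names (`State`, `ManagerCore`, `plan` are not part of the required interface): `download_chunk` and `delete_chunk` only flip entries in a `desired` set under a briefly held mutex, so they never block on IO, while a background task would diff `desired` against `downloaded` and perform the real downloads and deletions.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

type ChunkId = [u8; 32];

// Hypothetical in-memory state: `desired` is what download/delete calls ask
// for, `downloaded` is what actually sits on disk.
#[derive(Default)]
struct State {
    desired: HashSet<ChunkId>,
    downloaded: HashSet<ChunkId>,
}

// Sketch of a manager core: public calls mutate `desired` only, keeping the
// mutex held for a few map operations at most.
struct ManagerCore {
    state: Arc<Mutex<State>>,
}

impl ManagerCore {
    fn new() -> Self {
        ManagerCore { state: Arc::new(Mutex::new(State::default())) }
    }

    fn download_chunk(&self, id: ChunkId) {
        self.state.lock().unwrap().desired.insert(id);
    }

    fn delete_chunk(&self, id: ChunkId) {
        self.state.lock().unwrap().desired.remove(&id);
    }

    // One reconciliation step: what still needs downloading, what can go.
    // A real implementation would run this in a background task and do the
    // IO outside the lock.
    fn plan(&self) -> (Vec<ChunkId>, Vec<ChunkId>) {
        let s = self.state.lock().unwrap();
        let to_download = s.desired.difference(&s.downloaded).copied().collect();
        let to_delete = s.downloaded.difference(&s.desired).copied().collect();
        (to_download, to_delete)
    }
}
```

Because the reconciler always works from the latest `desired` snapshot, any download-delete call sequence converges to the same state, and a chunk enters `downloaded` (and thus becomes queryable) only after all its files are fully fetched.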