Replies: 3 comments 3 replies
-
Thanks for the proposal @andylizf! It is awesome. Several suggestions:
In Approach 1,
In Approach 2, can we quickly investigate how Airflow implements data movement? I think it would help us design the API ;) Also, one design principle would be to imagine a user who doesn't want to change their existing code structure (including where to output the data). Can we support this situation?
Related to the extensions, I think those are awesome, but one concern is that the user needs to deeply integrate with our library in the code. cc @Michaelvll for some inputs here.
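For reference on the Airflow question, a minimal sketch of Airflow's TaskFlow/XCom pattern (recent Airflow 2.x): small return values travel between tasks via XCom, while bulk data is expected to stay in external storage and be passed by reference. Task and bucket names below are illustrative.

```python
# Sketch of Airflow's data-passing model (TaskFlow API, recent Airflow 2.x).
# Return values are serialized into XCom (the metadata DB); bulk artifacts are
# passed by URI and fetched by the downstream task itself.
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def feature_pipeline():
    @task
    def preprocess() -> dict:
        # Only this small dict goes through XCom; Airflow does not move files.
        return {"features_uri": "s3://bucket/features.parquet"}

    @task
    def train(meta: dict):
        # Downstream reads the URI and fetches the data from external storage.
        print("training on", meta["features_uri"])

    train(preprocess())


feature_pipeline()
```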
-
After extensive offline discussion with @cblmemo, we found Approach 1 has clear issues with multi-upstream and multi-downstream setups. For example, specifying
We believe Approach 2's added complexity is worthwhile. With syntax like `(preprocess >> [train_a, train_b]).with_data('/data', size_gb=2.0)` we can simplify common cases while reducing verbosity, addressing Approach 2's main drawback in simple scenarios. This also means modifying the current YAML format to specify data flow edge by edge, instead of listing downstreams per node.
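For concreteness, a sketch of how that shorthand could read in a full DAG definition. The `>>`-with-a-list form and `with_data` are the proposed syntax from this comment, not an existing API; task names and the size estimate are illustrative.

```python
import sky

with sky.Dag() as dag:
    preprocess = sky.Task(name='preprocess', run='python3 preprocess.py')
    train_a = sky.Task(name='train_a', run='python3 train_a.py')
    train_b = sky.Task(name='train_b', run='python3 train_b.py')

    # Proposed edge-wise shorthand: ship the upstream's /data directory to both
    # downstream tasks; size_gb is a size estimate used for transfer planning.
    (preprocess >> [train_a, train_b]).with_data('/data', size_gb=2.0)
```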
-
Approach 0: The Ideal
Claim: Users understand dependencies, not necessarily DAGs.

```python
@Task
def preprocess():
    # preprocessing ...
    ...

@Task
def train_a():
    preprocess()
    # do training
    ...

@Task
def train_b():
    preprocess()
    # do training
    ...

launch(train_b)
launch(train_a)
```

Other examples of this approach: Regent/Legion, Parsl, Dask.
Pros:
Cons:
Approach 3: Reasonable
Inspired by Approach 2 in andy's proposal.
In YAML, dependencies should be specified within the task definition:

```yaml
name: preprocess
resources:
  cloud: aws
setup: |
  pip install -r requirements.txt
run: |
  python3 preprocess.py
```

```yaml
name: train_a
resources:
  cloud: aws
setup: |
  pip install -r requirements.txt
dependson:
  preprocess:
    - /data/train_a
run: |
  python3 train_a.py
```

```yaml
name: train_b
resources:
  cloud: aws
setup: |
  pip install -r requirements.txt
dependson:
  preprocess:
    - /data/train_b
run: |
  python3 train_b.py
```

Similarly, in the Python API, task dependencies should be specified within the task definition:

```python
preprocess = Task(name="preprocess", run="python3 preprocess.py")
train_a = Task(name="train_a", run="python3 train_a.py", depends_on=["preprocess:/data/train_a"])
train_b = Task(name="train_b", run="python3 train_b.py", depends_on=["preprocess:/data/train_b"])
```
-
Background
Currently, data transfer between tasks lacks proper abstraction:
Proposed Approaches
Approach 1: Task Output Registration
This approach introduces automatic data transfer by establishing default paths and enabling customization with `set_output` and `set_input`. Environment variables provide easy path access, reducing hardcoded paths.
Custom Path Configuration with `set_output` and `set_input`
For cases where users prefer not to modify existing scripts with fixed paths, `set_output` and `set_input` allow specifying custom paths directly in the DAG. Suppose `preprocess.py` saves data to `/mnt/volume1/features.parquet` and `train.py` expects the data in `/mnt/data/features.parquet`. Previously, a manual script was required to move data between clusters; now we can configure this directly in the DAG, enabling automatic data transfer without modifying existing scripts:
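A minimal sketch of what that configuration could look like. `set_output` and `set_input` are the names proposed above; their exact signatures, the `from_task` argument, and the `sky.Dag`/`>>` construction are illustrative assumptions rather than an existing API.

```python
import sky

with sky.Dag() as dag:
    preprocess = sky.Task(name='preprocess', run='python3 preprocess.py')
    train = sky.Task(name='train', run='python3 train.py')
    preprocess >> train

    # preprocess.py already writes to /mnt/volume1/features.parquet and
    # train.py already reads from /mnt/data/features.parquet. The calls below
    # (hypothetical API) only declare those paths so the framework can move
    # the data between clusters automatically.
    preprocess.set_output('/mnt/volume1/features.parquet')
    train.set_input('/mnt/data/features.parquet', from_task=preprocess)
```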
Default Path Mapping with Environment Variables
For newly created scripts, the default path setup minimizes the need for custom configuration. By default, each task binds its output to `.sky/{task_name}/output`, making this path available to downstream tasks. In the DAG sketched below, `preprocess` has its output bound to `.sky/preprocess/output`, and this same path structure is mapped to `train` as its input.
In this setup, users configure `preprocess.py` to write output data to `.sky/preprocess/output` and `train.py` to read from this location. Additionally, environment variables provide dynamic access to paths, so user scripts remain unaffected by any changes to the input or output paths set in the DAG. By referencing `TASK_{TASK_NAME}_OUTPUT_PATH`, users avoid hardcoded paths, keeping scripts path-agnostic.
This configuration allows users to modify paths solely within the DAG while scripts adapt automatically, providing seamless data transfer with minimal setup.
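A minimal sketch of both conventions, assuming the `TASK_{TASK_NAME}_OUTPUT_PATH` template expands with an uppercased task name (e.g. `TASK_PREPROCESS_OUTPUT_PATH`) and that the `sky.Dag`/`>>` construction follows the syntax used elsewhere in this thread; everything here is illustrative, not an existing API.

```python
import sky

with sky.Dag() as dag:
    # Default mapping: preprocess's output is bound to .sky/preprocess/output
    # and exposed to downstream tasks under the same path.
    preprocess = sky.Task(
        name='preprocess',
        # Path-agnostic script: the output location is read from the
        # TASK_PREPROCESS_OUTPUT_PATH environment variable instead of being
        # hardcoded to .sky/preprocess/output.
        run='python3 preprocess.py --output "$TASK_PREPROCESS_OUTPUT_PATH"',
    )
    train = sky.Task(
        name='train',
        # The downstream reads from the same variable, so changing the path in
        # the DAG never requires touching the scripts.
        run='python3 train.py --input "$TASK_PREPROCESS_OUTPUT_PATH"',
    )
    preprocess >> train
```

The design choice here is that the DAG remains the single place where paths are decided; scripts only consume environment variables.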
Pros & Cons
✓ Simple, intuitive API
✓ Transparent path handling
✓ Automatic data transfer
✓ Clean separation of mechanism and convenience
✗ Lacks support for partitioned outputs to multiple downstreams
Approach 2: Edge-Based Data Flow
Specify data transfer on edges between tasks, enabling different paths for different downstream tasks.
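The original example for this approach is not reproduced in this thread, so the following is a purely hypothetical sketch of an edge-based declaration; `send_to` and its arguments are illustrative only (and the calls are assumed to imply the corresponding edges).

```python
import sky

with sky.Dag() as dag:
    preprocess = sky.Task(name='preprocess', run='python3 preprocess.py')
    train_a = sky.Task(name='train_a', run='python3 train_a.py')
    train_b = sky.Task(name='train_b', run='python3 train_b.py')

    # Hypothetical per-edge declaration: each downstream task receives a
    # different slice of the upstream output, mounted at its own path.
    preprocess.send_to(train_a, src='/outputs/shard_a', dst='/data/train_a')
    preprocess.send_to(train_b, src='/outputs/shard_b', dst='/data/train_b')
```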
Pros & Cons
✓ Explicit data routing
✓ Different paths for different downstream tasks
✓ Clear data flow visualization
✗ More verbose for simple cases
Recommendation
Approach 1 provides the best balance of simplicity and functionality.
Possible Extensions & Discussion
Dynamic Resource Requirements: Scale downstream resources based on actual data size. Particularly useful when data size varies significantly between runs.
Python-Level Data API: Replace file operations with Python objects for cleaner data access (see the sketch after this list).
Computation Hiding: Building on the Python-level data API, enable downstream tasks to start as soon as the upstream produces partial results.
Data Contracts: Validate and filter data dependencies between tasks. This also enhances Approach 1 with a cleaner abstraction for multi-downstream data routing.
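A purely hypothetical sketch of the Python-level data API idea; `output()` and `consume()` are illustrative names, not part of the proposal or of any existing library.

```python
import sky

with sky.Dag() as dag:
    preprocess = sky.Task(name='preprocess', run='python3 preprocess.py')
    train = sky.Task(name='train', run='python3 train.py')

    # Hypothetical object-based handle: instead of wiring raw file paths,
    # the downstream consumes a named output object (which also implies the
    # preprocess -> train edge). The handle is a natural hook for data
    # contracts (schema/size validation before transfer) and for streaming
    # partial results (computation hiding).
    features = preprocess.output('features', path='/outputs/features.parquet')
    train.consume(features, mount_path='/data/features.parquet')
```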
Discussion
We welcome feedback on the proposed designs and extensions.