How do I enable parallel processing in `source.read()`? #413

EgorKraevTransferwise · 2024-10-07T11:25:00Z


When trying to extract from a Confluence source, it looks like it's extracting one page at a time; is there a way to make multiple calls in parallel, for example via something like this?
I looked around in docs, issues and discussions for PyAirbyte and couldn't find anything like that.

aaronsteers (Maintainer) · 2024-10-22T03:56:46Z

Hi, @EgorKraevTransferwise - Thanks for raising this. There are really three paths forward here; unfortunately, none of them is a quick or easy solution.

Source-Managed Parallelism

Sources can drive their own parallelism, and we've recently invested significantly in concurrency improvements. This doesn't help if rate-limiting is the major factor, but it benefits all calling applications (PyAirbyte, Airbyte Cloud, OSS, etc.).

PyAirbyte-Managed Parallelism

We've considered options by which PyAirbyte would run streams in parallel - invoking the connector multiple times, once per stream, so that we effectively implement the parallelism ourselves. One problem with this approach is that some streams have parent-child relationships with each other, and these will be less efficient if we break them apart into separate workloads.
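To illustrate the parent-child concern, here is a minimal sketch of how streams could be partitioned into workloads so that each child stays with its root parent. This is not something PyAirbyte does today; `group_streams`, the `parent_of` mapping, and the stream names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_streams(streams, parent_of):
    """Partition streams into workloads, keeping each child with its root parent.

    `parent_of` is a hypothetical mapping of child stream name -> parent stream name.
    """

    def root(name):
        # Walk up the parent chain; `seen` guards against accidental cycles.
        seen = set()
        while name in parent_of and name not in seen:
            seen.add(name)
            name = parent_of[name]
        return name

    groups = defaultdict(list)
    for stream in streams:
        groups[root(stream)].append(stream)
    return list(groups.values())


# Hypothetical stream names, not taken from any real connector.
streams = ["projects", "issues", "issue_comments", "users"]
parent_of = {"issues": "projects", "issue_comments": "issues"}
workloads = group_streams(streams, parent_of)
```

Each resulting workload could then be handed to a separate connector invocation, so that only unrelated streams are split apart.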

Caller-Managed Parallelism

The PyAirbyte caller could, in theory, manage their own parallelism: the same approach as the PyAirbyte-managed option above, except that it is implemented in the calling Python code.

While this seems the most expedient implementation option for PyAirbyte users, we have some reports that this can cause race conditions, with parallel runs clobbering each other's results. I have not personally been able to reproduce this, but I logged it here:

Running multiple copies of PyAirbyte for the same stream will lead to errors #311
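A minimal sketch of what the caller-managed option might look like, assuming one connector invocation per stream and a separate local cache per stream (an untested guess at reducing the clobbering described above, not a confirmed fix). `read_one_stream`, `read_streams_in_parallel`, and `CONFLUENCE_CONFIG` are hypothetical names; whether PyAirbyte is safe to call from multiple threads is also an assumption. Note that this parallelizes across streams only, so it does not speed up page-by-page extraction within a single stream.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

CONFLUENCE_CONFIG = {}  # placeholder: fill in your own connector config


def read_one_stream(stream_name):
    """Hypothetical worker: read a single stream into its own local cache."""
    import airbyte as ab  # imported lazily so the generic helper below works without it

    source = ab.get_source(
        "source-confluence",
        config=CONFLUENCE_CONFIG,
        install_if_missing=True,
    )
    source.select_streams([stream_name])
    # Assumption: one cache per stream keeps parallel runs from sharing state.
    cache = ab.new_local_cache(cache_name=f"confluence_{stream_name}")
    return source.read(cache=cache)


def read_streams_in_parallel(stream_names, worker, max_workers=4):
    """Run `worker(stream_name)` for each stream in a thread pool.

    Returns (results, errors): two dicts keyed by stream name, so that one
    failing stream does not discard the results of the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, name): name for name in stream_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # collect the failure and keep going
                errors[name] = exc
    return results, errors
```

Usage would be something like `results, errors = read_streams_in_parallel(["pages", "blog_posts"], read_one_stream)`, where the stream names are assumed and should be checked against the connector's actual stream list.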


ZmeiGorynych · 2024-10-28T14:17:30Z


Thanks for the detailed answer! So right now, do I have any options for parallelizing Confluence extraction using PyAirbyte?
