How do I enable parallel processing in source.read()?
#413
Replies: 2 comments
-
Hi, @EgorKraevTransferwise - Thanks for raising this. There are really three paths forward here; unfortunately, none of them is a quick or easy solution.

Source-Managed Parallelism
Sources can drive their own parallelism, and we've recently invested significantly in concurrency improvements. This doesn't help if rate-limiting is the major factor, but it benefits all calling applications (PyAirbyte, Airbyte Cloud, OSS, etc.).

PyAirbyte-Managed Parallelism
We've considered options by which PyAirbyte would run streams in parallel, invoking the connector multiple times, once per stream, so that we implement the parallelism ourselves. One problem with this approach is that some streams have parent-child relationships with each other, and these will be less efficient if we break them apart into separate workloads.

Caller-Managed Parallelism
In theory, the PyAirbyte caller could manage their own parallelism, doing something like the PyAirbyte-managed option above but in the calling Python code. While this seems the most expedient option for PyAirbyte users, we have some reports that it could cause race conditions and/or results clobbering each other. I have not personally been able to repro this condition, but I logged it here:
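For the caller-managed option, a minimal sketch might look like the following. It is an illustration under assumptions, not a supported PyAirbyte feature: `read_streams_in_parallel` and `read_confluence_stream` are hypothetical helper names, the connector name `source-confluence` and the per-stream cache naming are assumptions, and giving each stream its own source and cache is only an attempt to sidestep the clobbering reports mentioned above. It also does nothing about parent-child streams or rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable


def read_streams_in_parallel(
    stream_names: Iterable[str],
    read_one: Callable[[str], Any],
    max_workers: int = 4,
) -> Dict[str, Any]:
    """Call read_one(stream_name) for every stream, each in a worker thread.

    Hypothetical helper, not part of PyAirbyte. Threads are used on the
    assumption that each read mostly waits on a connector process and I/O.
    """
    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the stream it was submitted for.
        futures = {pool.submit(read_one, name): name for name in stream_names}
        for future in as_completed(futures):
            # .result() re-raises any exception raised inside the worker.
            results[futures[future]] = future.result()
    return results


def read_confluence_stream(stream_name: str, config: dict) -> Any:
    """One possible read_one: a separate source and cache per stream.

    Assumes PyAirbyte is installed and that `config` is a valid
    Confluence connector configuration supplied by the caller.
    """
    import airbyte as ab  # imported lazily; the helper above does not need it

    source = ab.get_source(
        "source-confluence",  # assumed connector name
        config=config,
        install_if_missing=True,
    )
    source.select_streams([stream_name])
    # Assumption: one local cache per stream, so parallel reads do not
    # write into the same cache files.
    cache = ab.new_local_cache(cache_name=f"confluence_{stream_name}")
    return source.read(cache=cache)
```

A caller could then bind the config with `functools.partial(read_confluence_stream, config=my_config)` and pass that as `read_one`, together with the stream names to extract. Whether this is actually faster depends on how much of the time is spent waiting on Confluence rate limits.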
-
Thanks for the detailed answer! So right now, do I have any options for parallelizing Confluence extraction using PyAirbyte?
-
When trying to extract from a Confluence source, it looks like it's extracting one page at a time; is there a way to make multiple calls in parallel, for example via something like this?
I looked around in docs, issues and discussions for PyAirbyte and couldn't find anything like that.