[Python] Throw an error if a RecordBatchReader is read more than once #118
PyArrow RecordBatchReader objects are stateful, so reading from them is destructive: once consumed, they cannot be read again. They therefore don't behave the same as native DuckDB tables or Pandas/Polars DataFrames.
When we detect that a RecordBatchReader is encountered twice in a replacement scan, we throw an error.
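To illustrate the problem and the check, here is a minimal sketch in plain Python. `OneShotReader` is a hypothetical stand-in for a single-pass reader (it is not the PyArrow class), and `ScanTracker` and `ReaderAlreadyScannedError` are names invented for this example; none of this is the actual implementation in this PR.

```python
class OneShotReader:
    """Hypothetical stand-in for a stateful, single-pass batch reader."""

    def __init__(self, batches):
        self._it = iter(batches)

    def read_all(self):
        # Draining the iterator is destructive: a second call yields nothing.
        return list(self._it)


class ReaderAlreadyScannedError(RuntimeError):
    """Hypothetical error type for a reader seen twice."""


class ScanTracker:
    """Remembers which reader objects a scan has already encountered."""

    def __init__(self):
        # Strong references keep id() values from being reused by new objects.
        self._seen = {}

    def scan(self, reader):
        if id(reader) in self._seen:
            raise ReaderAlreadyScannedError("reader was already scanned")
        self._seen[id(reader)] = reader
        return reader.read_all()


reader = OneShotReader([[1, 2], [3]])
tracker = ScanTracker()

first = tracker.scan(reader)   # consumes the reader
silent = reader.read_all()     # without a check: silently empty

try:
    tracker.scan(reader)
    second_scan_raised = False
except ReaderAlreadyScannedError:
    second_scan_raised = True
```

Without the check, the second read returns an empty result rather than failing, which is the silent behavior the error is meant to surface.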
Remaining Issues
State is Connection-local
This state is kept inside the PythonContextState, which is created for every separate connection, meaning that this error does not occur when two separate connections read from the same record batch reader.
We should instead make the RecordBatchReaderRegistry global on the module level, using locks to ensure multiple connections can make use of it at the same time.
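A module-level registry guarded by a lock could look roughly like the sketch below. The names `GlobalReaderRegistry`, `_REGISTRY`, and `Connection` are hypothetical and only model the idea in Python; the real RecordBatchReaderRegistry lives in the client's native code and may be structured differently.

```python
import threading


class GlobalReaderRegistry:
    """Hypothetical process-wide registry of readers that were already seen."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = {}  # id(reader) -> reader, kept alive to avoid id reuse

    def register(self, reader):
        """Return True on first encounter, False if already registered."""
        with self._lock:
            if id(reader) in self._seen:
                return False
            self._seen[id(reader)] = reader
            return True


# One instance at module level, shared by every connection.
_REGISTRY = GlobalReaderRegistry()


class Connection:
    """Hypothetical stand-in for a database connection."""

    def scan(self, reader):
        if not _REGISTRY.register(reader):
            raise RuntimeError("reader was already scanned by some connection")
        return "scanned"


shared_reader = object()  # stand-in for a RecordBatchReader
results = []


def worker():
    # Each thread uses its own connection but the same reader.
    try:
        Connection().scan(shared_reader)
        results.append("ok")
    except RuntimeError:
        results.append("error")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one connection wins the registration; the others get the error. Holding strong references in the registry keeps readers alive for the lifetime of the process, so a real implementation would likely need some way to drop entries.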
Canceled Relations
We record the RecordBatchReader the moment it is encountered in the replacement scan. However, relations could theoretically be canceled, in which case they never run and the reader is never actually consumed.
Subsequent relations that reference the same RecordBatchReader then still cause an error to be thrown.
We should differentiate between found and consumed RecordBatchReaders so this doesn't happen.
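One way to model the found/consumed distinction is a small state machine, sketched below in Python. `ReaderState`, `ReaderStateRegistry`, `mark_found`, and `mark_consumed` are hypothetical names for illustration, and the point at which a reader counts as consumed (here, an explicit call) is an assumption.

```python
from enum import Enum, auto


class ReaderState(Enum):
    FOUND = auto()     # seen in a replacement scan, not yet read
    CONSUMED = auto()  # a relation actually executed and read from it


class ReaderStateRegistry:
    """Hypothetical registry that only rejects readers already consumed."""

    def __init__(self):
        self._entries = {}  # id(reader) -> (reader, state)

    def mark_found(self, reader):
        entry = self._entries.get(id(reader))
        if entry is not None and entry[1] is ReaderState.CONSUMED:
            raise RuntimeError("reader was already consumed")
        self._entries[id(reader)] = (reader, ReaderState.FOUND)

    def mark_consumed(self, reader):
        self._entries[id(reader)] = (reader, ReaderState.CONSUMED)

    def state_of(self, reader):
        return self._entries[id(reader)][1]


registry = ReaderStateRegistry()
reader = object()  # stand-in for a RecordBatchReader

registry.mark_found(reader)      # relation 1 is created, then canceled
registry.mark_found(reader)      # relation 2 may still reference the reader
state_before_run = registry.state_of(reader)

registry.mark_consumed(reader)   # relation 2 executes and drains the reader

try:
    registry.mark_found(reader)  # relation 3 is rejected
    third_rejected = False
except RuntimeError:
    third_rejected = True
```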
A somewhat related issue to "Canceled Relations": with a LIMIT, a query may only partially read from the RecordBatchReader, so it could be possible to read from the same record batch reader more than once without that being an error.
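The partial-read case can be pictured with a single-pass iterator, as in the sketch below. `OneShotReader` and `read_batches` are hypothetical stand-ins, and the `limit` argument here counts batches; how a LIMIT query would actually pull batches from the reader is an assumption.

```python
from itertools import islice


class OneShotReader:
    """Hypothetical stand-in for a single-pass batch reader."""

    def __init__(self, batches):
        self._it = iter(batches)

    def read_batches(self, limit=None):
        # Pull at most `limit` batches; None drains whatever is left.
        return list(islice(self._it, limit))


reader = OneShotReader([[1, 2], [3, 4], [5]])

first_query = reader.read_batches(limit=1)  # stops early, like a LIMIT
second_query = reader.read_batches()        # picks up where the first stopped
```

Both reads return data, so treating the second one as an error would reject a use that may be legitimate.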