Fix docstring & concurrency issue with duckdb #42

antoinejeannot · 2024-07-23T19:44:57Z

Discovered lea this afternoon after reading a Carbonfact job opening and wanted to know more about it!

So here is my attempt at fixing the main branch, hope you do not mind 😊
I read CONTRIBUTING.md and set up my environment as specified ✅

The first issue is due to a docstring typo, fixed in 9cee9d2

The second one was introduced in 0ed11a9 when bumping duckdb to 1.0.
By bisecting, I found that the issue was actually introduced by duckdb==0.10.1 (i.e. it works with 0.10.0) and is likely related to a deadlock between threads (using only one thread fixes the issue, two threads seem flaky, and more => 💀)
This also seems to be discussed in DuckDB's docs:

Using Connections in Parallel Python Programs
The DuckDBPyConnection object is not thread-safe. If you would like to write to the same database from multiple threads, create a cursor for each thread with the DuckDBPyConnection.cursor() method.

The culprit (l.66):

    def materialize_python_view(self, view):
        dataframe = self.read_python_view(view)  # noqa: F841
        # the shared connection is used directly on the next line
        self.con.sql(
            f"CREATE OR REPLACE TABLE {view.table_reference} AS SELECT * FROM dataframe"
        )

This inevitably leads to tests hanging, in CI & locally, which end up killed after a few hours.
Duplicating the connection using self.con.cursor() looks like the easiest short-term way to fix this issue, since the DuckDB client is likely to be used in concurrent scenarios. It is also used by the read_sql function:

    def materialize_python_view(self, view):
        dataframe = self.read_python_view(view)  # noqa: F841
        self.con.cursor().sql(
            f"CREATE OR REPLACE TABLE {view.table_reference} AS SELECT * FROM dataframe"
        )

    ...

    def read_sql(self, query: str) -> pd.DataFrame:
        return self.con.cursor().sql(query).df()

One might also create a connection on-the-fly without storing it using context managers such as:

    def materialize_python_view(self, view):
        dataframe = self.read_python_view(view)  # noqa: F841
        with duckdb.connect(str(self.path)) as con:
            con.sql(
                f"CREATE OR REPLACE TABLE {view.table_reference} AS SELECT * FROM dataframe"
            )

In the long term, using one explicit connection per thread would be a better and more elegant pattern (e.g. via dependency injection).

🟢 Tests pass locally and in my fork

Thanks!

MaxHalford

Hey Antoine! Thanks, this is a great contribution.

We use BigQuery at Carbonfact, and sometimes I make changes to lea without checking DuckDB 🙈

antoinejeannot added 2 commits July 23, 2024 21:03

duckdb: duplicate client connection for concurrency

633633b

bigquery: fix client docstring

9cee9d2

antoinejeannot changed the title ~~Fix main~~ Fix docstring & concurrency issue with duckdb Jul 24, 2024

MaxHalford approved these changes Jul 26, 2024

MaxHalford merged commit 61f1068 into carbonfact:main Jul 29, 2024
3 checks passed
