Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix docstring & concurrency issue with duckdb #42

Merged
merged 2 commits into from
Jul 29, 2024

Conversation

antoinejeannot
Copy link
Contributor

@antoinejeannot antoinejeannot commented Jul 23, 2024

Discovered lea this afternoon after reading a Carbonfact job opening and wanted to know more about it!

So here is my attempt at fixing the main branch, hope you do not mind 😊
I read and setup my environment as specified in CONTRIBUTING.md

The first issue is due to a docstring typo, fixed in 9cee9d2

The second one was introduced in 0ed11a9 when bumping duckdb to 1.0.
By bisecting, the issue was actually introduced by duckdb==0.10.1 (i.e. it works with 0.10.0) and is likely related to a deadlock between threads (using only one thread fixes the issue, two threads seems flaky, and more => 💀)
This seems to be also discussed in DuckerDB's docs:

Using Connections in Parallel Python Programs
The DuckDBPyConnection object is not thread-safe. If you would like to write to the same database from multiple threads, create a cursor for each thread with the DuckDBPyConnection.cursor() method.

The guilty: (l.66):

    def materialize_python_view(self, view):
        dataframe = self.read_python_view(view)  # noqa: F841
        # here v 
        self.con.sql(
            f"CREATE OR REPLACE TABLE {view.table_reference} AS SELECT * FROM dataframe"
        )

This inevitably leads to tests hanging, in CI & locally, which end up killed after a few hours.
Duplicating the connection using self.con.cursor() looks like the easiest short term way to fix this issue since the DuckDB client is likely to be used in concurrent scenarios. It is also used by theread_sql function:

    def materialize_python_view(self, view):
        dataframe = self.read_python_view(view)  # noqa: F841
        self.con.cursor().sql(
            f"CREATE OR REPLACE TABLE {view.table_reference} AS SELECT * FROM dataframe"
        )

    ...

    def read_sql(self, query: str) -> pd.DataFrame:
        return self.con.cursor().sql(query).df()

One might also create a connection on-the-fly without storing it using context managers such as:

    def materialize_python_view(self, view):
        dataframe = self.read_python_view(view)  # noqa: F841
        with duckdb.connect(str(self.path)) as con:
          con.sql(
              f"CREATE OR REPLACE TABLE {view.table_reference} AS SELECT * FROM dataframe"
          )

In the long term, using one explicit connection per thread would be a better & more elegant pattern (e.g. via dependency injection)

🟢 Tests pass locally and in my fork

Thanks!

@antoinejeannot antoinejeannot changed the title Fix main Fix docstring & concurrency issue with duckdb Jul 24, 2024
Copy link
Member

@MaxHalford MaxHalford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Antoine! Thanks, this is a great contribution.

We use BigQuery at Carbonfact, and sometimes I make changes to lea without checking DuckDB 🙈

@MaxHalford MaxHalford merged commit 61f1068 into carbonfact:main Jul 29, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants