
SQLAlchemy backlog #74

Open · 6 tasks
amotl opened this issue Jun 1, 2022 · 3 comments

Comments

@amotl
Member

amotl commented Jun 1, 2022

Hi there,

while working on crate/crate-python#391, I accumulated a number of backlog items. I will gather them in this ticket.

Internals

We've identified a few shortcomings in the internal implementation of the CrateDB SQLAlchemy dialect. While it works in general, these spots can be improved to better align with SQLAlchemy's internal API hooks and with how the CrateDB dialect interacts with them.

More

With kind regards,
Andreas.

@robd003

robd003 commented Jun 20, 2022

Getting support for async SQLAlchemy would be super useful

I have a few queries that take slightly over 1 second to execute, and being able to avoid blocking on them would be HUGE.

@amotl
Member Author

amotl commented Mar 30, 2023

Dear Robert,

support for asynchronous communication with SQLAlchemy, based on the asyncpg and psycopg3 drivers, is being evaluated at crate/crate-python#532. Please note that this is experimental, and we currently have no schedule for when or how it will be released.
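
For illustration, here is a minimal sketch of what asynchronous usage could look like once such a dialect becomes available. The `crate+asyncpg://` URL is an assumption for illustration only; `create_async_engine()` and friends are standard SQLAlchemy asyncio API.

```python
# Sketch only: assumes a hypothetical async-capable CrateDB dialect registered
# under "crate+asyncpg://"; the final dialect name and URL may differ.
import asyncio

import sqlalchemy as sa
from sqlalchemy.ext.asyncio import create_async_engine


async def main():
    # asyncpg speaks the PostgreSQL wire protocol, which CrateDB serves on port 5432.
    engine = create_async_engine("crate+asyncpg://crate@localhost:5432/")
    async with engine.connect() as conn:
        result = await conn.execute(sa.text("SELECT 42"))
        print(result.scalar_one())
    await engine.dispose()


asyncio.run(main())
```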

With kind regards,
Andreas.

@amotl
Member Author

amotl commented Jul 5, 2023

Coming from crate/crate-python#553 (comment) and crate/crate-python#553 (comment), there are a few additional backlog items for SQLAlchemy/pandas/Dask:

  • Expand support for fast-path bulk inserts to the other DML operations, UPDATE and DELETE, which the CrateDB bulk operations endpoint also offers, where this is sensible within a pandas/Dask environment (a sketch follows below this list).
  • Don't stop at to_sql(); also cover optimal read_sql() techniques in the documentation where possible (see the read_sql() sketch below this list).
  • For determining the number of available CPU cores on a machine, it may also make sense to refer to the os.cpu_count() or multiprocessing.cpu_count() functions, and/or the cpu-count package. [1]
  • Another detail could be to elaborate a bit on the other main argument to the dask.dataframe.from_pandas() function, chunksize=, which is mutually exclusive with the npartitions= argument. [2]
    ```python
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, chunksize=CHUNKSIZE)
    ```
    In this way, you don't specify the number of partitions, but instead use the chunk size as the parameter that configures the workload scheduler. This concept might fit even better with typical ETL tasks from/to database systems, which we are exploring here to make more efficient with CrateDB.
    You'd probably use the same chunksize value here that also configures the outbound batching towards the database, but I am not sure about that yet. If that is the case, it would make usage even easier across different scenarios: a user would only need to pick a good chunk size, no different from basic pandas usage, and would not need to spend much thought on compute resources at all (see the chunk-size sketch below this list).
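
Regarding the bulk operations item, here is a rough sketch of how CrateDB's HTTP endpoint accepts a parameterized UPDATE statement together with bulk_args; the same pattern applies to DELETE. The table and column names are made up for illustration, and the exact response layout should be double-checked against the CrateDB documentation.

```python
# Sketch: submit one parameterized UPDATE statement with "bulk_args" to the
# CrateDB HTTP endpoint; the statement is executed once per parameter row.
# Table "testdrive.foo" with columns (id, name) is hypothetical.
import requests

CRATEDB_SQL_ENDPOINT = "http://localhost:4200/_sql"

payload = {
    "stmt": "UPDATE testdrive.foo SET name = ? WHERE id = ?",
    "bulk_args": [
        ["apple", 1],
        ["banana", 2],
        ["cherry", 3],
    ],
}

response = requests.post(CRATEDB_SQL_ENDPOINT, json=payload, timeout=10)
response.raise_for_status()
# The response reports one result entry per bulk_args row.
print(response.json()["results"])
```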
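
Regarding read_sql(), a minimal sketch of chunked reading with pandas; the connection URL and table name are assumptions, and it presumes the CrateDB SQLAlchemy dialect is installed so that the crate:// URL resolves.

```python
# Sketch: stream a large result set in chunks instead of materializing it at once.
import pandas as pd
import sqlalchemy as sa

# Hypothetical connection URL and table name.
engine = sa.create_engine("crate://localhost:4200")

# chunksize= makes read_sql() return an iterator of smaller DataFrames,
# so the whole result set never needs to fit into memory at once.
for frame in pd.read_sql("SELECT * FROM testdrive_demo", con=engine, chunksize=5_000):
    print(len(frame))
```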
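
Regarding the core count and chunk size items, a sketch of how a single CHUNKSIZE value could drive both the Dask partitioning and the outbound batching of to_sql(). The connection URL, the table name, and the idea of reusing CHUNKSIZE for to_sql() are assumptions, and the exact to_sql() parameters may differ between Dask versions.

```python
# Sketch: derive parallelism from the available cores and reuse one CHUNKSIZE
# value for both Dask partitioning and the outbound batches of to_sql().
import os

import dask.dataframe as dd
import pandas as pd

CHUNKSIZE = 5_000
NCORES = os.cpu_count() or 1  # fall back to 1 if the count cannot be determined

df = pd.DataFrame({"id": range(50_000), "value": range(50_000)})

# Either partition by chunk size ...
ddf = dd.from_pandas(df, chunksize=CHUNKSIZE)
# ... or by the number of cores (mutually exclusive with chunksize=):
# ddf = dd.from_pandas(df, npartitions=NCORES)

# Hypothetical CrateDB connection URL; to_sql() batches rows per chunksize.
ddf.to_sql(
    "testdrive_demo",
    uri="crate://localhost:4200",
    if_exists="replace",
    index=False,
    chunksize=CHUNKSIZE,
    parallel=True,
)
```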

Footnotes

  1. Rationale: If you are professionally scheduling cluster workloads, you will probably know the number of cores in advance. Still, inquiring the number of available cores and using that figure on demand makes sense when your program is meant to run in different environments, for example in Jupyter notebooks which don't reach out to a cluster and only process fractions of the whole workload on smaller workstations, but still aim to utilize their resources as well as possible, i.e. to avoid using only 4 cores while 16 are available.

  2. dask.dataframe.from_pandas() function

    def from_pandas(
        data: pd.DataFrame | pd.Series,
        npartitions: int | None = None,
        chunksize: int | None = None,
        sort: bool = True,
        name: str | None = None,
    ) -> DataFrame | Series:
        """
        Construct a Dask DataFrame from a Pandas DataFrame
    
        This splits an in-memory Pandas dataframe into several parts and constructs
        a dask.dataframe from those parts on which Dask.dataframe can operate in
        parallel.  By default, the input dataframe will be sorted by the index to
        produce cleanly-divided partitions (with known divisions).  To preserve the
        input ordering, make sure the input index is monotonically-increasing. The
        ``sort=False`` option will also avoid reordering, but will not result in
        known divisions.
    
        Note that, despite parallelism, Dask.dataframe may not always be faster
        than Pandas.  We recommend that you stay with Pandas for as long as
        possible before switching to Dask.dataframe.
    
        npartitions : int, optional
            The number of partitions of the index to create. Note that if there
            are duplicate values or insufficient elements in ``data.index``, the
            output may have fewer partitions than requested.
        chunksize : int, optional
            The desired number of rows per index partition to use. Note that
            depending on the size and index of the dataframe, actual partition
            sizes may vary.
        """
    
