Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-966003: Fix Arrow return value for zero-row queries #1801

Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/snowflake/connector/cursor.py
Original file line number Diff line number Diff line change
Expand Up @@ -1117,7 +1117,7 @@ def fetch_arrow_batches(self) -> Iterator[Table]:
)
return self._result_set._fetch_arrow_batches()

def fetch_arrow_all(self) -> Table | None:
def fetch_arrow_all(self) -> Table:
Copy link
Collaborator

@sfc-gh-aling sfc-gh-aling Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-mkeller , do you think this is a API breaking change that previously we return None now we return an empty Table.

I'm thinking of introducing a parameter to protect the change.
(name TBD)
def fetch_arrow_all(self, return_table_even_no_data=False)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could do that and then use typing.overload to express the different return types based on that boolean flag.

I'd love if we had a plan on how to introduce such changes with a timeline to deprecate them over some time. Maybe some warning wit a date'd comment?

Copy link
Contributor Author

@thomasaarholt thomasaarholt Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts

  1. Here's a suggested fix using your ideas:
from typing import overload, Literal
from warnings import warn
from pyarrow import Table

# Temporarily omitting `self` in the methods in order to make it easy to test

@overload
def fetch_arrow_all(return_table_even_no_data: Literal[False] = False) -> None:
    ...


@overload
def fetch_arrow_all(return_table_even_no_data: Literal[True] = True) -> Table:
    ...


def fetch_arrow_all(return_table_even_no_data: bool = False) -> Table | None:
    if not return_table_even_no_data:
        warn(
            "The default behaviour of fetch_arrow_all when the query returns zero rows "
            "will change from returning None to returning an empty pyarrow table with "
            "schema using the highest bit length in version X.X.X.",
            DeprecationWarning,
            stacklevel=2,
        )
    # regular code here


foo = fetch_arrow_all(return_table_even_no_data=True) # Table, or "Unknown" if no type stubs for pyarrow are present
bar = fetch_arrow_all(return_table_even_no_data=False) # None

I'm happy to change any of this.

  1. You could consider this a bug-fix though. It is a discrepancy between the pandas and pyarrow apis. Depends on how strict you want to be about potentially breaking changes.

  2. If you follow semver, then I'd expect that you would only introduce breaking changes with major version change. As a user of the connector, it would be nice to know when one expects breaking changes, even if you don't follow semver.

Copy link
Collaborator

@sfc-gh-aling sfc-gh-aling Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @thomasaarholt, thanks for the proposal, I personally like it. re your third point, yeah.. sometime I feel torn when facing a breaking change bug fix, but the fact is that there might be programs that depend on this "bug" behavior, so even with a small change like this, something can go wrong in users env.

so I feel it would be safer to introduce a parameter for the new behavior.

I'm thinking of a better naming force_return_table instead of the return_table_even_no_data (or you have a better idea for the name)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you help make the code change based upon our discussion? really appreciate your contribution and the discussion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! I decided to omit the line containing the X.X.X version entirely.

self.check_can_use_arrow_resultset()
if self._prefetch_hook is not None:
self._prefetch_hook()
Expand Down
4 changes: 2 additions & 2 deletions src/snowflake/connector/result_set.py
Original file line number Diff line number Diff line change
Expand Up @@ -164,13 +164,13 @@ def _fetch_arrow_batches(
self._can_create_arrow_iter()
return self._create_iter(iter_unit=IterUnit.TABLE_UNIT, structure="arrow")

def _fetch_arrow_all(self) -> Table | None:
def _fetch_arrow_all(self) -> Table:
sfc-gh-aling marked this conversation as resolved.
Show resolved Hide resolved
"""Fetches a single Arrow Table from all of the ``ResultBatch``."""
tables = list(self._fetch_arrow_batches())
if tables:
return pa.concat_tables(tables)
else:
return None
return self.batches[0].to_arrow()
sfc-gh-aling marked this conversation as resolved.
Show resolved Hide resolved

def _fetch_pandas_batches(self, **kwargs) -> Iterator[DataFrame]:
"""Fetches Pandas dataframes in batches, where batch refers to Snowflake Chunk.
Expand Down
11 changes: 11 additions & 0 deletions test/integ/pandas/test_arrow_pandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -1167,6 +1167,17 @@ def test_simple_arrow_fetch(conn_cnx):
assert lo == rowcount


def test_arrow_zero_rows(conn_cnx):
with conn_cnx() as cnx:
with cnx.cursor() as cur:
cur.execute(SQL_ENABLE_ARROW)
cur.execute("select 1::NUMBER(38,0) limit 0")
table = cur.fetch_arrow_all()
# Snowflake will return an integery dtype with maximum bit-depth if
# no rows are returned
assert table.schema[0].type == pyarrow.int64()
thomasaarholt marked this conversation as resolved.
Show resolved Hide resolved


@pytest.mark.parametrize("fetch_fn_name", ["to_arrow", "to_pandas", "create_iter"])
@pytest.mark.parametrize("pass_connection", [True, False])
def test_sessions_used(conn_cnx, fetch_fn_name, pass_connection):
Expand Down
Loading