
SNOW-427759: fetch_pandas_all not using int64 for NUMBER(38, 0) #828

Closed
jonashaag opened this issue Aug 16, 2021 · 10 comments · Fixed by snowflakedb/snowpark-python#1101

@jonashaag

Connector version 2.5.1.

X is a NUMBER(38, 0) column with values that fit into int8. It is fetched as a decimal type rather than int64, unless you explicitly cast to NUMBER(18, 0):

# eng is a SQLAlchemy engine object
In [23]: res = eng.execute("select X from table(result_scan('...'))")
In [24]: res.cursor._result._fetch_arrow_all()
Out[24]:
pyarrow.Table
X: decimal(38, 0)

In [27]: cur = eng.execute("select X::number(18,0) from table(result_scan('...'))")
In [28]: cur.cursor._result._fetch_arrow_all()
Out[28]:
pyarrow.Table
LiteralRedacted0: int64
@github-actions github-actions bot changed the title fetch_pandas_all not using int64 for NUMBER(38, 0) SNOW-427759: fetch_pandas_all not using int64 for NUMBER(38, 0) Aug 16, 2021
@jonashaag
Author

Aha, this only seems to happen when doing a result_scan.

@sfc-gh-sshankar

@jonashaag, we're looking into this. In the meantime, given your comment above, can you provide sample code and the observed behavior without the result_scan?

@jonashaag
Author

Here's a minimal example.

# eng is SQLAlchemy engine object
>>> eng.execute("create table xxx.snow427759 (x number(38,0))")
>>> eng.execute("insert into xxx.snow427759 (x) values (1)")
>>> res = eng.connect().execute("select * from xxx.snow427759")
>>> res.cursor._result._fetch_arrow_all()
pyarrow.Table
X: int8
>>> res2 = eng.connect().execute(f"select * from table(result_scan('{res.cursor.sfqid}'))")
>>> res2.cursor._result._fetch_arrow_all()
pyarrow.Table
X: decimal(38, 0)

@sfc-gh-mkeller
Collaborator

@jonashaag I tried reproducing your issue and found that the following worked as expected.

import snowflake.connector
import pyarrow
config = {...}

with snowflake.connector.connect(
    **config,
) as cnx:
    with cnx.cursor() as cur:
        cur.execute("select 1::number(38,0) as x")
        res = cur.fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type)
        sfqid = cur.sfqid
        cur.execute(f"select * from table(result_scan('{sfqid}'))")
        res = cur.fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type)

I tried using a table and also running the result_scan from a brand-new connection, and I am unable to reproduce your issue.
Could you see if you can reproduce the issue with our Python connector only?

@jonashaag
Author

jonashaag commented Sep 20, 2021

Interesting. I can reproduce the behaviour with connector versions 2.4, 2.5, 2.6, using the following script:

import snowflake.connector
import pyarrow
config = ...

with snowflake.connector.connect(
    **config,
) as cnx:
    with cnx.cursor() as cur:
        cur.execute("select 1::number(38,0) as x")
        try:
            res = cur.fetch_arrow_all()
        except AttributeError:  # older connector versions lack fetch_arrow_all
            res = cur._result._fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type), res["X"].type
        sfqid = cur.sfqid
# Same behaviour with or without new conn.
# with snowflake.connector.connect(
#     **config,
# ) as cnx:
#     with cnx.cursor() as cur:
        cur.execute(f"select * from table(result_scan('{sfqid}'))")
        try:
            res = cur.fetch_arrow_all()
        except AttributeError:  # older connector versions lack fetch_arrow_all
            res = cur._result._fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type), res["X"].type
Traceback (most recent call last):
  File "t.py", line 22, in <module>
    assert pyarrow.types.is_int8(res["X"].type), res["X"].type
AssertionError: decimal128(38, 0)

@sfc-gh-mkeller
Collaborator

@jonashaag I copy-pasted your code snippet and I cannot reproduce your issue.
I'd need your query IDs for these 2 queries to continue my investigation. It might be that you have some non-default parameter set on your account or user that's causing this.

If you could post those here I could take a look. If you are not comfortable doing that then please open a support ticket with a link to this issue and then we can have a private communication channel that way.

@jonashaag
Author

Query ID of select 1::number(38,0): 019f8835-3200-a818-0000-1e692eedf386

Query ID of result_scan: 019f8835-3200-a7fb-0000-1e692eede432

@sfc-gh-mkeller
Collaborator

sfc-gh-mkeller commented Mar 2, 2022

Sorry, it took me a while to get to this.
I've now found the culprit; let me demonstrate:

from base64 import b64decode
from io import BytesIO
from pyarrow import RecordBatchStreamReader
from snowflake.connector import connect

with connect(
    ...
) as conn:
    with conn.cursor() as c:
        c.execute("select to_number(a) as b from (select '1' as a)")
        results = c.get_result_batches()[0]
        print(RecordBatchStreamReader(BytesIO(b64decode(results._data))).schema.types)
    with conn.cursor() as c2:
        c2.execute("select to_number('1') as a")
        results2 = c2.get_result_batches()[0]
        print(RecordBatchStreamReader(BytesIO(b64decode(results2._data))).schema.types)

What happens is that when you have a sub-query (I suspect this is more complicated and really depends on the metadata available at query execution time) Snowflake is unable to determine the maximum value in the column.

So it defaults to a type that can hold the maximum possible value. We allow up to 38 digits, which does not fit into a 64-bit integer, so we would need to go to 128 bits.
Now please take a look at https://arrow.apache.org/docs/cpp/api/datatype.html and notice that the widest integer type Arrow supports is 64 bits. So in this case, Snowflake can only safely put the results into a Decimal128.

The only mitigations that I see:

  1. Snowflake's back-end could do some extra metadata digging and try fitting the data into Int64, falling back to Decimal128 only when necessary. I could be wrong, but I think we could only do this for some specific queries, not universally, and consequently it could make all queries that deal with NUMBERs slower in the end.
  2. Arrow could introduce an Integer128 type. I'm not sure how widespread 128-bit integer implementations are in C++, though.
  3. We could introduce our own convenience function in the connector, so that if someone iterates through a column that's a NUMBER(X, 0) we could cast those values to Python's ints. I still don't really like this, because, just like in your case @jonashaag, there's no way for us to put them into the pyarrow.Table that you expect from fetch_arrow_*.
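
Option 3 above could look roughly like the following sketch; `number_column_to_ints` is a made-up name, not a real connector API. Python ints are arbitrary-precision, so all 38 digits survive the conversion, but the result no longer fits a fixed-width Arrow column:

```python
import decimal

def number_column_to_ints(values):
    """Hypothetical helper (not part of the connector): convert a
    scale-0 decimal column to Python ints, preserving all 38 digits."""
    return [int(v) if v is not None else None for v in values]

# Illustrative data, as it might arrive in a decimal128(38, 0) column:
col = [decimal.Decimal(1), decimal.Decimal(10**37), None]
assert number_column_to_ints(col) == [1, 10**37, None]
```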

Edit: I never included the script's output:

[Decimal128Type(decimal128(38, 0))]
[DataType(int8)]

@jonashaag
Author

I wonder why this doesn't happen in the original query but only in the result scan. Is that because it is a sub-query? Could the Snowflake back-end be taught to use the original query's metadata for the result_scan query?

@github-actions

To clean up and re-prioritize bugs and feature requests, we are closing all issues older than 6 months as of March 1, 2023. If there are any issues or feature requests that you would like us to address, please re-create them. For urgent issues, opening a support case via the Snowflake Community is the fastest way to get a response.
