
SNOW-427759: fetch_pandas_all not using int64 for NUMBER(38, 0) #828

Closed
jonashaag opened this issue Aug 16, 2021 · 10 comments · Fixed by snowflakedb/snowpark-python#1101

@jonashaag

Connector version 2.5.1.

X is a NUMBER(38, 0) column with values that fit into int8. It is fetched as a decimal type rather than int64, unless you explicitly cast to NUMBER(18, 0):

# eng is a SQLAlchemy engine object
In [23]: res = eng.execute("select X from table(result_scan('...'))")
In [24]: res.cursor._result._fetch_arrow_all()
Out[24]:
pyarrow.Table
X: decimal(38, 0)

In [27]: cur = eng.execute("select X::number(18,0) from table(result_scan('...'))")
In [28]: cur.cursor._result._fetch_arrow_all()
Out[28]:
pyarrow.Table
LiteralRedacted0: int64
@github-actions github-actions bot changed the title fetch_pandas_all not using int64 for NUMBER(38, 0) SNOW-427759: fetch_pandas_all not using int64 for NUMBER(38, 0) Aug 16, 2021
@jonashaag
Author

Aha, this only seems to happen when doing a result_scan.

@sfc-gh-sshankar

@jonashaag, we're looking into this. In the meantime, given your comment above, can you provide sample code and the observed behavior without the result_scan?

@jonashaag
Author

Here's a minimal example.

# eng is SQLAlchemy engine object
>>> eng.execute("create table xxx.snow427759 (x number(38,0))")
>>> eng.execute("insert into xxx.snow427759 (x) values (1)")
>>> res = eng.connect().execute("select * from xxx.snow427759")
>>> res.cursor._result._fetch_arrow_all()
pyarrow.Table
X: int8
>>> res2 = eng.connect().execute(f"select * from table(result_scan('{res.cursor.sfqid}'))")
>>> res2.cursor._result._fetch_arrow_all()
pyarrow.Table
X: decimal(38, 0)

@sfc-gh-mkeller
Collaborator

@jonashaag I tried reproducing your issue and found that the following worked as expected.

import snowflake.connector
import pyarrow
config = {...}

with snowflake.connector.connect(
    **config,
) as cnx:
    with cnx.cursor() as cur:
        cur.execute("select 1::number(38,0) as x")
        res = cur.fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type)
        sfqid = cur.sfqid
        cur.execute(f"select * from table(result_scan('{sfqid}'))")
        res = cur.fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type)

I tried using a table and also running the result_scan from a brand-new connection, and I am unable to reproduce your issue.
Could you see if you can reproduce the issue with our Python connector only?

@jonashaag
Author

jonashaag commented Sep 20, 2021

Interesting. I can reproduce the behaviour with connector versions 2.4, 2.5, 2.6, using the following script:

import snowflake.connector
import pyarrow
config = ...

with snowflake.connector.connect(
    **config,
) as cnx:
    with cnx.cursor() as cur:
        cur.execute("select 1::number(38,0) as x")
        try:
            res = cur.fetch_arrow_all()
        except AttributeError:  # older connector versions lack fetch_arrow_all
            res = cur._result._fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type), res["X"].type
        sfqid = cur.sfqid
# Same behaviour with or without new conn.
# with snowflake.connector.connect(
#     **config,
# ) as cnx:
#     with cnx.cursor() as cur:
        cur.execute(f"select * from table(result_scan('{sfqid}'))")
        try:
            res = cur.fetch_arrow_all()
        except AttributeError:  # older connector versions lack fetch_arrow_all
            res = cur._result._fetch_arrow_all()
        assert pyarrow.types.is_int8(res["X"].type), res["X"].type
Traceback (most recent call last):
  File "t.py", line 22, in <module>
    assert pyarrow.types.is_int8(res["X"].type), res["X"].type
AssertionError: decimal128(38, 0)

@sfc-gh-mkeller
Collaborator

@jonashaag I copy-pasted your code snippet and I cannot reproduce your issue.
I'd need your query IDs for these 2 queries to continue my investigation. It might be that you have some non-default parameter set on your account or user that's causing this.

If you could post those here I could take a look. If you are not comfortable doing that then please open a support ticket with a link to this issue and then we can have a private communication channel that way.

@jonashaag
Author

Query ID of select 1::number(38,0): 019f8835-3200-a818-0000-1e692eedf386

Query ID of result_scan: 019f8835-3200-a7fb-0000-1e692eede432

@sfc-gh-mkeller
Collaborator

sfc-gh-mkeller commented Mar 2, 2022

Sorry, it took me a while to get to this.
I've now found the culprit; let me demonstrate:

from base64 import b64decode
from io import BytesIO
from pyarrow import RecordBatchStreamReader
from snowflake.connector import connect

with connect(
    ...
) as conn:
    with conn.cursor() as c:
        c.execute("select to_number(a) as b from (select '1' as a)")
        results = c.get_result_batches()[0]
        print(RecordBatchStreamReader(BytesIO(b64decode(results._data))).schema.types)
    with conn.cursor() as c2:
        c2.execute("select to_number('1') as a")
        results2 = c2.get_result_batches()[0]
        print(RecordBatchStreamReader(BytesIO(b64decode(results2._data))).schema.types)

What happens is that when you have a sub-query (I suspect this is more complicated and really depends on the metadata available at query execution time) Snowflake is unable to determine the maximum value in the column.

So it defaults to a type that can hold the maximum possible value. We allow up to 38 digits, which does not fit into a 64-bit integer, so we would need to go to 128 bits.
Now please take a look at https://arrow.apache.org/docs/cpp/api/datatype.html and notice that the widest integer type Arrow supports is 64 bits. So in this case, Snowflake can only safely put the results into a Decimal128.

The only mitigations that I see:

  1. Snowflake's back-end could do some extra metadata digging and try fitting the data into Int64, falling back to Decimal128 only when necessary. I could be wrong, but I think we could only do this for some specific queries, not universally, and consequently it could make all queries that deal with NUMBERs slower in the end.
  2. Arrow could introduce an Integer128 type. I'm not sure how widespread 128-bit integer implementations are in C++, though.
  3. We could introduce our own convenience function in the connector, so that if someone iterates through a column that's a NUMBER(X, 0) we could cast those values to Python's ints. I still don't really like this, because, just like in your case @jonashaag, there's no way for us to put them into the pyarrow.Table that you expect from fetch_arrow_*.
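
Option 3 above could look roughly like the following sketch; `number_column_to_ints` is a made-up name, not a real connector API. Python ints are arbitrary-precision, so all 38 digits survive the conversion, but the result no longer fits a fixed-width Arrow column:

```python
import decimal

def number_column_to_ints(values):
    """Hypothetical helper (not part of the connector): convert a
    scale-0 decimal column to Python ints, preserving all 38 digits."""
    return [int(v) if v is not None else None for v in values]

# Illustrative data, as it might arrive in a decimal128(38, 0) column:
col = [decimal.Decimal(1), decimal.Decimal(10**37), None]
assert number_column_to_ints(col) == [1, 10**37, None]
```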

Edit: I never included the script's output:

[Decimal128Type(decimal128(38, 0))]
[DataType(int8)]

@jonashaag
Author

I wonder why this doesn't happen in the original query but only in the result scan. Is that because it is a sub-query? Could the Snowflake back-end be taught to use the original query's metadata for the result_scan query?

@github-actions

To clean up and re-prioritize bugs and feature requests, we are closing all issues older than 6 months as of March 1, 2023. If there are any issues or feature requests that you would like us to address, please re-create them. For urgent issues, opening a support case via the Snowflake Community is the fastest way to get a response.
