SNOW-259668: write_pandas fails for some datetime64 values using PyArrow backend #600
Comments
Just as a temporary workaround, I use pdf['some_datetime_column'] = pdf['some_datetime_column'].apply(lambda datetime_val: datetime_val.ceil('ms')). This probably doesn't work as expected; see the comment below. |
@plotneishestvo Thanks, yes, we're doing something similar. This is probably a little quicker for big tables:
|
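The snippet this comment referred to is not preserved above. As a rough sketch of what a vectorized version of the same workaround could look like, the following uses the pandas `.dt.ceil` accessor instead of a per-row `apply`. The column name comes from the earlier comment; the sample values are made up for illustration. Note that `ceil` rounds up, so values with sub-millisecond precision are changed, not merely truncated.

```python
import pandas as pd

# Two made-up nanosecond timestamps: one with sub-millisecond digits, one without.
pdf = pd.DataFrame(
    {
        "some_datetime_column": pd.to_datetime(
            [1621291701002000123, 1621291701002000000], unit="ns"
        )
    }
)

# Vectorized rounding up to the next whole millisecond (no Python-level loop).
pdf["some_datetime_column"] = pdf["some_datetime_column"].dt.ceil("ms")

print(pdf)
```

If losing precision downward is preferable to rounding up, `.dt.floor("ms")` is the analogous call.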
Actually for me this |
Hm, what's the column type? I have a test that roundtrips a DataFrame with datetime64 to/from Snowflake, asserting that everything returned matches what was inserted. datetime64 columns are processed using the vectorized |
Oh, we're also using |
I've tried to use from snowflake.connector.pandas_tools import pd_writer
df.to_sql('oh_my_table', engine, index=False, method=pd_writer, if_exists='append') and it automatically creates a table for the dataframe, so it creates it with timestamp_type My solution at the moment is to cast date fields to ISO strings and manually create the table before using the pandas integration, so Snowflake automatically converts the strings to dates, and it works. Just writing it here in case somebody else is tackling this, because it took me a lot of time to discover that one small table had future dates because of this issue :D |
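A minimal sketch of the "cast date fields to ISO strings" step described above, assuming a hypothetical dataframe with a `date_field` column. The table creation and upload steps are omitted because they need a live Snowflake connection.

```python
import pandas as pd

# Hypothetical dataframe with one millisecond-precision datetime column.
df = pd.DataFrame(
    {
        "id": ["a"],
        "date_field": pd.to_datetime([1621291701002], unit="ms"),
    }
)

# Replace the datetime column with ISO 8601 strings (microsecond resolution),
# so the database parses the text rather than a Parquet timestamp.
df["date_field"] = df["date_field"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f")

print(df["date_field"].iloc[0])
```

The trade-off is that the column is no longer a datetime dtype on the pandas side, so any later datetime arithmetic needs a conversion back.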
Okay, it looks like Snowflake also has trouble understanding Parquet v2 timestamp values. I tried to upload data using Parquet version '2.0' and I got the same First, I passed a parameter to pandas to store the same dataframe as Parquet version 2.0: df.to_parquet('my_awasome_parquet.parquet', compression='snappy', engine='pyarrow', version='2.0') # version='2.0' does the trick and just to check pq.read_metadata('my_awasome_parquet.parquet')
#output:
<pyarrow._parquet.FileMetaData object at 0x11ee05778>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 2.0
serialized_size: 1940
After that, I manually uploaded data to the table using this Parquet 2.0 file and got the same SELECT $1:date_field FROM @"TEST_DB"."PUBLIC".oh_my_df ; However, casting to timestamp resulted as SELECT $1:date_field::timestamp_ntz FROM @"TEST_DB"."PUBLIC".oh_my_df ; |
Okay, starting with pandas 1.1.0 it automatically uses Parquet 2.0, and this is why I started receiving |
This is a little out of scope for this issue, but if at all possible it would be helpful to support newer versions of PyArrow (they just released v3 early this week), for compatibility and bugfixes, but also because the early versions of PyArrow required by the Snowflake connector are enormous payloads, over 200MB IIRC. Newer releases are in the 50MB range. This can have a significant impact on deployments. |
👀 #565 |
I drafted a possible solution to the issue here: |
Hi all, we continue to work on a long-term solution internally. We're sorry about any inconvenience this issue might be causing. As a temporary workaround, we suggest passing a timezone to any timestamp values when building your pandas dataframe. The following works for millisecond and microsecond timestamps. Instead of: ts = pd.Timestamp(1621291701002, unit="ms")
pd.DataFrame([('a', ts)], columns=['ID', 'DATE_FIELD']) you can define the dataframe as: ts = pd.Timestamp(1621291701002, unit="ms", tz="UTC")
pd.DataFrame([('a', ts)], columns=['ID', 'DATE_FIELD']) When the dataframe is unloaded into a parquet file and COPY'ed into Snowflake, the value will be correctly parsed: As for ns timestamps, specifically the issue mentioned by @willsthompson, using |
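To make the difference between the two definitions above concrete, here is a small sketch that builds both dataframes and inspects the timezone attached to the column. It only shows the pandas side; the claim that the tz-aware value is parsed correctly after COPY comes from the comment above and is not verified here. The dataframes are built from dicts rather than tuples purely for clarity.

```python
import pandas as pd

# Naive timestamp (no timezone), as in the first definition.
ts_naive = pd.Timestamp(1621291701002, unit="ms")
df_naive = pd.DataFrame({"ID": ["a"], "DATE_FIELD": [ts_naive]})

# Timezone-aware timestamp, as in the suggested workaround.
ts_aware = pd.Timestamp(1621291701002, unit="ms", tz="UTC")
df_aware = pd.DataFrame({"ID": ["a"], "DATE_FIELD": [ts_aware]})

print(df_naive["DATE_FIELD"].dt.tz)  # no timezone attached
print(df_aware["DATE_FIELD"].dt.tz)  # UTC
```

For an existing naive column, `df["DATE_FIELD"].dt.tz_localize("UTC")` attaches a timezone without rebuilding the dataframe.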
I came across the same problem; I solved it by converting numpy datetime64 to object: |
It seems like this issue still persists. Moreover, for [ns] timestamps it seems like it can also affect the Pandas |
I solved the problem by adding
In chunk.to_parquet(
chunk_path,
compression=compression,
allow_truncated_timestamps=True,
) or alternatively: chunk.to_parquet(
chunk_path,
compression=compression,
use_deprecated_int96_timestamps=True,
) Does this solve your problem too? |
Hey, I am trying the proposed alternative from @sfc-gh-kzaveri and I am sending a dataframe with tz defined
The table is created as TIMESTAMP_NTZ. It is uploaded without time zone info. And when read with
Any ideas what might make this behave like this? Any specific versions I should be using? |
Having similar issues with Snowpark
|
Thanks for pointing this out! The following snippet helped me: df['Date'] = df['Date'].dt.tz_localize("UTC+01:00").dt.ceil(freq='ms') |
Adding the parameter
|
We are internally working on a more permanent solution and will provide an update next quarter. |
Hi @sfc-gh-aalam, do you have an update on this? |
Take a look at #1687. Can you try using |
Nice! This works with sqlalchemy as well. Previously had to convert
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
from snowflake.connector.pandas_tools import make_pd_writer
from snowflake.sqlalchemy import DATETIME

creds = {
    'user': '',
    'password': '',
    'account': '',
    'warehouse': '',
    'role': '',
    'database': ''
}
engine = create_engine(URL(**creds))

time = pd.date_range('2023-01-01', '2023-01-31', freq='1h')
df = pd.DataFrame({
    'timestamp': time,
    'test_col': ['XYZ123'] * len(time),
    'values': np.random.random(len(time))
})
# df['timestamp'] = df.timestamp.astype(str) ### previous workaround for proper pd.Timestamp -> snowflake timestamp conversion

with engine.connect() as con:
    df.to_sql(
        name='test_table',
        schema='schema_name',
        index=False,
        con=con,
        if_exists='replace',
        # dtype={'timestamp': DATETIME}, ### previous workaround for proper pd.Timestamp -> snowflake timestamp conversion
        # method=pd_writer ### previous workaround
        method=make_pd_writer(use_logical_type=True)
    )
|
closing this issue as |
Please answer these questions before submitting your issue. Thanks!
python --version: Python 3.8.5
python -c 'import platform; print(platform.platform())': macOS-10.16-x86_64-i386-64bit
pip freeze:
I invoked snowflake.connector.pandas_tools.write_pandas on a DataFrame with a column of type datetime64[ns] (using PyArrow as the default backend for ParquetWriter).
I expected the datetime data to be written to the database verbatim with nanosecond precision.
write_pandas fails when Arrow tries writing to Parquet using the default arguments: pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data
This problem, IIUC, is related to the PyArrow defaults, which for compatibility reasons default to Parquet version='1.0' behavior, which only supports ms timestamps, and also default to allow_truncated_timestamps=False, which raises an exception when any timestamp precision is lost during writing. The end result is always truncating ns-precision timestamps to ms-precision and therefore always throwing an exception.
Since Snowflake supports ns-precision timestamps, I would expect defaults that allow them to be written from DataFrames without error. However, since I imagine it's messy supporting various Parquet backends, I think at a minimum write_pandas should accept kwargs to be passed to the Parquet writer, so users can tailor the behavior they want (and work around backend-specific problems like this one).