[dagster-snowflake-pandas] pandas timestamp conversion fix #12190
Conversation
        index=False,
        method=pd_writer,
    )
except InterfaceError as e:
another option instead of reacting to a failed write would be to pre-emptively check if the dataframe has Timestamp data. If so, we check if there is timezone information, and raise an error if it doesn't have that info
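A minimal sketch of that pre-emptive check (the helper name and error type are assumptions, not actual dagster-snowflake-pandas code):

```python
import pandas as pd

def _check_timestamp_columns(df: pd.DataFrame) -> None:
    # Hypothetical pre-write check: reject datetime columns that lack
    # timezone information instead of waiting for the write to fail.
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]) and df[col].dt.tz is None:
            raise ValueError(
                f"Column '{col}' contains Timestamp data without timezone"
                " information. Add a timezone (e.g. .dt.tz_localize('UTC'))"
                " before storing it in Snowflake."
            )
```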
result = result.apply(_convert_string_to_timestamp, axis="index")
try:
    result = pd.read_sql(sql=SnowflakeDbClient.get_select_statement(table_slice), con=con)
except InterfaceError as e:
if we go down the route of https://github.com/dagster-io/dagster/pull/12190/files#r1100731596 this error check would get removed
Interesting. It would be great to fix this, but, as you point out, the compatibility issues are tricky. Is it possible to store timestamps without timezones as the TIMESTAMP_NTZ type? Alternatively, we could automatically convert them to UTC?
unfortunately no - pandas already tries to store timestamps with no timezone as TIMESTAMP_NTZ, but they get stored as "Invalid date" if we don't do any conversions. [example assets and screenshots of the stored values elided] An asset that returns timezone-aware timestamps is stored correctly - and funnily enough, the type of the date column in that case is also TIMESTAMP_NTZ
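For reference, a hypothetical pair of assets along the lines described above - the original examples were screenshots, so these names and values are stand-ins:

```python
import pandas as pd
from dagster import asset

@asset
def naive_timestamps() -> pd.DataFrame:
    # tz-naive Timestamps: pandas targets TIMESTAMP_NTZ, but the values
    # show up in Snowflake as "Invalid date" without any conversion
    return pd.DataFrame({"date": pd.to_datetime(["2023-01-01 12:00:00"])})

@asset
def aware_timestamps() -> pd.DataFrame:
    # tz-aware Timestamps: stored correctly, also as TIMESTAMP_NTZ
    return pd.DataFrame(
        {"date": pd.to_datetime(["2023-01-01 12:00:00"]).tz_localize("UTC")}
    )
```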
yeah we could do that. I think it still feels a bit similar to converting to strings because we're modifying user data without telling them, but it's less severe than a full type conversion so that's nice. we could document it and log a warning when we do a conversion, and maybe that's enough
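A rough sketch of what that warn-and-convert behavior could look like (the helper name and logging setup are assumptions, not the actual implementation):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def _add_missing_timezones(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical conversion: localize tz-naive datetime columns to UTC and
    # warn the user that their data is being modified on the way in.
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]) and df[col].dt.tz is None:
            logger.warning(
                "Timestamp column '%s' has no timezone; localizing to UTC so"
                " it can be stored in Snowflake.",
                col,
            )
            df[col] = df[col].dt.tz_localize("UTC")
    return df
```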
one of the things that feels particularly bad about converting the strings is that, if you serialize a dataframe and then deserialize it, the deserialized one will have different types than the serialized one, right? is there a way to maintain the property that what you put in is the same as what you get out?
I did a test of storing pandas Timestamp data in the current (master) version of the IO manager so we have documentation of how the types change. [dataframe dtypes and Snowflake column type screenshots elided] Before we return the dataframe, the timestamp column is converted back from strings, so to the dagster assets the datatypes don't change - but if you ran your own queries on the Snowflake table you would get string data
Another interesting thing to note - the pandas bigquery client auto-converts all non-timezone data to UTC
I'd like to revive this since the 1.2 release could be a good time for the breaking change. I think the three options are:
1. keep the current behavior and store timestamps without timezones as strings
2. only support timestamps with timezone information, raising an error otherwise
3. convert timestamps without timezones to UTC
if we go with 2 or 3 we could provide some guidance for migrating the table from string data to timestamp_ntz data
I think we should go with 1 or 3 - only supporting timestamps with timezones would be too inconvenient for most users. I'm very unsure of whether 1 or 3 is better. If we go with 3, what would happen to users who already have data stored as strings?
I'll put a test together to see what would happen to date data stored as strings. In the meantime, here's the message that kicked off looking into this issue again (basically they thought they were doing something wrong because their data was being stored as strings): https://dagster.slack.com/archives/C04EKVBA69Y/p1675295895479739. Not sure if this is enough justification for switching our implementation, but it's some context at least
Back on this train! Finally got some time to run the test, and here are my findings:
- If we do nothing to our code (ie keep the string conversion) and materialize a dataframe with time data, the timestamps are stored as strings.
- If we switch to converting non-timezone data to UTC and materialize a dataframe with time data, the timestamps are stored as TIMESTAMP_NTZ(9).
- If we switch to converting non-timezone data to UTC AND materialize an asset that was previously materialized using the string conversion scheme, the user still ends up with string data.

The last bullet point is what's most concerning, since the user will be expecting time data and will be getting string data. Here's a sequence of SQL queries that will convert a varchar time column to a timestamp_ntz column (probably a more efficient way to do this, but my SQL is bad) - the queries were elided; a sketch follows below.
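One standard approach, run through the Snowflake Python connector. The table/column names, placeholder credentials, and the add-column/update/swap strategy are all assumptions, not the author's exact queries:

```python
import snowflake.connector

# Hypothetical migration for a table MY_TABLE with a varchar column DATE_COL:
# copy the string data into a new TIMESTAMP_NTZ column, then swap the columns.
MIGRATION_QUERIES = [
    "ALTER TABLE MY_TABLE ADD COLUMN DATE_COL_TMP TIMESTAMP_NTZ",
    "UPDATE MY_TABLE SET DATE_COL_TMP = TO_TIMESTAMP_NTZ(DATE_COL)",
    "ALTER TABLE MY_TABLE DROP COLUMN DATE_COL",
    "ALTER TABLE MY_TABLE RENAME COLUMN DATE_COL_TMP TO DATE_COL",
]

conn = snowflake.connector.connect(
    account="<account>",  # placeholder credentials
    user="<user>",
    password="<password>",
    database="<database>",
    schema="<schema>",
)
try:
    cur = conn.cursor()
    for query in MIGRATION_QUERIES:
        cur.execute(query)
finally:
    conn.close()
```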
@benpankow @sryza I'd like to figure out a path forward for this for 1.3 in case we need to make a breaking change. The options I can think of are as follows: [options elided]

I'm inclined to do option 2 or 3 since it allows us to store the data as the user intended (ie as a timestamp), but if you have other opinions or concerns, let me know!
@jamiedemaria Is the pyarrow backend in Pandas 2.0 any help in resolving this?
Thanks for being persistent on this @jamiedemaria. Why does adding the UTC timestamp result in TIMESTAMP_NTZ(9)? I would have thought that data in Snowflake with a timestamp would show up as TIMESTAMP_TZ?
Would we want to detect this issue at write-time and error ("Must migrate!") so that users don't end up with incompatible strings in their table?
No idea. TIMESTAMP_TZ is what I would have expected too. I'll see if I can find anything in the Snowflake docs, but chances are slim.

Yeah, I could probably inspect the column types of the table before writing and ensure that timestamp data goes to a timestamp column.
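A minimal sketch of that write-time inspection, assuming a live Snowflake connection (the helper name and error message are hypothetical):

```python
import pandas as pd

def _check_column_types(conn, table_name: str, df: pd.DataFrame) -> None:
    # Hypothetical write-time guard: look up the existing column types and
    # fail fast if timestamp data would land in a non-timestamp column.
    cur = conn.cursor()
    cur.execute(f"DESCRIBE TABLE {table_name}")
    # DESCRIBE TABLE returns one row per column; the first two fields are
    # the column name and its type (e.g. "VARCHAR(16777216)").
    column_types = {row[0].lower(): row[1] for row in cur.fetchall()}
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            snowflake_type = column_types.get(col.lower(), "")
            if not snowflake_type.upper().startswith("TIMESTAMP"):
                raise ValueError(
                    f"Column '{col}' contains timestamp data, but the existing"
                    f" Snowflake column has type {snowflake_type!r}. Migrate"
                    " the table before writing."
                )
```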
re storing as TIMESTAMP_NTZ instead of TIMESTAMP_TZ - looks like this is something a lot of people see. There's a Snowflake issue for it, snowflakedb/snowflake-sqlalchemy#199, that has been open for 2 years...
Merge conflicts made this branch super difficult to use, so moving to #13097 with an implementation of what we've discussed here. Closing this PR.
Summary & Motivation
There is an issue with storing pandas Timestamp values in Snowflake where the year gets converted to an invalid year (for example, 48399). In PR #8760 I got around this by storing pandas timestamps as strings, but I finally found a GitHub issue indicating that the real fix is to include timezone information in your pandas timestamps (link).
This PR proposes removing the timestamp -> string conversion; however, there are still some things to consider: [list elided]
How I Tested These Changes