
[dagster-snowflake-pandas] pandas timestamp conversion fix #12190

Closed

Conversation

@jamiedemaria (Contributor) commented Feb 8, 2023

Summary & Motivation

There is an issue with storing pandas Timestamp values in Snowflake where the year gets converted to an invalid year (e.g. 48399). In PR #8760 I worked around this by storing pandas Timestamps as strings, but I finally found a GitHub issue (link) indicating that the real fix is to include timezone information in your pandas Timestamps.

This PR proposes removing the timestamp -> string conversion; however, there are still some things to consider:

  • Do we think this is the right approach, or should we stick with converting to strings? I think it is probably best to go down this route, because then we store the user data as the user intended, rather than doing sneaky string conversions in the backend. However, we should have some erroring/alerting system that tells users to add timezone info to their data so they aren't left to figure this bug out for themselves (see the error messaging in this PR).
  • This would likely be a breaking change, as users with timestamps without a timezone will begin to get the original error, and they may run into issues where the column is a string type and suddenly gets timestamp data. One option would be to hold off on merging this PR until we promote dagster-snowflake to 1.0.

How I Tested These Changes



index=False,
method=pd_writer,
)
except InterfaceError as e:
@jamiedemaria (Contributor Author) commented on the diff above, Feb 8, 2023:
Another option, instead of reacting to a failed write, would be to pre-emptively check whether the dataframe has Timestamp data; if so, we check whether there is timezone information, and raise an error if it doesn't have that info.
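
A minimal sketch of what that pre-emptive check could look like; the function name and error message are illustrative, not the actual implementation:

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_timestamp_columns_have_tz(df: pd.DataFrame) -> None:
    # Raise before writing if any datetime column is timezone-naive, since
    # Snowflake ends up storing tz-naive pandas Timestamps as "Invalid date".
    for col in df.columns:
        if is_datetime64_any_dtype(df[col]) and df[col].dt.tz is None:
            raise ValueError(
                f"Column '{col}' contains timestamps without timezone information. "
                f"Add one, e.g. df['{col}'] = df['{col}'].dt.tz_localize('UTC')."
            )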

result = result.apply(_convert_string_to_timestamp, axis="index")
try:
result = pd.read_sql(sql=SnowflakeDbClient.get_select_statement(table_slice), con=con)
except InterfaceError as e:
@jamiedemaria (Contributor Author) commented on the diff above:

If we go down the route of https://github.com/dagster-io/dagster/pull/12190/files#r1100731596, this error check would get removed.

@jamiedemaria force-pushed the jamie/snowflake-pandas/timestamp-bug branch from 6292c98 to 6338da8 on February 8, 2023 at 22:18
@jamiedemaria marked this pull request as ready for review on February 8, 2023 at 22:20
@jamiedemaria requested a review from sryza on February 8, 2023 at 22:20
@sryza (Contributor) commented Feb 9, 2023

Interesting. It would be great to fix this but, as you point out, the compatibility issues are tricky.

Is it possible to store timestamps without timezones as the TIMESTAMP_NTZ type? Alternatively, could we automatically convert them to UTC?

@jamiedemaria (Contributor Author) commented Feb 10, 2023

Is it possible to store timestamps without timezones as the TIMESTAMP_NTZ type?

Unfortunately, no: pandas already tries to store timestamps with no timezone as TIMESTAMP_NTZ, but they get stored as "Invalid date". For example, this asset

import pandas as pd
from dagster import asset

@asset
def time_asset() -> pd.DataFrame:
    time_df = pd.DataFrame(
        {
            "foo": ["bar", "baz"],
            "date": [
                pd.Timestamp("2017-01-01T12:30:45"),
                pd.Timestamp("2017-02-01T12:30:45"),
            ],
        }
    )
    return time_df

is stored as

Row | FOO | DATE
1   | bar | Invalid date
2   | baz | Invalid date

if we don't do any conversions. But the asset

@asset
def time_asset() -> pd.DataFrame:
    time_df = pd.DataFrame(
        {
            "foo": ["bar", "baz"],
            "date": [
                pd.Timestamp("2017-01-01T12:30:45+00:00"),
                pd.Timestamp("2017-02-01T12:30:45+00:00"),
            ],
        }
    )
    return time_df

is stored as

Row | FOO | DATE
1   | bar | 2017-01-01 12:30:45.000
2   | baz | 2017-02-01 12:30:45.000

Funnily enough, the type of the date column in this case is also TIMESTAMP_NTZ.

Alternatively, could we automatically convert them to UTC?

Yeah, we could do that. I think it still feels a bit similar to converting to strings, because we're modifying user data without telling them, but it's less severe than a full type conversion, so that's nice. We could document it and log a warning when we do a conversion, and maybe that's enough.
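
A minimal sketch of that convert-and-warn approach; the function name is illustrative, and this is a sketch rather than the actual implementation:

import logging

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

logger = logging.getLogger(__name__)


def localize_naive_timestamps_to_utc(df: pd.DataFrame) -> pd.DataFrame:
    # Attach UTC to timezone-naive datetime columns, warning so users know
    # their data was modified before being written to Snowflake.
    for col in df.columns:
        if is_datetime64_any_dtype(df[col]) and df[col].dt.tz is None:
            logger.warning("Column '%s' has no timezone; localizing to UTC.", col)
            df[col] = df[col].dt.tz_localize("UTC")
    return df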

@sryza (Contributor) commented Feb 13, 2023

Yeah, we could do that. I think it still feels a bit similar to converting to strings, because we're modifying user data without telling them, but it's less severe than a full type conversion, so that's nice. We could document it and log a warning when we do a conversion, and maybe that's enough.

One of the things that feels particularly bad about converting to strings is that, if you serialize a dataframe and then deserialize it, the deserialized one will have different types than the serialized one, right? Is there a way to maintain the property that what you put in is the same as what you get out?
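
One way to pin that property down is a dtype round-trip check; a minimal sketch, with store and load as hypothetical stand-ins for the I/O manager's write and read paths:

import pandas as pd


def assert_dtypes_round_trip(df: pd.DataFrame, store, load) -> None:
    # Verify that what you put in is what you get out, dtype-wise.
    store(df)
    loaded = load()
    pd.testing.assert_series_equal(df.dtypes, loaded.dtypes)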

@jamiedemaria (Contributor Author) commented:

I did a test of storing pandas Timestamp data with the current (master) version of the I/O manager, so we have documentation of how the types change:

@asset(
    key_prefix=["JAMIE"]
)
def pandas_time_asset(context) -> pd.DataFrame:
    time_df = pd.DataFrame(
        {
            "foo": ["bar", "baz"],
            "date": [
                pd.Timestamp("2017-01-01T12:30:45"),
                pd.Timestamp("2017-02-01T12:30:45"),
            ],
        }
    )
    context.log.info(time_df)
    context.log.info(time_df["date"])
    return time_df

@asset(
    key_prefix=["JAMIE"]
)
def downstream_pandas_time_asset(context, pandas_time_asset: pd.DataFrame) -> None:
    context.log.info(pandas_time_asset)
    context.log.info(pandas_time_asset["date"])

Before we return the dataframe, the dtype of the date column is datetime64[ns].
In Snowflake, pandas_time_asset is:

FOO | DATE
bar | 2017-01-01 12:30:45.000000
baz | 2017-02-01 12:30:45.000000

with column types

Column | Data Type
FOO  | VARCHAR(16777216)
DATE | VARCHAR(16777216)

In downstream_pandas_time_asset we load the dataframe and get

   foo            date
0  bar 2017-01-01 12:30:45
1  baz 2017-02-01 12:30:45

And the date column dtype is datetime64[ns].

So to the Dagster assets, the datatypes don't change, but if you ran your own queries on the Snowflake table, you would get string data.

@jamiedemaria (Contributor Author) commented:

Another interesting thing to note: the pandas BigQuery client auto-converts all non-timezone data to UTC.

@jamiedemaria (Contributor Author) commented:

I'd like to revive this, since the 1.2 release could be a good time for the breaking change. I think the three options are:

  1. Do nothing; keep storing the datetimes as strings.
  2. Throw an error when datetime data doesn't have a timezone.
  3. Convert all data without a timezone to UTC (maybe log a debug or warning message); this is what BigQuery does, for what it's worth.

If we go with 2 or 3, we could provide some guidance for migrating the table from string data to TIMESTAMP_NTZ data.

@sryza (Contributor) commented Feb 22, 2023

I think we should go with 1 or 3; only supporting timestamps with timezones would be too inconvenient for most users. I'm very unsure whether 1 or 3 is better. If we go with 3, what would happen to users who already have data stored as strings?

@jamiedemaria (Contributor Author) commented:

I'll put a test together to see what would happen to date data stored as strings. In the meantime, here's the message that kicked off looking into this issue again (basically, the user thought they were doing something wrong because their data was being stored as strings): https://dagster.slack.com/archives/C04EKVBA69Y/p1675295895479739

Not sure if this is enough justification for switching our implementation, but it's some context at least.

@jamiedemaria (Contributor Author) commented Mar 1, 2023

Back on this train! I finally got some time to run the test, and here are my findings:

If we do nothing to our code (i.e. keep the string conversion) and materialize a dataframe with time data, here's what we get:

  • in the asset, the column is of type datetime64[ns]
  • in Snowflake, the column is of type VARCHAR(16777216)
  • in a downstream asset, the column is of type datetime64[ns]

If we switch to converting non-timezone data to UTC and materialize a dataframe with time data, here's what we get:

  • in the asset, the column is of type datetime64[ns]
  • in Snowflake, the column is of type TIMESTAMP_NTZ(9)
  • in a downstream asset, the column is of type datetime64[ns]

If we switch to converting non-timezone data to UTC AND materialize an asset that was previously materialized using the string conversion scheme, here's what we get:

  • in the asset, the column is of type datetime64[ns]
  • in Snowflake, the column is of type VARCHAR(16777216) (but there is no error uploading the data)
  • in a downstream asset, the column is of type str

The last bullet point is what's most concerning, since the user will be expecting time data and will be getting string data.

Here's a sequence of SQL queries that will convert a VARCHAR time column to a TIMESTAMP_NTZ column (there's probably a more efficient way to do this, but my SQL is bad):

ALTER TABLE database.schema.table
ADD COLUMN time_copy TIMESTAMP_NTZ(9);

UPDATE database.schema.table
SET time_copy = to_timestamp_ntz(time);

ALTER TABLE database.schema.table
DROP COLUMN time;

ALTER TABLE database.schema.table
RENAME COLUMN time_copy TO time;
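
For anyone scripting the migration, a minimal sketch of running those statements with snowflake-connector-python; the connection parameters and fully qualified table name are placeholders:

import snowflake.connector

MIGRATION = [
    "ALTER TABLE database.schema.table ADD COLUMN time_copy TIMESTAMP_NTZ(9)",
    "UPDATE database.schema.table SET time_copy = to_timestamp_ntz(time)",
    "ALTER TABLE database.schema.table DROP COLUMN time",
    "ALTER TABLE database.schema.table RENAME COLUMN time_copy TO time",
]

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"  # placeholders
)
try:
    cur = conn.cursor()
    for stmt in MIGRATION:
        cur.execute(stmt)  # run each step of the VARCHAR -> TIMESTAMP_NTZ migration
finally:
    conn.close()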

@jamiedemaria (Contributor Author) commented:

@benpankow @sryza I'd like to figure out a path forward for this for 1.3 in case we need to make a breaking change. The options I can think of are as follows:

  1. Make no functional changes to how we handle timestamps. Add some explanation in the docs about the string conversion.
  2. Make a breaking change to convert all un-timezoned time data to UTC. Add an explanation in the docs and a migration guide for moving tables to the new system.
  3. Add a new configuration value that lets the user specify whether to do timezone conversion or string conversion. Add an explanation in the docs and a migration guide for moving tables to the new system.
    3a. Add a deprecation warning for string conversion and remove it in a future version.
    3b. No deprecation warning; continue to support both conversion strategies.

I'm inclined to go with option 2 or 3, since either lets us store the data as the user intended (i.e. as a timestamp), but if you have other opinions or concerns, let me know!

@mjclarke94 commented:
@jamiedemaria Is the pyarrow backend in Pandas 2.0 any help in resolving this?

@sryza (Contributor) commented Mar 21, 2023

Thanks for being persistent on this @jamiedemaria.

Why does adding the UTC timezone result in TIMESTAMP_NTZ(9)? I would have thought that data with a timezone would show up in Snowflake as TIMESTAMP_TZ?

@sryza (Contributor) commented Mar 21, 2023

If we switch to converting non-timezone data to UTC AND materialize an asset that was previously materialized using the string conversion scheme, here's what we get:

Would we want to detect this issue at write-time and error ("Must migrate!") so that users don't end up with incompatible strings in their table?

@jamiedemaria (Contributor Author) commented:

Why does adding the UTC timezone result in TIMESTAMP_NTZ(9)? I would have thought that data with a timezone would show up in Snowflake as TIMESTAMP_TZ?

No idea. TIMESTAMP_TZ is what I would have expected too. I'll see if I can find anything in the Snowflake docs, but chances are slim.

Would we want to detect this issue at write-time and error ("Must migrate!") so that users don't end up with incompatible strings in their table?

Yeah, I could probably inspect the column types of the table before writing and ensure that timestamp data goes to a timestamp column.
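
A minimal sketch of that write-time inspection, assuming a snowflake-connector-python connection; the helper name and error wording are placeholders:

def assert_column_accepts_timestamps(conn, table: str, column: str) -> None:
    # Inspect the existing table's column types before writing, and error out
    # if timestamp data would land in a non-timestamp (e.g. VARCHAR) column.
    cur = conn.cursor()
    try:
        cur.execute(f"DESCRIBE TABLE {table}")
        for row in cur.fetchall():
            name, col_type = row[0], row[1]  # DESCRIBE TABLE returns name, type, ...
            if name.upper() == column.upper() and not col_type.upper().startswith("TIMESTAMP"):
                raise ValueError(
                    f"Column {column} of table {table} has type {col_type}. Migrate it "
                    "to a TIMESTAMP type before storing timestamp data in it."
                )
    finally:
        cur.close()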

@jamiedemaria (Contributor Author) commented:

Re: storing as TIMESTAMP_NTZ instead of TIMESTAMP_TZ: it looks like this is something a lot of people see. There's a Snowflake issue open for it, snowflakedb/snowflake-sqlalchemy#199, that has been open for two years...

@jamiedemaria (Contributor Author) commented:

Merge conflicts made this branch super difficult to use, so I'm moving to #13097 with an implementation of what we've discussed here. Closing this PR.
