EIA-930 DATETIME_UTC data format issue using Snowflake #3937
@zaneselvans just showed me via email that pandas pd.read_parquet() is producing correct datetime output, so there is likely something wrong with my use of Snowflake.
Hey, that's definitely weird. Have you been able to read datetime columns from other Parquet files without any issues? What happens if you try using the `pyarrow` engine? Can you select the maximum value of `datetime_utc`? Are all of the years crazy? Or is it just a few of them?

Here's the code I used which gave the expected results:

```python
import pandas as pd

ops = pd.read_parquet(
    "s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet",
)
ops.datetime_utc.max()
# Timestamp('2024-08-01 07:00:00')
```

And using PyArrow:

```python
ops = pd.read_parquet(
    "s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
)
ops.datetime_utc.max()
# datetime.datetime(2024, 8, 1, 7, 0)
```

Distribution and set of years looks right with pandas:

```python
ops.datetime_utc.dt.year.value_counts().sort_index().plot(kind="bar")
```
Maybe it has something to do with the precision of the timestamp:

```python
ops.datetime_utc.dtype
# timestamp[ms][pyarrow]
```
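One way to confirm the on-disk precision independently (a minimal sketch, not from the original thread; it assumes a locally downloaded copy of the file) is to read just the Parquet schema with pyarrow:

```python
import pyarrow.parquet as pq

# Read only the schema, not the data, to check the stored timestamp unit.
schema = pq.read_schema("core_eia930__hourly_operations.parquet")
print(schema.field("datetime_utc").type)
# Expected to show a millisecond unit, e.g. timestamp[ms, tz=UTC]
```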
It seems that all of the years in my data are weird:

[attached screenshot of the year distribution not captured here]
My best guess is it has to do with the timestamp being stored as an integer number of milli/micro/nanoseconds since t=0, and you need to somehow tell it which precision to use when interpreting the integer. But it also really seems like it should be smart enough to do that itself given the Parquet dtype metadata.
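To illustrate that guess (a hypothetical sketch, not from the original thread), the same raw integer lands in 2024 or in the year 56553 depending on which unit it is interpreted with:

```python
import numpy as np

raw = 1722495600000  # milliseconds since the Unix epoch

np.datetime64(raw, "ms")  # numpy.datetime64('2024-08-01T07:00:00') -- correct
np.datetime64(raw, "s")   # numpy.datetime64('56553-09-17T16:00:00') -- same integer read as seconds
```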
Yeah, looking at your biggest "years" they're very close to 1000x further from 1970 than the present day is, so my guess is there's a milli/micro/nano conversion error somewhere on the way from Parquet to Snowflake. Try dividing the timestamp by 1000 and see what dates you get.
So I think Snowflake is getting milliseconds and interpreting them as seconds.

```python
thousander = ops["datetime_utc"].astype(int) * 1000
thousander.astype("datetime64[ms]").min()
# Timestamp('47466-05-16 08:00:00')
thousander.astype("datetime64[ms]").max()
# Timestamp('56553-09-17 16:00:00')
```
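Going the other way (a sketch, not from the thread): dividing those mis-scaled integers by 1000 before interpreting them as milliseconds recovers the expected dates, which is exactly the division suggested above.

```python
# Undo the simulated seconds-vs-milliseconds mixup from `thousander` above.
recovered = (thousander // 1000).astype("datetime64[ms]")
recovered.max()
# Timestamp('2024-08-01 07:00:00')
```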
@RyanMasson I'm going to go ahead and close this since I'm pretty sure we found the issue and it's (thankfully!) not in our data.
Describe the bug
I am working in a trial version of the Snowflake web app to explore EIA-930 data. After loading EIA-930 data into Snowflake from one of the Parquet files hosted in the PUDL S3 bucket (specifically the hourly_operations table from the nightly build path s3://pudl.catalyst.coop/nightly/core_eia930__hourly_operations.parquet), I am seeing entries in the DATETIME_UTC column that look like the following examples:

```
50467-06-17 01:00:00.000
50467-07-28 17:00:00.000
50467-09-08 09:00:00.000
50467-10-20 01:00:00.000
50467-11-30 16:00:00.000
50468-01-11 08:00:00.000
```
It appears that the YYYY (year) part of the standard UTC datetime format has been altered.
Bug Severity
I would like to sort this time series by date to look at BPA area electricity generation over time, but this datetime issue is preventing me from doing that.
To Reproduce
This is a snippet from my Snowflake SQL worksheet that reproduces the behavior. It loads and then queries all 5 million+ rows of the hourly_operations table, and it has to be run inside Snowflake. I have not tried loading this particular Parquet file in another way to inspect the DATETIME_UTC column for comparison.
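One lightweight way to make that comparison without Snowflake (a sketch, not from the thread; it assumes the duckdb Python package and a local copy of the file) is to run similar SQL against the Parquet file with DuckDB:

```python
import duckdb

# Query the same Parquet file directly to compare the DATETIME_UTC range.
con = duckdb.connect()
print(con.sql("""
    SELECT min(datetime_utc), max(datetime_utc)
    FROM 'core_eia930__hourly_operations.parquet'
""").fetchall())
```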
Expected behavior
I expected the DATETIME_UTC column produced by the final SELECT statement to conform to 'YYYY-MM-DD HH', but the YYYY part seems corrupted. It could be from some step in how I am accessing the data, or from the file itself; I am not sure.
Software Environment?
This is happening when accessing Nightly Build data through PUDL's AWS S3 API, using Snowflake. I did not install PUDL on my machine.