Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas DateTime column timezone incorrectly converted at insertion #380

Open
0liu opened this issue Jun 2, 2023 · 4 comments
Open

Pandas DateTime column timezone incorrectly converted at insertion #380

0liu opened this issue Jun 2, 2023 · 4 comments

Comments

@0liu
Copy link

0liu commented Jun 2, 2023

Describe the bug
When inserting a pandas dataframe to a Clickhouse table, and a DateTime column in the table is defined with a timezone other than UTC, timezone-aware datatime columns in the pandas dataframe will be incorrectly localized and inserted as UTC time. The actual timestamps stored in the database are shifted from the original timestamps.

This could be due to the incorrect timezone localization and conversion in apply_timezones_before_write

            ts = pd.to_datetime(items).tz_localize(timezone)

        ts = ts.tz_convert('UTC')

The items returned from block.get_column_by_index should be UNIX timestamps and are always in UTC, but pd.to_datetime(items).tz_localize(timezone) localizes the timestamp to the timezone defined in the table column ('America/Chicago' in the example below) incorrectly, and then converts it back to UTC. If commenting out these two parts as below, the problem is gone:

            ts = pd.to_datetime(items)  #.tz_localize(timezone)

#        ts = ts.tz_convert('UTC')

The workaround works with datetime table column defined in UTC or other timezones.

To Reproduce

from clickhouse_driver import Client
client = Client('localhost', database='default')
client.execute(
"""
    CREATE TABLE IF NOT EXISTS debug_tbl 
    (ts DateTime64(9, 'America/Chicago')) 
    ENGINE=MergeTree() ORDER BY ts
""")

data = pd.DataFrame(
    [
        pd.Timestamp(t, tz='America/Chicago') 
        for t in ['2023-06-01T11:28:05.661537256', '2023-06-01T11:28:06.334573921', '2023-06-01T11:28:07.821988266']
    ], 
columns=['ts']
)

# Insert pandas dataframe with numpy support
client.insert_dataframe('INSERT INTO debug_tbl VALUES', data, settings={'use_numpy': True})

# query the inserted data
client.query_dataframe("select * from debug_tbl", settings={'use_numpy': False})

The returned query results are:

                                ts
0 2023-06-01 16:28:05.661537-05:00
1 2023-06-01 16:28:06.334574-05:00
2 2023-06-01 16:28:07.821988-05:00

But it should return

                                ts
0 2023-06-01 11:28:05.661537-05:00
1 2023-06-01 11:28:06.334574-05:00
2 2023-06-01 11:28:07.821988-05:00

In the command line, clickhouse-client shows the inserted data are not correct:

69aa60b1a9d2 :) select * from debug_tbl

SELECT *
FROM debug_tbl

Query id: 4229d695-9b96-40b3-b4c7-742dd6276cd4

┌────────────────────────────ts─┐
│ 2023-06-01 16:28:05.661537256 │
│ 2023-06-01 16:28:06.334573921 │
│ 2023-06-01 16:28:07.821988266 │
└───────────────────────────────┘

3 rows in set. Elapsed: 0.003 sec. 

Expected behavior
The timestamps in the query result above are expected to be the same as those in the source dataframe.

Versions

  • Version of package with the problem: 0.2.6
  • ClickHouse server version: 23.4.2.11 with official docker image.
  • Python version: 3.10.10
@xzkostyan
Copy link
Member

Is this similar with #257?

@0liu
Copy link
Author

0liu commented Jun 2, 2023

Is this similar with #257?

This issue leads to the incorrect data stored in database, while #257 is only about the read back format. With the #288 fix, this issue still exists.

Btw, I don't see #288 changes query results of query_dataframe. With settings={'use_numpy': False}, it always returns timezone aware column (truncated to microseconds), and with settings={'use_numpy': True} it returns naive datatime column of nanoseconds.

@xzkostyan
Copy link
Member

Okay.

Does pure insert and select without pandas leads to incorrect data?

@0liu
Copy link
Author

0liu commented Jun 2, 2023

Does pure insert and select without pandas leads to incorrect data?

Inserting plain Python datetime seems good.

tz = pytz.timezone('America/Chicago')
plain_data = [
    [tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 5, 661537))],
    [tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 6, 334573))],
    [tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 7, 821988))],
]
client.execute('INSERT INTO debug_tbl (ts) VALUES', plain_data)
69aa60b1a9d2 :) describe table debug_tbl

DESCRIBE TABLE debug_tbl

Query id: 6d8bb04d-fe9c-4931-80e8-3925c26b412a

┌─name─┬─type─────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ ts   │ DateTime64(9, 'America/Chicago') │              │                    │         │                  │                │
└──────┴──────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

1 row in set. Elapsed: 0.002 sec. 

69aa60b1a9d2 :) select * from debug_tbl

SELECT *
FROM debug_tbl

Query id: 619d500e-9a45-469f-8153-5360c3b8d2fe

┌────────────────────────────ts─┐
│ 2023-06-01 11:28:05.661537000 │
│ 2023-06-01 11:28:06.334573000 │
│ 2023-06-01 11:28:07.821988000 │
└───────────────────────────────┘

3 rows in set. Elapsed: 0.003 sec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants