Implement TimeStampXXXTZVector for parquet isAdjustedToUTC timestamp columns #926

Open
laurentperez opened this issue Oct 17, 2024 · 0 comments


laurentperez commented Oct 17, 2024

Hi. I will open a PR for this.

The Python script in (1) generates a Parquet file with timestamp columns in us, ns, and ms precision, all adjusted to UTC.

Reading that file with #577 fails: org.jetbrains.kotlinx.dataframe.io.ArrowReadingImplKt#readField throws NotImplementedError("reading from TimeStampXXXTZVector is not implemented").

This PR implements TimeStampXXXTZVector following https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Please note that the Parquet file format has no seconds precision, only MILLIS, MICROS, or NANOS, so my implementation of TimeStampSecTZVector derives seconds from milliseconds.
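To make the unit handling concrete, here is a minimal Python sketch of the arithmetic involved. It is not the Kotlin code from the PR; the helper names (`split_epoch`, `seconds_from_millis`, `to_utc_datetime`) are hypothetical, and only the unit names MILLIS, MICROS, and NANOS come from the Parquet spec.

```python
from datetime import datetime, timezone

# Ticks per second for each Parquet timestamp unit (there is no SECONDS unit).
UNITS_PER_SECOND = {"MILLIS": 1_000, "MICROS": 1_000_000, "NANOS": 1_000_000_000}


def split_epoch(value: int, unit: str) -> tuple[int, int]:
    """Split an INT64 epoch offset into (whole seconds, nanosecond remainder)."""
    per_second = UNITS_PER_SECOND[unit]
    # divmod floors, so values before 1970 still get a non-negative remainder.
    seconds, fraction = divmod(value, per_second)
    return seconds, fraction * (1_000_000_000 // per_second)


def seconds_from_millis(millis: int) -> int:
    """Derive a seconds value from milliseconds, dropping the sub-second part."""
    return split_epoch(millis, "MILLIS")[0]


def to_utc_datetime(value: int, unit: str) -> datetime:
    """Build a UTC datetime; Python stops at microseconds, so extra digits are dropped."""
    seconds, nanos = split_epoch(value, unit)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base.replace(microsecond=nanos // 1_000)


print(seconds_from_millis(1704110400123))
print(to_utc_datetime(1704110400123456, "MICROS").isoformat())
```

Floor division is used rather than truncation so that a value such as -1 ms maps to second -1 with a 999 ms remainder instead of second 0.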

With this code applied to my local checkout of #577, the us, ns, and ms precisions are read correctly:

    @Test
    fun testReadTimestamp() {
        val frame = DataFrame.readParquet(
            URL("file:/home/lperez/Bureau/work/pocs/lavaret/python/timestamps_with_utc_and_local.parquet")
        )
        val columnTypes = frame.columnTypes()
        println("columnTypes: $columnTypes")
        println(frame)
    }
columnTypes: [kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime]
                timestamp_utc            timestamp_local         timestamp_brussels               timestamp_nanos        timestamp_millis
 0 2024-01-01T12:00:00.123456 2024-01-01T12:00:00.123456 2024-01-01T11:00:00.123456 2024-01-01T12:00:00.123456789 2024-01-01T12:00:00.123
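The timestamp_brussels value reads as 11:00 rather than 12:00 because a UTC-adjusted column stores an instant, and the reader above renders it as a UTC wall-clock LocalDateTime. The Python sketch below reproduces that normalization. It assumes a fixed UTC+01:00 offset for Brussels on that date (CET in January), and `to_utc_wall_clock` is a hypothetical helper, not part of the PR.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Brussels is at UTC+01:00 on 2024-01-01 (no daylight saving in January).
BRUSSELS_WINTER = timezone(timedelta(hours=1))


def to_utc_wall_clock(value: datetime) -> datetime:
    """Normalize an aware datetime to UTC, then drop the zone (LocalDateTime-like)."""
    return value.astimezone(timezone.utc).replace(tzinfo=None)


brussels_noon = datetime(2024, 1, 1, 12, 0, 0, 123456, tzinfo=BRUSSELS_WINTER)
utc_view = to_utc_wall_clock(brussels_noon)
print(utc_view.isoformat())  # 2024-01-01T11:00:00.123456
```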

(1)

zsh 10474  (git)-[main]-% python3 create-timestamp-parquet.py
shape: (1, 5)
┌─────────────────────┬─────────────────┬────────────────────────────────┬──────────────────────┬─────────────────────────────┐
│ timestamp_utc       ┆ timestamp_local ┆ timestamp_brussels             ┆ timestamp_nanos      ┆ timestamp_millis            │
│ ---                 ┆ ---             ┆ ---                            ┆ ---                  ┆ ---                         │
│ datetime[μs, UTC]   ┆ datetime[μs]    ┆ datetime[μs, UTC]              ┆ datetime[ns, UTC]    ┆ datetime[ms, UTC]           │
╞═════════════════════╪═════════════════╪════════════════════════════════╪══════════════════════╪═════════════════════════════╡
│ 2024-01-01          ┆ 2024-01-01      ┆ 2024-01-01 11:00:00.123456 UTC ┆ 2024-01-01           ┆ 2024-01-01 12:00:00.123 UTC │
│ 12:00:00.123456 UTC ┆ 12:00:00.123456 ┆                                ┆ 12:00:00.123456789 … ┆                             │
└─────────────────────┴─────────────────┴────────────────────────────────┴──────────────────────┴─────────────────────────────┘

zsh 10345  (git)-[main]-% parquet-tools inspect timestamps_with_utc_and_local.parquet
############ Column(timestamp_brussels) ############
name: timestamp_brussels
path: timestamp_brussels
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
compression: ZSTD (space_saved: -26%)

############ Column(timestamp_nanos) ############
name: timestamp_nanos
path: timestamp_nanos
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: -26%)

############ Column(timestamp_millis) ############
name: timestamp_millis
path: timestamp_millis
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MILLIS
compression: ZSTD (space_saved: -26%)
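As a reading aid for the inspection output above, this Python sketch parses a `logical_type:` line and names the Arrow Java vector class one would expect for it. The mapping is an assumption on my part (UTC-adjusted columns land in the `TZ` vector variants), and `arrow_vector_for` is a hypothetical helper, not something parquet-tools or Arrow provides.

```python
import re

# Arrow Java vector name prefixes per Parquet time unit (mapping assumed, see above).
ARROW_PREFIX_BY_UNIT = {
    "milliseconds": "TimeStampMilli",
    "microseconds": "TimeStampMicro",
    "nanoseconds": "TimeStampNano",
}

# Matches the start of a parquet-tools logical_type line for timestamps.
LOGICAL_TYPE = re.compile(r"Timestamp\(isAdjustedToUTC=(true|false), timeUnit=(\w+)")


def arrow_vector_for(logical_type_line: str) -> str:
    """Return the expected Arrow vector class name for a timestamp logical type."""
    match = LOGICAL_TYPE.search(logical_type_line)
    if match is None:
        raise ValueError(f"not a timestamp logical type: {logical_type_line!r}")
    adjusted_to_utc = match.group(1) == "true"
    prefix = ARROW_PREFIX_BY_UNIT[match.group(2)]
    # UTC-adjusted columns carry a timezone, hence the TZ variant.
    return prefix + ("TZVector" if adjusted_to_utc else "Vector")


line = ("logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, "
        "is_from_converted_type=false, force_set_converted_type=false)")
print(arrow_vector_for(line))
```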

(1) create-timestamp-parquet.py:

import polars as pl
import pandas as pd
import pyarrow as pa

df = pl.DataFrame({
    "timestamp_utc": [
        pd.Timestamp('2024-01-01 12:00:00.123456', tz='UTC').to_pydatetime(),  # UTC timestamp
    ],
    "timestamp_local": [
        pd.Timestamp('2024-01-01 12:00:00.123456').to_pydatetime()  # Local timestamp without timezone
    ],
    "timestamp_brussels": [
        pd.Timestamp('2024-01-01 12:00:00.123456', tz='Europe/Brussels').tz_convert('UTC').to_pydatetime()  # Brussels time converted to UTC
    ],
    "timestamp_nanos": [
        '2024-01-01 12:00:00.123456789'
    ],
    "timestamp_millis": [
        '2024-01-01 12:00:00.123'
    ]
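}).with_columns(
    # Parse the nanosecond-precision string, then mark the result as UTC
    pl.col("timestamp_nanos").str.to_datetime("%F %X.%9f", time_unit="ns")
    .dt.replace_time_zone("UTC")
).with_columns(
    # Parse the millisecond-precision string, then mark the result as UTC
    pl.col("timestamp_millis").str.to_datetime("%F %X.%3f", time_unit="ms")
    .dt.replace_time_zone("UTC")
)
# Write all five columns to a single Parquet file
df.write_parquet("timestamps_with_utc_and_local.parquet")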


print(df)
laurentperez pushed a commit to laurentperez/dataframe that referenced this issue Oct 17, 2024
@laurentperez laurentperez changed the title Implement TimeStampMicroTZVector for parquet isAdjustedToUTC timestamp columns DRAFT:Implement TimeStampMicroTZVector for parquet isAdjustedToUTC timestamp columns Oct 18, 2024
@laurentperez laurentperez changed the title DRAFT:Implement TimeStampMicroTZVector for parquet isAdjustedToUTC timestamp columns DRAFT:Implement TimeStampXXXTZVector for parquet isAdjustedToUTC timestamp columns Oct 18, 2024
@laurentperez laurentperez changed the title DRAFT:Implement TimeStampXXXTZVector for parquet isAdjustedToUTC timestamp columns Implement TimeStampXXXTZVector for parquet isAdjustedToUTC timestamp columns Oct 18, 2024