Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat hive partitioning #41

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Feat hive partitioning #41

wants to merge 2 commits into from

Conversation

martibosch
Copy link
Collaborator

Prework

What kind of change does this PR introduce? (check at least one)

  • Bugfix
  • Feature
  • Documentation
  • Tutorial
  • Code style update
  • Refactor
  • Build-related changes
  • Other, please describe:

Does this PR introduce a breaking change? (check one)

  • Yes
  • No

If yes, please describe the impact and communicate accordingly:

The PR fulfills these requirements:

  • It's submitted to the branch named as follow:
    • Fix a bug: bugfix-<some_key>-<word>
    • Improve the doc: doc-<some_key>-<word>
    • Improve a tutorial tutorial-<some_key>-<word>
    • Add a new feature: feature-<some_key>-<word>
    • Refactor some code: refactor-<some_key>-<word>
    • Optimize some code: optimize-<some_key>-<word>
  • When resolving a specific issue, it's referenced in the PR's title (e.g. fix #xxx[,#xxx], where "xxx" is the issue number)
  • Don't forget to link PR to issue if you are solving one.
  • All tests are passing.
  • New/updated tests are included

If adding a new feature, the PR's description includes:

  • A convincing reason for adding this feature (to avoid wasting your time, it's best to open a suggestion issue first and wait for approval before working on it)

Other information:

Related GitHub issues and pull requests

  • Ref: #

Summary

Please explain the purpose and scope of your contribution.

@martibosch martibosch mentioned this pull request Jun 24, 2024
1 task
@martibosch
Copy link
Collaborator Author

martibosch commented Jun 24, 2024

This PR forces the id/var dir names to adopt a hive partitioning scheme - should it be optional?

For the partitions, hive partitioning seems to be enforced/hardcoded at https://github.com/ltelab/tstore/blob/feat-hive-partitioning/tstore/archive/ts/writers/pyarrow.py#L77

@martibosch martibosch force-pushed the feat-hive-partitioning branch 2 times, most recently from c29b057 to d50c06f Compare June 26, 2024 08:28
@martibosch
Copy link
Collaborator Author

I fixed some issues and rebased this to have a proper PR with this feature only. Now we can review it. I wonder: is it worth making the hive scheme optional at this point? I suggest we move forward with hive only. We may consider supporting futher schemes later on.

@ghiggi
Copy link
Contributor

ghiggi commented Jun 26, 2024

I will review this tomorrow or Friday @martibosch.

But as quick thought I would not enforce "store_id=1" and "variable=ts_variable" to have hive partitioning and neither the time series partitioning. In the case of not using hive partitioning for time series objects, if we want to enable the time filter function we still need to implement the listing of parquet files based on the partitioning info included in the TSTORE YAML file.

Two further considerations.

A TS object / partitioned parquet dataset is readable in whatever language-agnostic dataframe/query engine supporting reading parquet file.

A TSTORE directory structure with hive partitioning is not readable:

  • in LONG- format if a TS object contains more than a variable (is not a series but a dataframe ...)
  • it might not be neither readable in LONG-format even if the TS object contains a single variable. This should be tried out using i.e. duckdb or pyarrow.dataset ... for cases where a variable occurs in some store_id and not in others. It might well be that pyarrow/duckdb just infer the dataframe columns from the first store_id directory ;)

@martibosch
Copy link
Collaborator Author

ok I can make it optional, but I understand that for now we still leave the "hive" time partitioning hardcoded at https://github.com/ltelab/tstore/blob/feat-hive-partitioning/tstore/archive/ts/writers/pyarrow.py#L77 ?

@martibosch
Copy link
Collaborator Author

I amended the first commit in order to try to make this work not only for tslong but also for tsdf write/load.

@martibosch
Copy link
Collaborator Author

I have added a second commit drafting what I understand should be the rationale f the id_var argument. I am probably overthinking stuff, but from the "TODO" in https://github.com/ltelab/tstore/blob/main/tstore/tests/conftest.py#L236, I suppose that the use of id_var in a TSDF is not clear anyway.

@martibosch
Copy link
Collaborator Author

martibosch commented Jun 26, 2024

Sorry again for overthinking and for the likely premature optimization, but this is probably a good point to consider whether we need the time_var argument in the TSDF.

@martibosch
Copy link
Collaborator Author

Once the above issues are clear we can see how we make the id and var-level hive scheme optional, e.g., allow paths of the form my-tstore/1/temperature/year=2021/part.parquet (where 1 is an id value).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants