feat(output): export all outputs in parquet files #238

Merged
merged 4 commits into from
Jun 4, 2024


vincent-leblond
Contributor

@vincent-leblond vincent-leblond commented May 31, 2024

Add Parquet export for all outputs. All existing exports are kept; the new Parquet files are written in addition to them.
Non-geospatial Parquet files use the .parquet extension, and geospatial files use the .geoparquet extension.
The pyarrow package is added as a dependency.

Parquet is much faster to read than GeoPackage or CSV files. For example, reading the trips data for a 10% population sample of département 14 takes 25 seconds from GeoPackage and only 0.17 seconds from Parquet.
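
A read-time comparison like the one above can be reproduced with a small timing helper. This is a minimal sketch and not code from this PR: the helper name time_read and the file names in the comments are hypothetical, and the commented calls assume geopandas and pyarrow are installed.

```python
import time


def time_read(reader, path, repeats=1):
    """Return the best wall-clock time in seconds for calling reader(path)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)  # the loaded data is discarded, only the duration matters
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical usage (file names are placeholders, not taken from the PR):
#
#   import geopandas as gpd
#   t_gpkg = time_read(gpd.read_file, "trips.gpkg")
#   t_parquet = time_read(gpd.read_parquet, "trips.geoparquet")
#   print("GeoPackage: %.2f s, Parquet: %.2f s" % (t_gpkg, t_parquet))
```

Taking the best of several repeats reduces the influence of disk caching and other background noise on the measurement.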

@sebhoerl
Contributor

sebhoerl commented Jun 1, 2024

Hi Vincent, thanks for the PR. It looks like the dependencies in environment.yml are not consistent (see the failed unit test). I can find some time for it, but if you could figure it out, that would be great:

  • Create a fresh environment using the develop environment.yml
  • Install pyarrow in whatever version is proposed by conda, this may also upgrade some other dependencies
  • Use a tool of your choice, or simply conda env export --no-builds, to check the versions of pyarrow and of all other dependencies listed in environment.yml, and note their (potentially updated) versions in the new environment.yml
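
The last step above can be partly automated. The following is only a sketch under assumptions: it takes the lines of environment.yml and the lines printed by conda env export --no-builds, assumes dependencies are written as "- name=version", and uses hypothetical helper names (parse_pins, updated_pins) that do not exist in the repository.

```python
def parse_pins(lines):
    """Map package name -> pinned version from lines such as '  - pandas=1.5.3'."""
    pins = {}
    for line in lines:
        entry = line.strip()
        if not entry.startswith("- "):
            continue  # not a list item (e.g. 'name:' or 'dependencies:')
        spec = entry[2:].strip()
        if "=" not in spec:
            continue  # channel name, unpinned package, or nested 'pip:' section
        name, _, version = spec.partition("=")
        # lstrip handles pip-style 'name==version' entries as well
        pins[name.strip()] = version.lstrip("=").strip()
    return pins


def updated_pins(environment_lines, exported_lines, also=("pyarrow",)):
    """Return {name: (old_version, new_version)} for pins that need updating.

    Packages named in `also` are reported even when they are not yet
    listed in environment.yml (old_version is then None).
    """
    current = parse_pins(environment_lines)
    exported = parse_pins(exported_lines)
    changes = {}
    for name in list(current) + [n for n in also if n not in current]:
        new = exported.get(name)
        if new is not None and new != current.get(name):
            changes[name] = (current.get(name), new)
    return changes
```

Calling updated_pins(open("environment.yml").read().splitlines(), exported_text.splitlines()) would then list only the entries whose versions have to be changed in the new environment.yml.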

Second point, I think it would be great if this was configurable. I imagine something like:

config:
  [...]
  output_formats: ["csv", "parquet", "gpkg", "geoparquet"]

And it would be set to ["csv", "gpkg"] by default for now. This would (1) keep backwards compatibility and (2) allow users to switch completely to Parquet if they want, without writing duplicate outputs. So basically, this would just mean having an if around every output command, depending on what is in the list. It would be great if you could take a look at this; otherwise I can also find some time.
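
To illustrate the idea, here is a rough sketch of how the option could be resolved and used to decide which files get written. It is not the implementation that was merged: the function names, the prefix argument, and the assumption that geospatial tables map to gpkg/geoparquet while other tables map to csv/parquet are all hypothetical.

```python
DEFAULT_OUTPUT_FORMATS = ["csv", "gpkg"]  # default proposed above
SUPPORTED_FORMATS = {"csv", "parquet", "gpkg", "geoparquet"}


def resolve_output_formats(config):
    """Read output_formats from the config dict, falling back to the default."""
    formats = list(config.get("output_formats", DEFAULT_OUTPUT_FORMATS))
    unknown = [f for f in formats if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError("Unsupported output formats: %s" % ", ".join(unknown))
    return formats


def planned_outputs(prefix, name, formats, geospatial):
    """Return the file names that would be written for one output table.

    Assumption: geospatial tables use gpkg/geoparquet, all others csv/parquet.
    """
    candidates = ("gpkg", "geoparquet") if geospatial else ("csv", "parquet")
    # This is the "if around every output command", expressed as a filter.
    return ["%s%s.%s" % (prefix, name, ext) for ext in candidates if ext in formats]
```

With an empty config, planned_outputs("prefix_", "trips", resolve_output_formats({}), geospatial=False) yields only the CSV file, which matches the backwards-compatible default. Setting output_formats to ["parquet", "geoparquet"] would switch to Parquet only.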

@vincent-leblond
Contributor Author

Hi Sebastian,
I will follow your advice and then come back to you.

@vincent-leblond
Contributor Author

Changing the pyarrow version to an older one seems to be enough.

@sebhoerl
Contributor

sebhoerl commented Jun 4, 2024

Looks good, thanks a lot :)

@sebhoerl sebhoerl merged commit d90d93e into eqasim-org:develop Jun 4, 2024
2 checks passed
@vincent-leblond vincent-leblond deleted the feat/output_format branch June 4, 2024 06:47
2 participants