Figure out how to convert bcerror files to parquet #11

Open
jayhesselberth opened this issue Jun 27, 2024 · 0 comments
TL;DR: use Parquet instead of CSV. At a minimum, compress bcerror CSVs with gzip before adding them to the repo.

Parquet files are much more disk-efficient and faster to parse than CSV. Not high priority, but this would be useful to incorporate into remora pipelines where we just want per-base stats.

Might be as simple as:

library(readr)
library(nanoparquet)

# convert the CSV to parquet
write_parquet(read_csv("file.csv"), "file.parquet")

# then inspect to compare sizes on disk
file.info("file.csv")
file.info("file.parquet")

# reload file in subsequent analyses
read_parquet("file.parquet")

Could also combine multiple CSVs into one Parquet file, with a column for sample name.
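
A rough sketch of the combined version, assuming one CSV per sample, named `<sample>.csv`, all in a single directory and all with identical columns. The function name `combine_bcerror()`, the directory layout, and the `sample` column name are placeholders for illustration, not anything that exists in the repo. It relies on `read_csv()` accepting a vector of paths together with its `id` argument (readr 2.0 or later).

```r
library(readr)
library(nanoparquet)

# Hypothetical helper: read every CSV in `csv_dir`, tag each row with the
# sample it came from, and write a single parquet file to `out_file`.
combine_bcerror <- function(csv_dir, out_file) {
  csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

  # `id = "sample"` adds a column holding the source path of each row
  combined <- read_csv(csv_files, id = "sample", show_col_types = FALSE)

  # Reduce the full path to a bare sample name, e.g. "dir/s1.csv" -> "s1"
  combined$sample <- tools::file_path_sans_ext(basename(combined$sample))

  write_parquet(combined, out_file)
  invisible(combined)
}
```

Usage would be something like `combine_bcerror("bcerror", "bcerror.parquet")`, after which `read_parquet("bcerror.parquet")` returns all samples in one table that can be filtered on `sample`.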
