Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce top-level read_opus() #27

Closed
philipp-baumann opened this issue Sep 6, 2022 · 25 comments
Closed

Introduce top-level read_opus() #27

philipp-baumann opened this issue Sep 6, 2022 · 25 comments
Assignees

Comments

@philipp-baumann
Copy link
Member

#25

@philipp-baumann philipp-baumann changed the title Introduce top-level read_opus Introduce top-level read_opus() Sep 6, 2022
@ThomasKnecht
Copy link
Collaborator

ThomasKnecht commented Sep 6, 2022

#28

@pierreroudier
Copy link

I think this is a great idea.

Ultimately, this is a "driver" package, which means, in my view, that most users would really only interact with one function, read_opus.

If I think of a prototype for the function, this is what it would look like:

res <- read_opus(
  # Pass a vector of DSN, can be files or raw sources, can be more than one
  dsn = my_dsn, 
  # Option to "simplify" the output when more than 1 DSN are passed to `dsn`. 
  # If used an "analysis-ready" matrix is returned. Similar to the `simplify` option in {opusreader}
  matrix = FALSE, 
  # An option to limit the data returned to a specific type. 
  # When working with a lot of files and you only really want ScSm for example.
  type = NULL, 
  # Use parallel read when more than 1 DSN are passed to `dsn`
  parallel = TRUE,
  # Something lower level to optionally tweak chunk size etc
  # For power users only
  parallel.options = list(chunk.size = 10, ...),  
  # Print a progress bar if more than 1 DSN are passed to `dsn`
  progress = TRUE 
)

Something that is needed at the moment is to provide clarity about the different elements of data returned by the driver. Eg most users would likely be after the data contained in x$refl_no_atm_comp$data, but it's obviously not looking trivial at the moment.

@ThomasKnecht
Copy link
Collaborator

you are right, the output is kind of an issue. i think @philipp-baumann thought about a filter option. but i see what you mean, we could add a param = T/F option to get the full output or just the data...
i like your idea of returning a matrix if multiple files are read :)

@philipp-baumann
Copy link
Member Author

philipp-baumann commented Sep 9, 2022

Yes, I imagine a short iteration of the above:

Because read_opus() is high-level, we need a few short and concise arguments. These unleash the power in straight-forward manner, both for simple and quite advanced use cases. In that sense it is worth to invest some time in design thinking.

  • On individual measurements, there is so-called "in"-filters and "out_filters", specified with filter_in and filter_out. In @pierreroudier 's handy filter above would be in the out_filter class. There is many of these filters, but they will receive an extensive and simple description.

To grab specific elements, like x$refl_no_atm_comp$data, we have already thought of an out filter wrapper.

What we also have to consider is a datatype_out option. This is essentially the counterpart driver for dsn. @pierreroudier @ThomasKnecht do you have a better name for it? I imagine writing to .Rds, standardized json but also a dedicated matrix-attribute like in-memory spectral data structure in R. For the latter, we could collaborate with @l-ramirez-lopez (ping :)).

Regarding chunking: I would go with the registered cores. Parallel resource management should be done by the user. We can further simplify the function interface by adding one main parallel argument, like parallel_args, that are individually dispatched.

@philipp-baumann
Copy link
Member Author

philipp-baumann commented Sep 9, 2022

so a short update of the proposal, although we probably need to update again before we get to stable:

res <- read_opus(
  # Pass a vector of DSN, can be files or raw sources, can be more than one
  dsn = my_dsn,
  filter_in = my_named_list_of_filters_in,
  filter_out = my_named_list_of_filters_out,
  datatype_out = my_list_of_datatypes_out,
  parallel_args = my_named_list_of_parallel_args,
  progress_bar = FALSE
)

and with some more special features:

  • filter_in: includes reprex filters on directory and file names to be read. in addition, there is also time_from and time_to that accepts posix-style datetime to filter specific measurements done during a time period.

@philipp-baumann
Copy link
Member Author

philipp-baumann commented Sep 9, 2022

you are right, the output is kind of an issue. i think @philipp-baumann thought about a filter option. but i see what you mean, we could add a param = T/F option to get the full output or just the data... i like your idea of returning a matrix if multiple files are read :)

Let's put this param early in the arg list for filter_out.

@philipp-baumann
Copy link
Member Author

filter_out

Another iteration, based on the discussion with @ThomasKnecht . I think the previous draft was a bit too complicated

res <- read_opus(
  dsn = my_dsn,
  output_datatype,
  output_path = NULL,
  parallel = FALSE,
  progress_bar = FALSE
)

@ThomasKnecht
Copy link
Collaborator

instead of ouput_datatype -> data_only = F

@philipp-baumann
Copy link
Member Author

+1 for this Mr. Propper.

@philipp-baumann
Copy link
Member Author

philipp-baumann commented Sep 10, 2022

res <- read_opus(
  dsn = my_dsn,
  data_only = FALSE, # default is all blocks, `TRUE` only last spectra
  output_path = NULL, # driver for the output
  parallel = FALSE,
  progress_bar = FALSE
)

philipp-baumann pushed a commit that referenced this issue Sep 12, 2022
@pierreroudier
Copy link

It;s unclear to me what output_path means

@philipp-baumann
Copy link
Member Author

philipp-baumann commented Sep 12, 2022

It;s unclear to me what output_path means

6936623

output_path optional storage path for the parsed output. Default is NULL, which only returns the parsed data as an in-memory R object.

@philipp-baumann
Copy link
Member Author

@pierreroudier do you have a better name suggestion than output_path?

@ThomasKnecht
Copy link
Collaborator

I implemented the data_only as well as the parallel parts

@philipp-baumann @pierreroudier what do you think of that?

the data_only I also implemented in the read_opus_impl -> I think this is a nice to have in the base functionality anyways...

I am not so happy about my implementation of the data_only. it seems rather complicated... maybe you have a nicer suggestion?

@ThomasKnecht
Copy link
Collaborator

I also fixed a bug in the prep_spectra function ;)

@philipp-baumann
Copy link
Member Author

I implemented the data_only as well as the parallel parts

@philipp-baumann @pierreroudier what do you think of that?

the data_only I also implemented in the read_opus_impl -> I think this is a nice to have in the base functionality anyways...

I am not so happy about my implementation of the data_only. it seems rather complicated... maybe you have a nicer suggestion?

Awesome, thanks for this! short comments:

  • we will need to do chunked, reading, so I will add some chunking logic to it (when 1 file takes about 20ms to read, the overhead is other wise too high that this has any realistic benefit
  • I will think about data_only and come to you later this week.

@pierreroudier
Copy link

Sorry, I am still a bit confused by output_path, is that an option to write the data to disk? In which case, what format would be used to write to file(s)?

I'd be tempted to leave such a functionality outside read_opus, but maybe I'm missing something (large collections that don;t fit into memory?).

For data_only, what is the type returned? Matrix, or still a (trimmed) list?

@pierreroudier
Copy link

Opinionated comment on the arg names -- I like to keep them short, eg progress rather than progress_bar, maybe folder rather than output_path, etc.

@philipp-baumann
Copy link
Member Author

philipp-baumann commented Sep 13, 2022

Sorry, I am still a bit confused by output_path, is that an option to write the data to disk? In which case, what format would be used to write to file(s)?

I'd be tempted to leave such a functionality outside read_opus, but maybe I'm missing something (large collections that don;t fit into memory?).

For data_only, what is the type returned? Matrix, or still a (trimmed) list?

Maybe it was a bit many things in one issue, or respectively the merge that followed ;-) Yes indeed, output_path is an option to write data to disk. This could be for example in .Rds format, or in .json; the latter needs another transformation module. Another option is having a zarr-backend, which would mean a chunked n-dimensional array that represents all spectral data. I would like to have such a feature. If we stream data somewhere and have some quality control pipeline afterwards, we can at some point work in cloud environments such as s3 storages etc. At the moment I am not thinking about big data with such simple spectra, but I have sometimes project directories with few thousand OPUS files to be parsed.

For data_only, at the moment it is still a list. We have to think about some good logic. @pierreroudier @ThomasKnecht shall we imply with data_only a matrix, which might be a good simplification in one argument?

@pierreroudier
Copy link

Maybe it was a bit many things in one issue, or respectively the merge that followed ;-) Yes indeed, output_path is an option to write data to disk. This could be for example in .Rds format, or in .json; the latter needs another transformation module. Another option is having a zarr-backend, which would mean a chunked n-dimensional array that represents all spectral data.

Ok, got it. I think it would be best to develop the feature as a dedicated function, then once it's properly designed, consider its integration in read_opus. It's an advanced feature.

For data_only, at the moment it is still a list. We have to think about some good logic. @pierreroudier @ThomasKnecht shall we imply with data_only a matrix, which might be a good simplification in one argument?

My argument for this one is that users will either want to have the full metadata (in which case a list shall be returned), or want to quickly assemble data for modelling, in which case a matrix shall be returned.

For the latter, there is code pretty much ready to go in the old {opusreader} package.

@philipp-baumann
Copy link
Member Author

I generally like the matrix-support. To keep track of features and the development, I will move on with opening separate issue tickets. Like this it will be more simple to keep track and have a clean git workflow without too much wizardry.

@ThomasKnecht
Copy link
Collaborator

we removed the output_path. we want to make a separate function for writing the data into json files or whatever other format is possible

@philipp-baumann
Copy link
Member Author

@pierreroudier , as @ThomasKnecht indicated above, the design decision we are making is to keep the interface of read_opus() as generic as possible and with lowest possible complexity in arguments. This will enable us to have a rock solid interface that does not change over time, that serves its purpose for a broad community of users and contributors. Over time, there will be for sure options to customize specific wishes in helper functions, that are easier to maintain and are more specific to a particular spectroscopy workflow of individual organizations and companies.

@philipp-baumann
Copy link
Member Author

Such extended functionalities are for example #36 , #44 .
Closing because read_opus() and functions at various levels have been split by a new modular design with #47

@philipp-baumann
Copy link
Member Author

philipp-baumann commented Dec 7, 2022

see #47 for the solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants