Introduce top-level `read_opus()` #27

philipp-baumann · 2022-09-06T09:57:46Z

ThomasKnecht · 2022-09-06T17:24:22Z

pierreroudier · 2022-09-08T21:07:06Z

I think this is a great idea.

Ultimately, this is a "driver" package, which means, in my view, that most users would really only interact with one function, read_opus.

If I think of a prototype for the function, this is what it would look like:

res <- read_opus(
  # Pass a vector of DSN, can be files or raw sources, can be more than one
  dsn = my_dsn, 
  # Option to "simplify" the output when more than 1 DSN are passed to `dsn`. 
  # If used an "analysis-ready" matrix is returned. Similar to the `simplify` option in {opusreader}
  matrix = FALSE, 
  # An option to limit the data returned to a specific type. 
  # When working with a lot of files and you only really want ScSm for example.
  type = NULL, 
  # Use parallel read when more than 1 DSN are passed to `dsn`
  parallel = TRUE,
  # Something lower level to optionally tweak chunk size etc
  # For power users only
  parallel.options = list(chunk.size = 10, ...),  
  # Print a progress bar if more than 1 DSN are passed to `dsn`
  progress = TRUE 
)

Something that is needed at the moment is to provide clarity about the different elements of data returned by the driver. Eg most users would likely be after the data contained in x$refl_no_atm_comp$data, but it's obviously not looking trivial at the moment.

ThomasKnecht · 2022-09-09T04:53:59Z

you are right, the output is kind of an issue. i think @philipp-baumann thought about a filter option. but i see what you mean, we could add a param = T/F option to get the full output or just the data...
i like your idea of returning a matrix if multiple files are read :)

philipp-baumann · 2022-09-09T05:48:46Z

Yes, I imagine a short iteration of the above:

Because read_opus() is high-level, we need a few short and concise arguments. These unleash the power in straight-forward manner, both for simple and quite advanced use cases. In that sense it is worth to invest some time in design thinking.

On individual measurements, there is so-called "in"-filters and "out_filters", specified with filter_in and filter_out. In @pierreroudier 's handy filter above would be in the out_filter class. There is many of these filters, but they will receive an extensive and simple description.

To grab specific elements, like x$refl_no_atm_comp$data, we have already thought of an out filter wrapper.

What we also have to consider is a datatype_out option. This is essentially the counterpart driver for dsn. @pierreroudier @ThomasKnecht do you have a better name for it? I imagine writing to .Rds, standardized json but also a dedicated matrix-attribute like in-memory spectral data structure in R. For the latter, we could collaborate with @l-ramirez-lopez (ping :)).

Regarding chunking: I would go with the registered cores. Parallel resource management should be done by the user. We can further simplify the function interface by adding one main parallel argument, like parallel_args, that are individually dispatched.

philipp-baumann · 2022-09-09T05:58:45Z

so a short update of the proposal, although we probably need to update again before we get to stable:

res <- read_opus(
  # Pass a vector of DSN, can be files or raw sources, can be more than one
  dsn = my_dsn,
  filter_in = my_named_list_of_filters_in,
  filter_out = my_named_list_of_filters_out,
  datatype_out = my_list_of_datatypes_out,
  parallel_args = my_named_list_of_parallel_args,
  progress_bar = FALSE
)

and with some more special features:

filter_in: includes reprex filters on directory and file names to be read. in addition, there is also time_from and time_to that accepts posix-style datetime to filter specific measurements done during a time period.

philipp-baumann · 2022-09-09T06:03:09Z

you are right, the output is kind of an issue. i think @philipp-baumann thought about a filter option. but i see what you mean, we could add a param = T/F option to get the full output or just the data... i like your idea of returning a matrix if multiple files are read :)

Let's put this param early in the arg list for filter_out.

philipp-baumann · 2022-09-10T08:04:24Z

filter_out

Another iteration, based on the discussion with @ThomasKnecht . I think the previous draft was a bit too complicated

res <- read_opus(
  dsn = my_dsn,
  output_datatype,
  output_path = NULL,
  parallel = FALSE,
  progress_bar = FALSE
)

ThomasKnecht · 2022-09-10T08:10:05Z

instead of ouput_datatype -> data_only = F

philipp-baumann · 2022-09-10T08:12:02Z

+1 for this Mr. Propper.

philipp-baumann · 2022-09-10T08:13:02Z

res <- read_opus(
  dsn = my_dsn,
  data_only = FALSE, # default is all blocks, `TRUE` only last spectra
  output_path = NULL, # driver for the output
  parallel = FALSE,
  progress_bar = FALSE
)

pierreroudier · 2022-09-12T00:47:33Z

It;s unclear to me what output_path means

philipp-baumann · 2022-09-12T01:00:43Z

It;s unclear to me what output_path means

6936623

output_path optional storage path for the parsed output. Default is NULL, which only returns the parsed data as an in-memory R object.

philipp-baumann · 2022-09-12T10:23:25Z

@pierreroudier do you have a better name suggestion than output_path?

ThomasKnecht · 2022-09-12T18:36:08Z

I implemented the data_only as well as the parallel parts

@philipp-baumann @pierreroudier what do you think of that?

the data_only I also implemented in the read_opus_impl -> I think this is a nice to have in the base functionality anyways...

I am not so happy about my implementation of the data_only. it seems rather complicated... maybe you have a nicer suggestion?

ThomasKnecht · 2022-09-12T18:36:41Z

I also fixed a bug in the prep_spectra function ;)

philipp-baumann · 2022-09-12T20:26:37Z

I implemented the data_only as well as the parallel parts

@philipp-baumann @pierreroudier what do you think of that?

the data_only I also implemented in the read_opus_impl -> I think this is a nice to have in the base functionality anyways...

I am not so happy about my implementation of the data_only. it seems rather complicated... maybe you have a nicer suggestion?

Awesome, thanks for this! short comments:

we will need to do chunked, reading, so I will add some chunking logic to it (when 1 file takes about 20ms to read, the overhead is other wise too high that this has any realistic benefit
I will think about data_only and come to you later this week.

pierreroudier · 2022-09-12T21:46:57Z

Sorry, I am still a bit confused by output_path, is that an option to write the data to disk? In which case, what format would be used to write to file(s)?

I'd be tempted to leave such a functionality outside read_opus, but maybe I'm missing something (large collections that don;t fit into memory?).

For data_only, what is the type returned? Matrix, or still a (trimmed) list?

pierreroudier · 2022-09-12T22:00:56Z

Opinionated comment on the arg names -- I like to keep them short, eg progress rather than progress_bar, maybe folder rather than output_path, etc.

philipp-baumann · 2022-09-13T05:56:15Z

Sorry, I am still a bit confused by output_path, is that an option to write the data to disk? In which case, what format would be used to write to file(s)?

I'd be tempted to leave such a functionality outside read_opus, but maybe I'm missing something (large collections that don;t fit into memory?).

For data_only, what is the type returned? Matrix, or still a (trimmed) list?

Maybe it was a bit many things in one issue, or respectively the merge that followed ;-) Yes indeed, output_path is an option to write data to disk. This could be for example in .Rds format, or in .json; the latter needs another transformation module. Another option is having a zarr-backend, which would mean a chunked n-dimensional array that represents all spectral data. I would like to have such a feature. If we stream data somewhere and have some quality control pipeline afterwards, we can at some point work in cloud environments such as s3 storages etc. At the moment I am not thinking about big data with such simple spectra, but I have sometimes project directories with few thousand OPUS files to be parsed.

For data_only, at the moment it is still a list. We have to think about some good logic. @pierreroudier @ThomasKnecht shall we imply with data_only a matrix, which might be a good simplification in one argument?

pierreroudier · 2022-09-13T22:01:48Z

Maybe it was a bit many things in one issue, or respectively the merge that followed ;-) Yes indeed, output_path is an option to write data to disk. This could be for example in .Rds format, or in .json; the latter needs another transformation module. Another option is having a zarr-backend, which would mean a chunked n-dimensional array that represents all spectral data.

Ok, got it. I think it would be best to develop the feature as a dedicated function, then once it's properly designed, consider its integration in read_opus. It's an advanced feature.

For data_only, at the moment it is still a list. We have to think about some good logic. @pierreroudier @ThomasKnecht shall we imply with data_only a matrix, which might be a good simplification in one argument?

My argument for this one is that users will either want to have the full metadata (in which case a list shall be returned), or want to quickly assemble data for modelling, in which case a matrix shall be returned.

For the latter, there is code pretty much ready to go in the old {opusreader} package.

philipp-baumann · 2022-09-14T14:34:33Z

I generally like the matrix-support. To keep track of features and the development, I will move on with opening separate issue tickets. Like this it will be more simple to keep track and have a clean git workflow without too much wizardry.

ThomasKnecht · 2022-12-03T13:51:47Z

we removed the output_path. we want to make a separate function for writing the data into json files or whatever other format is possible

philipp-baumann · 2022-12-07T15:40:49Z

@pierreroudier , as @ThomasKnecht indicated above, the design decision we are making is to keep the interface of read_opus() as generic as possible and with lowest possible complexity in arguments. This will enable us to have a rock solid interface that does not change over time, that serves its purpose for a broad community of users and contributors. Over time, there will be for sure options to customize specific wishes in helper functions, that are easier to maintain and are more specific to a particular spectroscopy workflow of individual organizations and companies.

philipp-baumann · 2022-12-07T15:50:37Z

Such extended functionalities are for example #36 , #44 .
Closing because read_opus() and functions at various levels have been split by a new modular design with #47

philipp-baumann · 2022-12-07T15:51:13Z

see #47 for the solution

philipp-baumann changed the title ~~Introduce top-level read_opus~~ Introduce top-level read_opus() Sep 6, 2022

philipp-baumann assigned ThomasKnecht Sep 6, 2022

philipp-baumann pushed a commit that referenced this issue Sep 12, 2022

update #27

1d00cca

philipp-baumann mentioned this issue Sep 14, 2022

🧱 Helper to extract final spectra as matrix #36

Open

philipp-baumann closed this as completed Dec 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce top-level `read_opus()` #27

Introduce top-level `read_opus()` #27

philipp-baumann commented Sep 6, 2022

ThomasKnecht commented Sep 6, 2022 •

edited

Loading

pierreroudier commented Sep 8, 2022

ThomasKnecht commented Sep 9, 2022

philipp-baumann commented Sep 9, 2022 •

edited

Loading

philipp-baumann commented Sep 9, 2022 •

edited

Loading

philipp-baumann commented Sep 9, 2022 •

edited

Loading

philipp-baumann commented Sep 10, 2022

ThomasKnecht commented Sep 10, 2022

philipp-baumann commented Sep 10, 2022

philipp-baumann commented Sep 10, 2022 •

edited

Loading

pierreroudier commented Sep 12, 2022

philipp-baumann commented Sep 12, 2022 •

edited

Loading

philipp-baumann commented Sep 12, 2022

ThomasKnecht commented Sep 12, 2022

ThomasKnecht commented Sep 12, 2022

philipp-baumann commented Sep 12, 2022

pierreroudier commented Sep 12, 2022

pierreroudier commented Sep 12, 2022

philipp-baumann commented Sep 13, 2022 •

edited

Loading

pierreroudier commented Sep 13, 2022

philipp-baumann commented Sep 14, 2022

ThomasKnecht commented Dec 3, 2022

philipp-baumann commented Dec 7, 2022

philipp-baumann commented Dec 7, 2022

philipp-baumann commented Dec 7, 2022 •

edited

Loading

Introduce top-level read_opus() #27

Introduce top-level read_opus() #27

Comments

philipp-baumann commented Sep 6, 2022

ThomasKnecht commented Sep 6, 2022 • edited Loading

pierreroudier commented Sep 8, 2022

ThomasKnecht commented Sep 9, 2022

philipp-baumann commented Sep 9, 2022 • edited Loading

philipp-baumann commented Sep 9, 2022 • edited Loading

philipp-baumann commented Sep 9, 2022 • edited Loading

philipp-baumann commented Sep 10, 2022

ThomasKnecht commented Sep 10, 2022

philipp-baumann commented Sep 10, 2022

philipp-baumann commented Sep 10, 2022 • edited Loading

pierreroudier commented Sep 12, 2022

philipp-baumann commented Sep 12, 2022 • edited Loading

philipp-baumann commented Sep 12, 2022

ThomasKnecht commented Sep 12, 2022

ThomasKnecht commented Sep 12, 2022

philipp-baumann commented Sep 12, 2022

pierreroudier commented Sep 12, 2022

pierreroudier commented Sep 12, 2022

philipp-baumann commented Sep 13, 2022 • edited Loading

pierreroudier commented Sep 13, 2022

philipp-baumann commented Sep 14, 2022

ThomasKnecht commented Dec 3, 2022

philipp-baumann commented Dec 7, 2022

philipp-baumann commented Dec 7, 2022

philipp-baumann commented Dec 7, 2022 • edited Loading

Introduce top-level `read_opus()` #27

Introduce top-level `read_opus()` #27

ThomasKnecht commented Sep 6, 2022 •

edited

Loading

philipp-baumann commented Sep 9, 2022 •

edited

Loading

philipp-baumann commented Sep 9, 2022 •

edited

Loading

philipp-baumann commented Sep 9, 2022 •

edited

Loading

philipp-baumann commented Sep 10, 2022 •

edited

Loading

philipp-baumann commented Sep 12, 2022 •

edited

Loading

philipp-baumann commented Sep 13, 2022 •

edited

Loading

philipp-baumann commented Dec 7, 2022 •

edited

Loading