Discussion on design patterns of (meta)data management and extraction #59

davidbenncsiro · 2022-12-16T12:33:46Z

Around August, there was a function called read_opus_file() in opusreader2.

This yielded what appeared to be correct sample data from files I have been testing with, such as the one attached: DECCW1013_A11.zip

I updated the package today and note that read_opus_file() does not exist now.

I tried to use read_opus() but this returns a different result.

Instead of result$spec$wavenumbers and result$spec$data we now have result$sc_sample$wavenumbers, result$sc_sample$data, result$sc_ref$wavenumbers, result$sc_ref$data and so on for different block types.

Should I do something like divide sc_sample by sc_ref data to yield absorbance data?

Also, at least in the case of sc_sample, the wavenumbers vector has a different length from the data and the data values don't match what I was seeing from result$spec$data with read_opus_file().

See also:

The text was updated successfully, but these errors were encountered:

davidbenncsiro · 2022-12-16T12:39:59Z

I also realise that there is a lot of active development.

ThomasKnecht · 2022-12-16T19:47:28Z

Hi David!
Thanks for the input.

I just parsed your file with:

test <- read_opus("inst/extdata/test_data/DECCW1013_A11.0")

The previous spec is now renamed to ab or ab_no_atm_comp
According to the parsing output of your file, the ab_no_atm_comp is contained in the file.

I add the screenshots of the output:

ThomasKnecht · 2022-12-16T19:49:55Z

What was the length of the sc_sample you got with the read_opus_file() function?

philipp-baumann · 2022-12-16T19:55:11Z

opusreader2/R/prepare_spectra.R

Line 25 in 989cfa8

data_matrix <- matrix(ds_data[[1]]$data[1:NPT], nrow = 1, ncol = NPT)

philipp-baumann · 2022-12-16T20:05:54Z

sc_sample (single channel sample): Length of NPT := 4827
We found that in data, the last parsed number of the spectral blocks (sc_ref, sc_sample, ab_no_atm_comp) is 0. We also checked this in the Bruker OPUS software (right click on loaded file). We guess, either this is either by design (Bruker) to make reverse engineering more difficult, or by artefact ;-) We slice columns of the data matrix therefore from 1 to NPT.

philipp-baumann · 2022-12-16T20:10:29Z

Indeed, you can calculate the apparent absorbance with the sc_ref and sc_sample blocks (single channels in sensor intensity) so that abs = log10(1 / (sc_sample / sc_ref)) [it is possible that single channels need to be truncated to have same x-axis.]. However, there is no need to do so, because there is the ab_no_atm_comp block which contains apparent absorbance in your case. We specifically renamed this because it makes one aware of that no atmospheric correction was performed. Out of interest, is there a reason why you don't do atmospheric correction (CO2, water vapor) at CSIRO?

In case there would be atmospheric compensation set in the measurement settings in OPUS, there would be an additional ab (save in absorbance ticked) or refl block. Kindly note that if atmospheric correction would be done, you would still have the "precessor" raw AB block in the OPUS file, but Bruker would only show the last processed block as AB in the software (here result spectrum after applying atmospheric correction macro).

philipp-baumann · 2022-12-16T20:28:54Z

Well, we hope that with the current development efforts we can now soon release a stable CRAN release of opusreader2. @ThomasKnecht has done some internal refactoring to clean the parse logic with some proper modularization. We want to have read_opus() as a really stable workhorse, with a slim interface. The simpler the interface is and the more narrow the core functionality (one function, one job), the easier to maintain for us and more pleasant to use (stability). We decided to strictly avoid any extra functionality. Extra features/processing and other tasks after reading, such as extracting metadata, will be implemented as additional helper functions. Or, if time allows and people can convince us that is serves an important or interesting use case, we can make high-level wrappers of read_opus() that come in handy for specific workflows of the community. With this, we hope to leverage as many use cases as possible. Or happily via PR contributions.

philipp-baumann · 2022-12-16T20:32:51Z

Short question: We plan to draft a vignette in the R package that shines light on the inner working of the Bruker file specification and parsing. For this purpose and for extended unit tests, we would like to compile a test suite of files (many types of Bruker FT-IR instruments, equipments and measurements setttings). Can we use your file for this purpose? We would reference your institution for credits.

davidbenncsiro · 2022-12-17T06:01:14Z

@philipp-baumann and @ThomasKnecht. First, thank you very much for your rapid and enthusiastic replies!

I overlooked ab_no_atm_comp initially. I agree that it makes perfect sense to use this and yields the kind of result I expected to see.

Client code calling read_opus() could I suppose first check whether ab exists and if not, use ab_no_atm_comp. Would you consider that to be a reasonable strategy? Are there circumstances under which neither would be present, other than for an invalid file?

I will raise the atmospheric compensation question with the researchers I am collaborating with (as a research software engineer). It may be that this has changed over time or that different Opus settings are applied across site labs for different purposes, but it is very much worth discussing, so thank you.

@ThomasKnecht, I don't recall what was the length of the sc_sample from read_opus_file() since I only used it a couple of times before noticing what I thought was a problem under Linux vs Windows, but just turned out to be a different version installed via devtools::install_github(). Then I updated the installation under Windows and by then only read_opus() was present.

@philipp-baumann, I suspect there will be no problem in using the file in your vignette but I will talk with the researchers.

davidbenncsiro · 2022-12-17T06:03:26Z

One thing I'm wondering about is whether rather than or in addition to a message like:

Error in if (next_offset >= raw_size) { : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In 4L * chunk_size : NAs produced by integer overflow

from read_opus(), a status value or string could be returned from the function so that code could programmatically check whether a file read succeeded? Would it be worth creating an issue for this?

davidbenncsiro · 2022-12-17T06:04:30Z

On the subject of metadata, can the existing opusreader metadata extraction approach be added to opusreader2?

davidbenncsiro · 2022-12-19T07:35:40Z

What is the best way to determine which block to look for data in?

For example, suppose the data are reflectance, will the block containing data always be refl? As you have noted above, if the data are absorbance, presumably the block will be ab or ab_no_atm_comp.

But how to determine which of these to use? Inopusreader one could ask:'

result$metadata$result_spc

and get:

[1] "RFL"

philipp-baumann · 2022-12-21T22:25:09Z

One thing I'm wondering about is whether rather than or in addition to a message like:
Error in if (next_offset >= raw_size) { : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In 4L * chunk_size : NAs produced by integer overflow
from read_opus(), a status value or string could be returned from the function so that code could programmatically check whether a file read succeeded? Would it be worth creating an issue for this?

Could you elaborate on how and when you got such an integer overflow condition? Assume it must be an extremely large integer.

philipp-baumann · 2022-12-21T22:29:06Z

On the subject of metadata, can the existing opusreader metadata extraction approach be added to opusreader2?

@ThomasKnecht and me have discussed a plan how to work on a much better metadata extraction/management scheme that enables more sophisticated data integrity checks (plate positions, sample id, time stamps) etc. We'd like to separate key infos across table "dumps" so we can link them in relational manner via keys. Save common entries and meanwhile enable comprehensive standard operation procedures on a platform. @ThomasKnecht is currently drafting something you could check out soon.

philipp-baumann · 2022-12-21T22:32:38Z

What is the best way to determine which block to look for data in?

For example, suppose the data are reflectance, will the block containing data always be refl? As you have noted above, if the data are absorbance, presumably the block will be ab or ab_no_atm_comp.

But how to determine which of these to use? Inopusreader one could ask:'
result$metadata$result_spc
and get:
[1] "RFL"

first part, partly yes. either it is only ab_no_atm_comp or ab_no_atm_comp plus ab. So the first is always present in the file for the record, which is a good default that Bruker did internally. Regarding your last question, we plan a separate table for spectral types in the metadata. Something along this line will be part of the prototype of new extract_metadata().

philipp-baumann · 2022-12-21T22:34:46Z

@davidbenncsiro , FYI we have been updating some basic stuff like README and we are now starting proper semantic verisioning announced in NEWS.md.

philipp-baumann · 2022-12-21T22:40:51Z

Generally I'd be very cautious mixing corrected and non-corrected data in a production environment (when doing cross-comparisons of two identical spectrometers at two different locations on basis of models (developing models with data on one device with atmospheric correction, predict on the same type but different instrument using measurements without correction). It took us a while to figure out what is going on. We'll consider this case to be most flexible in metadata and spectra management.

philipp-baumann · 2022-12-21T22:49:43Z

#44 @Thorsten-Behrens FYI the discussion we are having with @davidbenncsiro here also refers to your issue/wish. We are working on something along the line of extract_metadata() to use in downstream services of spectral-cockpit.

davidbenncsiro · 2022-12-22T05:09:14Z

Short question: We plan to draft a vignette in the R package that shines light on the inner working of the Bruker file specification and parsing. For this purpose and for extended unit tests, we would like to compile a test suite of files (many types of Bruker FT-IR instruments, equipments and measurements setttings). Can we use your file for this purpose? We would reference your institution for credits.

I checked with the researcher in question and this is fine.

davidbenncsiro · 2022-12-22T05:11:47Z

One thing I'm wondering about is whether rather than or in addition to a message like:
Error in if (next_offset >= raw_size) { : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In 4L * chunk_size : NAs produced by integer overflow

...Would it be worth creating an issue for this?
Could you elaborate on how and when you got such an integer overflow condition? Assume it must be an extremely large integer.

I thought I had created a ticket for this but it appears I didn't. I will need to see if I can reproduce this.

ThomasKnecht · 2022-12-27T08:46:56Z

for the metadata extraction see #66

philipp-baumann · 2023-06-05T00:16:56Z

also @davidbenncsiro we have now support for basic metadata here #84

philipp-baumann · 2023-11-10T09:29:53Z

closing since we have this in #66

davidbenncsiro changed the title ~~Results from older read_opus_file() different from read~~ Results differ between read_opus_file() and current read_opus() Dec 16, 2022

philipp-baumann changed the title ~~Results differ between read_opus_file() and current read_opus()~~ Discussion on design patterns of metadata extraction Dec 21, 2022

philipp-baumann changed the title ~~Discussion on design patterns of metadata extraction~~ Discussion on design patterns of metadata management and extraction Dec 21, 2022

philipp-baumann changed the title ~~Discussion on design patterns of metadata management and extraction~~ Discussion on design patterns of (meta)data management and extraction Dec 21, 2022

philipp-baumann mentioned this issue Jul 7, 2023

Road map to first CRAN release #89

Open

2 tasks

philipp-baumann closed this as completed Nov 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion on design patterns of (meta)data management and extraction #59

Discussion on design patterns of (meta)data management and extraction #59

davidbenncsiro commented Dec 16, 2022 •

edited

Loading

davidbenncsiro commented Dec 16, 2022

ThomasKnecht commented Dec 16, 2022 •

edited

Loading

ThomasKnecht commented Dec 16, 2022

philipp-baumann commented Dec 16, 2022

philipp-baumann commented Dec 16, 2022

philipp-baumann commented Dec 16, 2022 •

edited

Loading

philipp-baumann commented Dec 16, 2022 •

edited

Loading

philipp-baumann commented Dec 16, 2022 •

edited

Loading

davidbenncsiro commented Dec 17, 2022

davidbenncsiro commented Dec 17, 2022

davidbenncsiro commented Dec 17, 2022

davidbenncsiro commented Dec 19, 2022

philipp-baumann commented Dec 21, 2022

philipp-baumann commented Dec 21, 2022 •

edited

Loading

philipp-baumann commented Dec 21, 2022 •

edited

Loading

philipp-baumann commented Dec 21, 2022

philipp-baumann commented Dec 21, 2022 •

edited

Loading

philipp-baumann commented Dec 21, 2022 •

edited

Loading

davidbenncsiro commented Dec 22, 2022

davidbenncsiro commented Dec 22, 2022

ThomasKnecht commented Dec 27, 2022

philipp-baumann commented Jun 5, 2023

philipp-baumann commented Nov 10, 2023

Discussion on design patterns of (meta)data management and extraction #59

Discussion on design patterns of (meta)data management and extraction #59

Comments

davidbenncsiro commented Dec 16, 2022 • edited Loading

davidbenncsiro commented Dec 16, 2022

ThomasKnecht commented Dec 16, 2022 • edited Loading

ThomasKnecht commented Dec 16, 2022

philipp-baumann commented Dec 16, 2022

philipp-baumann commented Dec 16, 2022

philipp-baumann commented Dec 16, 2022 • edited Loading

philipp-baumann commented Dec 16, 2022 • edited Loading

philipp-baumann commented Dec 16, 2022 • edited Loading

davidbenncsiro commented Dec 17, 2022

davidbenncsiro commented Dec 17, 2022

davidbenncsiro commented Dec 17, 2022

davidbenncsiro commented Dec 19, 2022

philipp-baumann commented Dec 21, 2022

philipp-baumann commented Dec 21, 2022 • edited Loading

philipp-baumann commented Dec 21, 2022 • edited Loading

philipp-baumann commented Dec 21, 2022

philipp-baumann commented Dec 21, 2022 • edited Loading

philipp-baumann commented Dec 21, 2022 • edited Loading

davidbenncsiro commented Dec 22, 2022

davidbenncsiro commented Dec 22, 2022

ThomasKnecht commented Dec 27, 2022

philipp-baumann commented Jun 5, 2023

philipp-baumann commented Nov 10, 2023

davidbenncsiro commented Dec 16, 2022 •

edited

Loading

ThomasKnecht commented Dec 16, 2022 •

edited

Loading

philipp-baumann commented Dec 16, 2022 •

edited

Loading

philipp-baumann commented Dec 16, 2022 •

edited

Loading

philipp-baumann commented Dec 16, 2022 •

edited

Loading

philipp-baumann commented Dec 21, 2022 •

edited

Loading

philipp-baumann commented Dec 21, 2022 •

edited

Loading

philipp-baumann commented Dec 21, 2022 •

edited

Loading

philipp-baumann commented Dec 21, 2022 •

edited

Loading