Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion on design patterns of (meta)data management and extraction #59

Closed
davidbenncsiro opened this issue Dec 16, 2022 · 23 comments
Closed

Comments

@davidbenncsiro
Copy link

davidbenncsiro commented Dec 16, 2022

Around August, there was a function called read_opus_file() in opusreader2.

This yielded what appeared to be correct sample data from files I have been testing with, such as the one attached: DECCW1013_A11.zip

I updated the package today and note that read_opus_file() does not exist now.

I tried to use read_opus() but this returns a different result.

Instead of result$spec$wavenumbers and result$spec$data we now have result$sc_sample$wavenumbers, result$sc_sample$data, result$sc_ref$wavenumbers, result$sc_ref$data and so on for different block types.

Should I do something like divide sc_sample by sc_ref data to yield absorbance data?

Also, at least in the case of sc_sample, the wavenumbers vector has a different length from the data and the data values don't match what I was seeing from result$spec$data with read_opus_file().

See also:

@davidbenncsiro davidbenncsiro changed the title Results from older read_opus_file() different from read Results differ between read_opus_file() and current read_opus() Dec 16, 2022
@davidbenncsiro
Copy link
Author

I also realise that there is a lot of active development.

@ThomasKnecht
Copy link
Collaborator

ThomasKnecht commented Dec 16, 2022

Hi David!
Thanks for the input.

I just parsed your file with:

test <- read_opus("inst/extdata/test_data/DECCW1013_A11.0")

The previous spec is now renamed to ab or ab_no_atm_comp
According to the parsing output of your file, the ab_no_atm_comp is contained in the file.

I add the screenshots of the output:

Screenshot 2022-12-16 at 20 38 41

@ThomasKnecht
Copy link
Collaborator

What was the length of the sc_sample you got with the read_opus_file() function?

@philipp-baumann
Copy link
Member

data_matrix <- matrix(ds_data[[1]]$data[1:NPT], nrow = 1, ncol = NPT)

@philipp-baumann
Copy link
Member

  • sc_sample (single channel sample): Length of NPT := 4827
    We found that in data, the last parsed number of the spectral blocks (sc_ref, sc_sample, ab_no_atm_comp) is 0. We also checked this in the Bruker OPUS software (right click on loaded file). We guess, either this is either by design (Bruker) to make reverse engineering more difficult, or by artefact ;-) We slice columns of the data matrix therefore from 1 to NPT.

@philipp-baumann
Copy link
Member

philipp-baumann commented Dec 16, 2022

Indeed, you can calculate the apparent absorbance with the sc_ref and sc_sample blocks (single channels in sensor intensity) so that abs = log10(1 / (sc_sample / sc_ref)) [it is possible that single channels need to be truncated to have same x-axis.]. However, there is no need to do so, because there is the ab_no_atm_comp block which contains apparent absorbance in your case. We specifically renamed this because it makes one aware of that no atmospheric correction was performed. Out of interest, is there a reason why you don't do atmospheric correction (CO2, water vapor) at CSIRO?

In case there would be atmospheric compensation set in the measurement settings in OPUS, there would be an additional ab (save in absorbance ticked) or refl block. Kindly note that if atmospheric correction would be done, you would still have the "precessor" raw AB block in the OPUS file, but Bruker would only show the last processed block as AB in the software (here result spectrum after applying atmospheric correction macro).

@philipp-baumann
Copy link
Member

philipp-baumann commented Dec 16, 2022

Well, we hope that with the current development efforts we can now soon release a stable CRAN release of opusreader2. @ThomasKnecht has done some internal refactoring to clean the parse logic with some proper modularization. We want to have read_opus() as a really stable workhorse, with a slim interface. The simpler the interface is and the more narrow the core functionality (one function, one job), the easier to maintain for us and more pleasant to use (stability). We decided to strictly avoid any extra functionality. Extra features/processing and other tasks after reading, such as extracting metadata, will be implemented as additional helper functions. Or, if time allows and people can convince us that is serves an important or interesting use case, we can make high-level wrappers of read_opus() that come in handy for specific workflows of the community. With this, we hope to leverage as many use cases as possible. Or happily via PR contributions.

@philipp-baumann
Copy link
Member

philipp-baumann commented Dec 16, 2022

Short question: We plan to draft a vignette in the R package that shines light on the inner working of the Bruker file specification and parsing. For this purpose and for extended unit tests, we would like to compile a test suite of files (many types of Bruker FT-IR instruments, equipments and measurements setttings). Can we use your file for this purpose? We would reference your institution for credits.

@davidbenncsiro
Copy link
Author

@philipp-baumann and @ThomasKnecht. First, thank you very much for your rapid and enthusiastic replies!

I overlooked ab_no_atm_comp initially. I agree that it makes perfect sense to use this and yields the kind of result I expected to see.

Client code calling read_opus() could I suppose first check whether ab exists and if not, use ab_no_atm_comp. Would you consider that to be a reasonable strategy? Are there circumstances under which neither would be present, other than for an invalid file?

I will raise the atmospheric compensation question with the researchers I am collaborating with (as a research software engineer). It may be that this has changed over time or that different Opus settings are applied across site labs for different purposes, but it is very much worth discussing, so thank you.

@ThomasKnecht, I don't recall what was the length of the sc_sample from read_opus_file() since I only used it a couple of times before noticing what I thought was a problem under Linux vs Windows, but just turned out to be a different version installed via devtools::install_github(). Then I updated the installation under Windows and by then only read_opus() was present.

@philipp-baumann, I suspect there will be no problem in using the file in your vignette but I will talk with the researchers.

@davidbenncsiro
Copy link
Author

One thing I'm wondering about is whether rather than or in addition to a message like:

Error in if (next_offset >= raw_size) { : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In 4L * chunk_size : NAs produced by integer overflow

from read_opus(), a status value or string could be returned from the function so that code could programmatically check whether a file read succeeded? Would it be worth creating an issue for this?

@davidbenncsiro
Copy link
Author

On the subject of metadata, can the existing opusreader metadata extraction approach be added to opusreader2?

@davidbenncsiro
Copy link
Author

What is the best way to determine which block to look for data in?

For example, suppose the data are reflectance, will the block containing data always be refl? As you have noted above, if the data are absorbance, presumably the block will be ab or ab_no_atm_comp.

But how to determine which of these to use? Inopusreader one could ask:'

result$metadata$result_spc

and get:

[1] "RFL"

@philipp-baumann
Copy link
Member

One thing I'm wondering about is whether rather than or in addition to a message like:

Error in if (next_offset >= raw_size) { : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In 4L * chunk_size : NAs produced by integer overflow

from read_opus(), a status value or string could be returned from the function so that code could programmatically check whether a file read succeeded? Would it be worth creating an issue for this?

Could you elaborate on how and when you got such an integer overflow condition? Assume it must be an extremely large integer.

@philipp-baumann
Copy link
Member

philipp-baumann commented Dec 21, 2022

On the subject of metadata, can the existing opusreader metadata extraction approach be added to opusreader2?

@ThomasKnecht and me have discussed a plan how to work on a much better metadata extraction/management scheme that enables more sophisticated data integrity checks (plate positions, sample id, time stamps) etc. We'd like to separate key infos across table "dumps" so we can link them in relational manner via keys. Save common entries and meanwhile enable comprehensive standard operation procedures on a platform. @ThomasKnecht is currently drafting something you could check out soon.

@philipp-baumann
Copy link
Member

philipp-baumann commented Dec 21, 2022

What is the best way to determine which block to look for data in?

For example, suppose the data are reflectance, will the block containing data always be refl? As you have noted above, if the data are absorbance, presumably the block will be ab or ab_no_atm_comp.

But how to determine which of these to use? Inopusreader one could ask:'

result$metadata$result_spc

and get:

[1] "RFL"

first part, partly yes. either it is only ab_no_atm_comp or ab_no_atm_comp plus ab. So the first is always present in the file for the record, which is a good default that Bruker did internally. Regarding your last question, we plan a separate table for spectral types in the metadata. Something along this line will be part of the prototype of new extract_metadata().

@philipp-baumann
Copy link
Member

@davidbenncsiro , FYI we have been updating some basic stuff like README and we are now starting proper semantic verisioning announced in NEWS.md.

@philipp-baumann
Copy link
Member

philipp-baumann commented Dec 21, 2022

Generally I'd be very cautious mixing corrected and non-corrected data in a production environment (when doing cross-comparisons of two identical spectrometers at two different locations on basis of models (developing models with data on one device with atmospheric correction, predict on the same type but different instrument using measurements without correction). It took us a while to figure out what is going on. We'll consider this case to be most flexible in metadata and spectra management.

@philipp-baumann philipp-baumann changed the title Results differ between read_opus_file() and current read_opus() Discussion on design patterns of metadata extraction Dec 21, 2022
@philipp-baumann
Copy link
Member

philipp-baumann commented Dec 21, 2022

#44 @Thorsten-Behrens FYI the discussion we are having with @davidbenncsiro here also refers to your issue/wish. We are working on something along the line of extract_metadata() to use in downstream services of spectral-cockpit.

@philipp-baumann philipp-baumann changed the title Discussion on design patterns of metadata extraction Discussion on design patterns of metadata management and extraction Dec 21, 2022
@philipp-baumann philipp-baumann changed the title Discussion on design patterns of metadata management and extraction Discussion on design patterns of (meta)data management and extraction Dec 21, 2022
@davidbenncsiro
Copy link
Author

Short question: We plan to draft a vignette in the R package that shines light on the inner working of the Bruker file specification and parsing. For this purpose and for extended unit tests, we would like to compile a test suite of files (many types of Bruker FT-IR instruments, equipments and measurements setttings). Can we use your file for this purpose? We would reference your institution for credits.

I checked with the researcher in question and this is fine.

@davidbenncsiro
Copy link
Author

One thing I'm wondering about is whether rather than or in addition to a message like:

Error in if (next_offset >= raw_size) { : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In 4L * chunk_size : NAs produced by integer overflow

...Would it be worth creating an issue for this?

Could you elaborate on how and when you got such an integer overflow condition? Assume it must be an extremely large integer.

I thought I had created a ticket for this but it appears I didn't. I will need to see if I can reproduce this.

@ThomasKnecht
Copy link
Collaborator

for the metadata extraction see #66

@philipp-baumann
Copy link
Member

also @davidbenncsiro we have now support for basic metadata here #84

@philipp-baumann
Copy link
Member

closing since we have this in #66

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants