Validator error in Splash #248

Open
ksjewell opened this issue Oct 27, 2023 · 28 comments

@ksjewell

Hi René,

I am getting the following Validator error:

10:33:21.420 ERROR massbank.cli.Validator - ACCESSION: MSBNK-BAFG-CSL23102611413
10:33:21.420 ERROR massbank.cli.Validator - ^
10:33:21.420 ERROR massbank.cli.Validator - Error in 'BAFG/MSBNK-BAFG-CSL23102611413.txt'.
10:33:21.473 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-0gx3-9000000000-fdf8d511e2f88d17c82e defined in record file, but splash10-0w3u-9000000000-fdf8d511e2f88d17c82e calculated from peaks.

I checked the file and the actual splash in the file is:
splash10-0006-9300000000-5cd70311703e2423a1c5

I ran the code separately and indeed this is the splash I get when I run:

Browse[1]> spec
        mz intensity
1  44.9980       0.2
2  80.0261       0.1
3  93.0321       0.4
4 108.0227       0.3
Browse[1]> splashR::getSplash(spec)
[1] "splash10-0006-9300000000-5cd70311703e2423a1c5"

So not only do I not understand where it is getting the splash splash10-0gx3-9000000000-fdf8d511e2f88d17c82e from, I also do not understand why it computes splash10-0w3u-9000000000-fdf8d511e2f88d17c82e, which differs from the one I get.

@meier-rene
Collaborator

Honestly, I don't know. Could you please attach the MSBNK-BAFG-CSL23102611413.txt file here for me?

@schymane
Member

Strange that it's in the first block, I also don't recall seeing this case before...

@ksjewell
Author

@meier-rene
Collaborator

Thank you. I checked your file. It contains:
splash10-0006-9300000000-5cd70311703e2423a1c5
Validator reports it finds
splash10-0006-9300000000-5cd70311703e2423a1c5
but wants
splash10-052f-9300000000-5cd70311703e2423a1c5.

I expect the output shown in your first comment came from a validator run over multiple files. The software runs multithreaded, so the output sometimes gets interleaved. I expect that the output line you found belongs to a different record. Also, in the output the explanation comes first and the filename afterwards; see the single-file validation below.

Let's focus instead on the output of validating a single file. You are right: there is a mismatch between the SPLASH calculated by RMassBank and the one from the Validator.

Validator version: 2.2.5-SNAPSHOT
14:12:50.497 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-0006-9300000000-5cd70311703e2423a1c5 defined in record file, but splash10-052f-9300000000-5cd70311703e2423a1c5 calculated from peaks.
14:12:50.499 ERROR massbank.cli.Validator - ACCESSION: MSBNK-BAFG-CSL23102611413
14:12:50.499 ERROR massbank.cli.Validator - ^
14:12:50.499 ERROR massbank.cli.Validator - Error in 'MSBNK-BAFG-CSL23102611413.txt'.

I need to dig a little bit deeper.
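
To see at a glance which part of two SPLASH strings disagrees, the strings can be split on the hyphens and compared block by block. A minimal base-R sketch; `compare_splash` is a hypothetical helper, not part of splashR, RMassBank or the Validator:

```r
# Hypothetical helper: compare two SPLASH strings block by block.
compare_splash <- function(record, calculated) {
  a <- strsplit(record, "-", fixed = TRUE)[[1]]
  b <- strsplit(calculated, "-", fixed = TRUE)[[1]]
  # A SPLASH has four hyphen-separated blocks:
  # version, two histogram blocks, and a hash block.
  stopifnot(length(a) == 4, length(b) == 4)
  data.frame(
    block = c("version", "histogram 1", "histogram 2", "hash"),
    record = a,
    calculated = b,
    match = a == b,
    stringsAsFactors = FALSE
  )
}

cmp <- compare_splash(
  "splash10-0006-9300000000-5cd70311703e2423a1c5",
  "splash10-052f-9300000000-5cd70311703e2423a1c5"
)
print(cmp)
```

For this record only the short histogram block differs (0006 vs 052f); the hash block is identical, which suggests both tools were given the same peak list.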

@ksjewell
Author

Alright, seems you will solve it soon. Just as a heads-up, I used splashR to compute the Splash.

@schymane
Member

Interesting, https://splash.fiehnlab.ucdavis.edu/ gives
[screenshot of the calculator output]

...and it only worked with those numbers; it returned a format error for the middle column only.

@meier-rene
Collaborator

We recently had a similar issue, MassBank/MassBank-web#384, which was somehow related to zeros. What happens with your R object if you remove the 0 in the first row?

@schymane
Member

I thought of that issue too, but this time it affects the first block, not the third one, which is really strange. Is it related to the middle column somehow (all entries are below 1)?

[screenshot]

Tagging in @berlinguyinca and @ssmehta again ;-)

@meier-rene
Collaborator

meier-rene commented Oct 27, 2023

We need to solve that issue on the R side.

curl -d '{ "ions": [ {"mass": 44.998, "intensity": 0.2 }, {"mass": 80.0261, "intensity": 0.1 }, {"mass": 93.0321, "intensity": 0.4 }, {"mass": 108.0227, "intensity": 0.3 } ], "type": "MS"}' -H "Content-Type: application/json"  https://splash.fiehnlab.ucdavis.edu/splash/it 
splash10-052f-9300000000-5cd70311703e2423a1c5

The REST endpoint agrees with the Java implementation, and 44.9980 gives the same result. I will read the old issue again very carefully.
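
The JSON body for this REST call can also be assembled from an R peak matrix. A sketch in base R, assuming a two-column matrix with m/z in column 1 and intensity in column 2; `splash_payload` is a hypothetical helper, and only the field names and the endpoint are taken from the curl call above:

```r
# Hypothetical helper: build the JSON body for the /splash/it endpoint
# from a two-column peak matrix (column 1 = m/z, column 2 = intensity).
splash_payload <- function(peaks, type = "MS") {
  ions <- sprintf('{"mass": %s, "intensity": %s}',
                  as.character(peaks[, 1]), as.character(peaks[, 2]))
  sprintf('{"ions": [%s], "type": "%s"}', paste(ions, collapse = ", "), type)
}

spec <- cbind(mz = c(44.998, 80.0261, 93.0321, 108.0227),
              intensity = c(0.2, 0.1, 0.4, 0.3))
payload <- splash_payload(spec)
cat(payload, "\n")

# The request itself can then be sent with curl (not run here):
# curl -d '<payload>' -H "Content-Type: application/json" \
#   https://splash.fiehnlab.ucdavis.edu/splash/it
```

Building the string by hand keeps the sketch free of package dependencies; a JSON library would be the more robust choice for anything beyond a quick check.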

@ksjewell
Author

ksjewell commented Oct 27, 2023

I can't find a way in R to skip the first 0 in 44.9980 but leave the others unchanged. If I round everything to 3 decimal places, I also get the incorrect splash

@schymane
Member

Please don't round to 3 dp! That will for sure change the splash (but also the final hash block too, right?).
The first block is a summary block, it makes no sense why it would change so dramatically ... it should not be sensitive to a 0.

@schymane
Member

In the second and third blocks, intensities are summed over fixed (but different) bin sizes and wrapped over ten bins. The wrapped bin (zero-based) index for a given ion is computed as floor (m/z ÷ BinSize) modulo 10. This wrapping strategy accommodates all possible spectral mass ranges while maintaining fixed-length summary blocks.

From the article ... the second block (wrapped bin) is the one that's changing: 052f vs 0006
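
The wrapping rule quoted above can be sketched in a few lines of R. The bin size of 100 used below is an assumption chosen for illustration (the quote does not state the bin sizes), and `wrapped_bin_index` and `wrapped_sums` are hypothetical helper names; this is not the splashR implementation:

```r
# Zero-based wrapped bin index: floor(m/z / bin size) modulo the number of bins.
wrapped_bin_index <- function(mz, bin_size, n_bins = 10) {
  floor(mz / bin_size) %% n_bins
}

# Sum intensities per wrapped bin (returns a vector of length n_bins).
wrapped_sums <- function(mz, intensity, bin_size, n_bins = 10) {
  idx <- wrapped_bin_index(mz, bin_size, n_bins)
  sums <- numeric(n_bins)
  for (i in seq_along(idx)) {
    # R vectors are 1-based, so shift the zero-based bin index by one
    sums[idx[i] + 1] <- sums[idx[i] + 1] + intensity[i]
  }
  sums
}

mz <- c(44.998, 80.0261, 93.0321, 108.0227)
intensity <- c(0.2, 0.1, 0.4, 0.3)

idx <- wrapped_bin_index(mz, bin_size = 100)         # assumed bin size
sums <- wrapped_sums(mz, intensity, bin_size = 100)  # assumed bin size
print(idx)
print(sums)
```

With this assumed bin size the first three peaks fall into bin 0 and the fourth into bin 1. Scaling all intensities by a common factor changes the sums but not their ratios, so one would expect a histogram built from normalized sums not to depend on whether the intensities are below 1.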

@meowcat

meowcat commented Oct 30, 2023

Looking at the failing file, I note that your absolute intensities are all <1. Is this how Sciex reports them? Does that have anything to do with the issue?

@ksjewell
Author

This is how Sciex converts them to mzXML. I believe in the native Sciex format, the numbers are higher.

@meowcat

meowcat commented Oct 30, 2023

@meier-rene
Collaborator

@meowcat great find. Does this mean the issue should go to the R implementation at https://github.com/berlinguyinca/spectra-hash?
Besides that, any chance that we get higher intensities out of the Sciex export for now? I expect you use ProteoWizard for the conversion?

@ksjewell
Author

I can just change the intensities temporarily to create the splash, no?

@meier-rene
Collaborator

You don't need to worry about the SPLASH issue, because I can easily fix that in the txt files. If you think your files are fine and only some SPLASH values are broken, please reopen your PR.

I expect that there is a fix required to the SPLASH library to solve that issue on the RMassBank side.

@meowcat

meowcat commented Oct 30, 2023

@ksjewell Since you import the records in MsBackendMassbank and then export them again (right?), you could in fact recalculate the splash there, yes.
Something like

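library(purrr)  # provides map_chr()
# Recompute each SPLASH from intensities scaled by 1000 (workaround for intensities < 1)
spectraData(sp)$splash <- map_chr(as.list(peaksData(sp)), function(pks) {
  pks[, 2] <- pks[, 2] * 1000   # column 2 holds the intensities
  RMassBank:::getSplash(pks)
})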

> I expect that there is a fix required to the SPLASH library to solve that issue on the RMassBank side.

yep; though best would be to get the fix in the original SPLASH lib and port it identically, so we don't have two different implementations of the fix. I hope multiplying by 1k will not break a few other SPLASHes because of rounding issues
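
Whether scaling by 1000 can disturb a result may be gauged by comparing relative intensities before and after scaling: they agree within floating-point tolerance but are not guaranteed to be bit-identical, so a later truncation step could in principle land on a different integer. A small base-R sketch; the helper name `rel_intensity` is hypothetical, and normalizing to the base peak is an assumption about how a histogram would be built:

```r
# Hypothetical helper: intensities relative to the base peak.
rel_intensity <- function(intensity) intensity / max(intensity)

intensity <- c(0.2, 0.1, 0.4, 0.3)
before <- rel_intensity(intensity)
after <- rel_intensity(intensity * 1000)

# Equal up to numerical tolerance ...
print(all.equal(before, after))
# ... but exact equality is not guaranteed for every spectrum.
print(identical(before, after))
```

Running such a comparison over all affected records before re-exporting them would show whether any value sits close enough to an integer boundary to be a concern.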

@meier-rene
Collaborator

> yep; though best would be to get the fix in the original SPLASH lib and port it identically, so we don't have two different implementations of the fix.

I agree, that's why I opened an issue at the splash package repo.

@ksjewell
Author

ksjewell commented Nov 9, 2023

I think I am making progress, but there is still a single Validator error left (this is after multiplying the intensities by 1000).
Since it is just one file I will change the i to an l and be done with it :). But you know, in case it helps:

20:09:06.617 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-014i-9000000000-508039bd516ba9b5a8ab defined in record file, but splash10-014l-9000000000-508039bd516ba9b5a8ab calculated from peaks.

Here is the file:

ACCESSION: MSBNK-BAFG-CSL231109456
RECORD_TITLE: Benzyl-dimethyl-decylammonium; LC-ESI-QTOF; MS2; 150 V
DATE: 2023.11.09
AUTHORS: Kevin S. Jewell; Björn Ehlig; Arne Wick
LICENSE: dl-de/by-2-0
COPYRIGHT: Copyright 2023 Federal Institute of Hydrology, Koblenz, Germany
COMMENT: CONFIDENCE Reference Standard (Level 1)
COMMENT: Chromatography method: dx.doi.org/10.1016/j.chroma.2015.11.014
COMMENT: Acquisition method: 10.1002/rcm.8541
CH$NAME: Benzyl-dimethyl-decylammonium
CH$COMPOUND_CLASS: Antimicrobial; Pharmaceutical
CH$FORMULA: [C19H34N]+
CH$EXACT_MASS: 276.2686
CH$SMILES: CCCCCCCCCC[N+](C)(C)Cc1ccccc1
CH$IUPAC: InChI=1S/C19H34N/c1-4-5-6-7-8-9-10-14-17-20(2,3)18-19-15-12-11-13-16-19/h11-13,15-16H,4-10,14,17-18H2,1-3H3/q+1
CH$LINK: CAS 48185-25-7
CH$LINK: INCHIKEY UARILQSOMYIQCM-UHFFFAOYSA-N
AC$INSTRUMENT: TripleTOF 5600 SCIEX
AC$INSTRUMENT_TYPE: LC-ESI-QTOF
AC$MASS_SPECTROMETRY: MS_TYPE MS2
AC$MASS_SPECTROMETRY: ION_MODE POSITIVE
AC$MASS_SPECTROMETRY: COLLISION_ENERGY 150
AC$MASS_SPECTROMETRY: FRAGMENTATION_MODE CID
AC$MASS_SPECTROMETRY: IONIZATION ESI
AC$CHROMATOGRAPHY: COLUMN_NAME Zorbax Eclipse Plus C18 2.1 mm x 150 mm, 3.5 um, Agilent
AC$CHROMATOGRAPHY: COLUMN_TEMPERATURE 40 °C
AC$CHROMATOGRAPHY: FLOW_GRADIENT 0 min min 98% A, 1 min 98% A, 2 min 80% A, 16.5 min 2% A, 22 min 2% A, 22.1 min 98% A, 27 min 98% A
AC$CHROMATOGRAPHY: FLOW_RATE 0.3 mL/min
AC$CHROMATOGRAPHY: RETENTION_TIME 10.366 min
AC$CHROMATOGRAPHY: SOLVENT A: Water 0.1% Formic acid, B: Acetonitrile 0.1% Formic acid
MS$FOCUSED_ION: PRECURSOR_M/Z 276.2686
MS$FOCUSED_ION: PRECURSOR_TYPE [M]+
MS$DATA_PROCESSING: COMMENT Export with Spectra 1.9.12 MsBackendMassbank 1.7.4
MS$DATA_PROCESSING: WHOLE RMassBank 2.3.1
PK$SPLASH: splash10-014i-9000000000-508039bd516ba9b5a8ab
PK$NUM_PEAK: 4
PK$PEAK: m/z int. rel.int.
  42.0443 4.6 142
  58.0706 7.6 235
  65.0436 32.2 999
  91.0554 11.5 356
//



@schymane
Member

schymane commented Nov 9, 2023

Annoyingly, the SPLASH website won't take it (which may be a clue in itself, this happened before too). I've tried several variants.
[screenshot]

The 999-scaled values give the i variant:
[screenshot]
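
The rel.int. column of the record above appears to be the absolute intensities rescaled to a base peak of 999 and truncated to integers. A short sketch reproducing those values; the truncation rule is inferred from the four peaks in the record, not taken from the MassBank format documentation, and `scale_to_999` is a hypothetical name:

```r
# Hypothetical helper: rescale to a base peak of 999 and truncate
# (rule inferred from the record in this thread).
scale_to_999 <- function(intensity) {
  floor(intensity / max(intensity) * 999)
}

int <- c(4.6, 7.6, 32.2, 11.5)
rel_int <- scale_to_999(int)
print(rel_int)  # 142 235 999 356, matching the rel.int. column
```

Because the truncation discards information, the scaled integers are no longer exactly proportional to the original decimal intensities.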

@ksjewell
Author

So does that mean the i variant is correct in this case and the Validator is incorrect?

@schymane
Member

Not sure, need @meier-rene 's opinion on this... it's strange that it doesn't work at all with the decimals...

@sneumann
Member

Hi, I can confirm that the online calculator https://splash.fiehnlab.ucdavis.edu/ is unhappy about decimals in intensities. Decimals in m/z are fine there. IIRC the online calculator uses the Scala implementation. Yours, Steffen

@berlinguyinca

berlinguyinca commented Nov 13, 2023 via email

@sneumann
Member

OK, digging a bit further: so far we used the online SPLASH calculator, which takes the peak list as a kind of CSV and rejects non-integer intensities in its input validation. Using the REST call instead, we get for the spectrum in #248 (comment):

curl -X POST -H 'Content-Type: application/json' -d '{"ions":[{"mass": 42.0443, "intensity": 4.6},{"mass": 58.0706, "intensity": 7.6},{"mass": 65.0436, "intensity": 32.2},{"mass": 91.0554, "intensity": 11.5}], "type": "MS"}' https://splash.fiehnlab.ucdavis.edu/splash/it  ; echo
splash10-014l-9000000000-508039bd516ba9b5a8ab

which is the same value the MassBank validator reports (splash10-014l-9000000000-508039bd516ba9b5a8ab calculated from peaks). So I added a unit test to splashR checking this output:
https://github.com/berlinguyinca/spectra-hash/pull/51/files and it gets the correct result.

I also checked that splashR and the SPLASH code we copied and pasted into RMassBank give identical results:

> spectrum <- cbind(mz=c(42.0443, 58.0706, 65.0436, 91.0554), intensity=c(4.6, 7.6, 32.2, 11.5))
> splashR:::getSplash(spectrum)
[1] "splash10-014l-9000000000-508039bd516ba9b5a8ab"
> RMassBank:::getSplash(spectrum)
[1] "splash10-014l-9000000000-508039bd516ba9b5a8ab"
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
... other attached packages:
[1] RMassBank_3.11.1.1 Rcpp_1.0.10        splashR_0.0.3      digest_0.6.31     

So I get the feeling RMassBank passes something weird to getSplash().
I would need @ksjewell's help to run RMassBank with some more diagnostic output, to capture which values are used for this record in
https://github.com/MassBank/RMassBank/blob/3b61006a1a4bac9c94e780ad82834a1dae9ce417/R/createMassBank.R#L1556
The simplest approach would be to add the following line, which saves the peaks for the offending record:

if (mbdata[["PK$SPLASH"]]=="splash10-014i-9000000000-508039bd516ba9b5a8ab") save(peaks, file="peaks-splash10-014i-9000000000-508039bd516ba9b5a8ab.Rdata")

That'd be highly appreciated, please ping me if you need help.
Yours, Steffen
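
Once such a file exists, its content could be inspected along the following lines. This sketch assumes the saved object is named `peaks` (as in the `save()` call above) and is matrix-like with intensities in the second column; `inspect_saved_peaks` is a hypothetical helper:

```r
# Hypothetical helper: summarise the object stored by save(peaks, file = ...).
inspect_saved_peaks <- function(path) {
  env <- new.env()
  load(path, envir = env)          # restores the object(s) stored by save()
  peaks <- get("peaks", envir = env)
  list(
    class = class(peaks),
    dim = dim(peaks),
    colnames = colnames(peaks),
    # TRUE if any intensity has a fractional part (assumes column 2 = intensity)
    fractional_intensity = any(peaks[, 2] %% 1 != 0)
  )
}

# Demonstration with a stand-in object written to a temporary file
peaks <- cbind(mz = c(42.0443, 58.0706, 65.0436, 91.0554),
               intensity = c(4.6, 7.6, 32.2, 11.5))
tmp <- tempfile(fileext = ".Rdata")
save(peaks, file = tmp)
info <- inspect_saved_peaks(tmp)
str(info)
```

The column names, dimensions, and whether the intensities are integers or decimals are the details most likely to reveal what differs from the values in the record file.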

@ssmehta

ssmehta commented Nov 14, 2023

Hi all,

Regarding the initial problem in this issue relating to inconsistent histograms, I believe this was due to a missing binning correction factor in splashR. I submitted a PR which should fix this: berlinguyinca/spectra-hash#52

For the second spectrum, I agree with @sneumann that it doesn't seem to be an issue with SPLASH. I tried some variations of intensities and could only produce 014l as the prefilter histogram, with and without the histogram fix I submitted. Hopefully with some more information we can track down that discrepancy.

Best,
Sajjan
