
Add support for checking missing sources with extra-data-validate #280

Open · wants to merge 4 commits into master

Conversation

@JamesWrigley (Member) commented Feb 22, 2022

This adds support for checking that sources in runs are not missing from more than some percentage of trains. MID noticed that occasionally certain devices (particularly cameras and fastADCs) would be missing from runs, and this wasn't noticed until later when they tried to analyze the data.

The goal is for this to be run automatically by the RunValidator device, perhaps with a setting to change the missing data threshold. You can test it with this run, which is missing a bunch of data from the AGIPD and some cameras: /gpfs/exfel/exp/MID/202221/p003210/raw/r0008.

Couple of things I'm not too sure about:

  • Does performance matter? This is implemented quite naively, and on the above run it takes ~8s (vs ~2s without checking sources). Since every file is checked anyway I imagine it's possible to modify the FileValidator to report the data counts for each source/key it contains.
  • Is it necessary to report every single key that is missing data? I don't know exactly how the DAQ works, but if it's guaranteed to dump all data for a source per-train, then I think it'd be enough to show e.g. <source> is missing data for <n> trains instead of listing every key. But maybe pipeline data is handled differently, e.g. for the big detectors?
  • Should there be a way to limit which sources are checked? I think not, unless for the sake of performance (in which case I'd rather optimize this implementation as in the first point).
  • Should it be possible to set a threshold per-source? As this is not much more than a sanity check, I think it's not necessary.
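To make the threshold question above concrete, here is a minimal sketch of the kind of check being discussed (illustrative only, not the PR's implementation; the source names and counts are made up):

```python
# Hypothetical sketch: flag sources that are missing data from more
# than some fraction of a run's trains. `counts` maps each source to
# the number of trains it has data for; a real check would get these
# numbers from the files rather than a hard-coded dict.

def missing_sources(counts, n_trains, threshold=0.1):
    """Return [(source, missing_fraction)] for sources missing data
    from more than `threshold` of the run's trains, worst first."""
    problems = []
    for source, n_with_data in counts.items():
        missing = 1 - n_with_data / n_trains
        if missing > threshold:
            problems.append((source, missing))
    return sorted(problems, key=lambda p: p[1], reverse=True)

counts = {
    "MID_EXP_AGIPD1M1/DET/0CH0:xtdf": 400,   # made-up numbers
    "MID_EXP_SAM/CAM/CAMERA1:output": 950,
    "MID_RR_SYS/MDL/TIMING": 1000,
}
for source, frac in missing_sources(counts, n_trains=1000):
    print(f"{source} is missing data for {frac:.0%} of trains")
```

With the default 10% threshold, only the first (hypothetical) source would be reported here.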

(BTW it's probably easiest to review each commit individually, they're fairly atomic)

Commit messages:

  • Also removed the `suffix` argument from progress_bar() and `nproblems` from RunValidator.progress().
  • This uses a user-defined threshold of trains with missing data to decide which sources to report.
@tmichela (Member)

Thanks James. To answer some of your questions:

Does performance matter?

It is not critical, but it is important.

Is it necessary to report every single key that is missing data?

Probably not. In general all keys come from the same data hash in pipeline data sources, so it's probably safe to assume that if one is written, all keys from the same hash are too. The MHz detectors are a special case: they have 4 sub-sections (image, header, trailer, detector) and each can be individually present or not. (Having said that, for analysis only image is relevant.)

Now, I'm not sure this is the right place for this check. Currently extra-data-validate's purpose is to check that a file is not broken, i.e. that it respects the format definition. Missing data is not a file problem, and is actually something "normal", as probably almost all runs will have missing data for one or more data sources. I also know that people tend to leave some instruments selected in the DAQ even if they aren't used, so you get no data for them all the time.
I think it will generate a lot of noise in the output of validate if you check the data ratio.

Also, for your interest, Robert and I are working on extending how we inform people about data issues. We will have a DnD to brainstorm on that idea when Robert is back, and thinking of a way to actively notify beam staff about these problems is a great use case (is that a reasonable timeframe for you?)

There are also a couple of ways already available to "solve" this issue:

  1. using lsxfel

e.g. something like:

$ lsxfel ./r0001 --detail "*:*" | grep "data for" -B2 | sed -n 1~2p

  - HED_EXP_IC1/ADC/1:channel_0.output
      data for 0 trains (0.00%), up to 0 entries per train
  - HED_EXP_IC1/ADC/1:channel_1.output
      data for 0 trains (0.00%), up to 0 entries per train
  - HED_EXP_IC1/ADC/1:channel_2.output
      data for 2734 trains (100.00%), up to 1 entries per train
  ...
  2. the RunValidator device already shows the data ratio per source
  3. The DAQ recently added a metrics table, did MID try to use that?

[screenshot of the DAQ metrics table]

@tmichela (Member)

maybe @takluyver has a different opinion?

@JamesWrigley (Member, Author)

Now, I'm not sure this is the right place for this check. Currently extra-data-validate's purpose is to check that a file is not broken, i.e. that it respects the format definition. Missing data is not a file problem, and is actually something "normal", as probably almost all runs will have missing data for one or more data sources. I also know that people tend to leave some instruments selected in the DAQ even if they aren't used, so you get no data for them all the time. I think it will generate a lot of noise in the output of validate if you check the data ratio.

Ouch, yeah if that's the case then this will be waaaay too noisy 😬 I could add an option to only check certain sources, but if that source list comes from the DAQ it doesn't solve anything...

Also, for your interest, Robert and I are working on extending how we inform people about data issues. We will have a DnD to brainstorm on that idea when Robert is back, and thinking of a way to actively notify beam staff about these problems is a great use case (is that a reasonable timeframe for you?)

Ah hah, that sounds cool. I don't know when Robert is back, but I'll check with MID if the things you listed below would help in the meantime.

2. the RunValidator device already shows the data ratio per source

I think they wanted something more warning-y, but perhaps they don't see those logs in their main scene. I'll check that.

3. The DAQ recently added a metrics table, did MID try to use that?

Oooo that looks very useful 👀 No I'm not sure, I'll ask.

@philsmt (Contributor) commented Feb 23, 2022

I think the option of quickly checking for this is very useful, but as Thomas suggested, it is not universally an error condition for data to be missing on a given train. On the contrary, there are devices that, either intentionally or by their nature, emit data at less than 10 Hz. That's not a problem with the feature itself, though, only with giving it a finite default value. It should be disabled by default and easy to turn on, even permanently if the instrument chooses so for the run validator.

On the more technical side, I think there are plenty of assumptions you can include to reduce the number of walked keys:

  • CONTROL data is basically guaranteed to exist for every train the DAQ itself registers, as it copies its value from the previous train if there has been no property update. I'm not entirely sure on this point, but off the top of my head I cannot recall an instance where this has not been true. Naturally a DAQ bug may cause this, so one could allow a flag to restrict the missing-data check to INSTRUMENT data only if performance matters.
  • Going further to INSTRUMENT data, the DAQ indexes fast data by its root key (e.g. data or image). All keys below this key must exist for every input, or the DAQ will drop the entire train on this source.
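The second point above suggests an easy optimization: if all keys below an INSTRUMENT root group ("image", "header", ...) are written or dropped together, then counting trains for one key per root group is enough. A sketch (hypothetical key names, not the PR's code):

```python
# Pick one representative key per root group of a source's key set,
# so the missing-data check only has to walk those keys.

def representative_keys(keys):
    """Return one key per root group (the part before the first '.')."""
    reps = {}
    for key in sorted(keys):
        root = key.split(".", 1)[0]
        reps.setdefault(root, key)   # keep the first key seen per root
    return list(reps.values())

keys = {"image.data", "image.cellId", "header.trainId", "data.intensityTD"}
print(representative_keys(keys))
# One key each for the "data", "header" and "image" groups.
```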

@takluyver (Member)

I think there's a practical problem with this: as I understand it, one of the easiest ways for data to go missing is when the source is not configured in the DAQ. In that case, 100% of the data is missing, but the source name isn't there in the files, so this check wouldn't see a problem.
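To illustrate the point: a check driven only by the sources present in the files cannot flag one that was never configured, so catching that case would need an externally supplied list of expected sources (a hypothetical input here; this PR provides no such list, and the source names are made up):

```python
# Sources we expected to be recorded but which never appear in the
# files at all -- invisible to any per-source data-ratio check.

def absent_sources(expected, in_files):
    """Return expected sources with no trace at all in the files."""
    return sorted(set(expected) - set(in_files))

expected = ["MID_EXP_SAM/CAM/CAMERA1:output",      # made-up names
            "MID_EXP_IC1/ADC/1:channel_0.output"]
in_files = ["MID_EXP_SAM/CAM/CAMERA1:output"]
print(absent_sources(expected, in_files))
# The ADC channel is reported as entirely absent.
```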

My inclination is to agree with @tmichela, that we should keep the scope of extra-data-validate limited to 'are these files correctly structured?', and missing data is not an issue with the file structure. But from a user's perspective, I know the difference between a DAQ bug and a bug or a misconfiguration somewhere else in Karabo is not important, and the 'validation' that a lot of people want is 'is my data being saved?' So I don't want to get too hung up on the principle of this.

I wonder if the subjective question of 'is enough of the data that I care about present in these files?' is better answered by a separate tool than the objective one extra-data-validate tries to answer. This is roughly what I was going for with the percentages in lsxfel, but that's still distinctly clumsy.

@JamesWrigley (Member, Author)

I think the option of quickly checking for this is very useful, but as Thomas suggested, it is not universally an error condition for data to be missing on a given train. On the contrary, there are devices that, either intentionally or by their nature, emit data at less than 10 Hz. That's not a problem with the feature itself, though, only with giving it a finite default value. It should be disabled by default and easy to turn on, even permanently if the instrument chooses so for the run validator.

That makes sense, and indeed I didn't think of cases where a device doesn't send data at 10Hz (one example I can think of would be the Shimadzu camera at SPB).

I think there's a practical problem with this: as I understand it, one of the easiest ways for data to go missing is when the source is not configured in the DAQ. In that case, 100% of the data is missing, but the source name isn't there in the files, so this check wouldn't see a problem.

Hmm, that's not really the problem I'm trying to solve. If a source is missing from the DAQ then that's definitely an operator error. This validation is only looking for cases where the DAQ is configured correctly and data should have been written, but wasn't. That's a more insidious problem to discover because technically the bad sources are present in the files, so lsxfel and RunDirectory etc. will report them as being present; it's only when someone checks how many trains there are that they'd notice. If a source is missing from the DAQ then at least it'll be obvious as soon as someone tries to use it from the files in any way.

I wonder if the subjective question of 'is enough of the data that I care about present in these files?' is better dealt answered by a separate tool than the objective one extra-data-validate tries to answer.

I don't have a strong opinion on that. Personally I would lean towards keeping it in extra-data-validate for the reasons you mentioned before WRT users not caring about the distinction between formatting and 'is my data being saved'.

So my takeaways are:

  • This should be optional, and disabled by default since the risk of false positives is high.
  • It should be possible to specify which sources to include or exclude from the checks.
  • Performance can be improved by only looking at a single key for each INSTRUMENT source, and we could add an option to disable checking CONTROL data for performance.

Does that sound good? I think it's simpler if the feature stays in extra-data-validate, but I could also put it in a new command (extra-data-check-missing?).
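For the second takeaway, one way to select sources would be shell-style glob patterns, in the spirit of the source globs other extra-data tools accept; a sketch with hypothetical option semantics (this PR defines no include/exclude flags):

```python
# Filter which sources get the missing-data check, using
# shell-style patterns via the stdlib fnmatch module.
import fnmatch

def select_sources(sources, include="*", exclude=None):
    """Keep sources matching `include`, then drop those matching `exclude`."""
    selected = [s for s in sources if fnmatch.fnmatch(s, include)]
    if exclude:
        selected = [s for s in selected if not fnmatch.fnmatch(s, exclude)]
    return selected

sources = [
    "MID_EXP_AGIPD1M1/DET/0CH0:xtdf",    # made-up names
    "MID_EXP_SAM/CAM/CAMERA1:output",
    "MID_RR_SYS/MDL/TIMING",
]
print(select_sources(sources, include="MID_EXP_*", exclude="*AGIPD*"))
# Only the camera source passes both filters.
```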

@tmichela (Member)

So my takeaways are:

Yes, that sounds reasonable.

Although, since we are starting to work on a project that is meant to handle any kind of data validity issue (and data loss definitely falls within this scope), I would prefer not to add this right now, and to wait until we have a clearer plan for how that project will handle data issues and inform users about them.

What about this?

  1. We update the RunValidator to make these kinds of issues more visible and configurable, since that is the original problem you're trying to solve.
  2. We discuss the direction this validation project should go (and this specific check as part of it) when Robert is back (in 2 weeks).

@JamesWrigley (Member, Author)

Yep yep, that sounds good to me 👍
