
Skip empty datasets #192

Open · wants to merge 11 commits into master

Conversation

@tmichela (Member) commented Jun 9, 2021

Fixes #189

I simplified the case where we skip a dataset when it is completely empty in a file, as it is the only scenario I've seen so far where we had the index filled but no data.

@takluyver (Member) left a comment

Thanks! I wonder if any of these places should warn that there's missing data where the index says there should be data.

I'd like to get @philsmt's thoughts on this. The idea of using .select(..., require_all=True) is that you know the selected data all exists for the same set of trains, so you can get arrays and use them against one another easily. But that only looks at the indexes. So this would mean that even after using require_all, you could get arrays missing some or all trains.

Of course, in that case it would break either way, but delaying the error might make it harder to understand.
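To make the concern concrete, here is a minimal sketch of the issue (a hypothetical helper, not the real EXtra-data internals): a require_all-style selection intersects the train IDs that each source's INDEX section claims to have, so a corrupted file whose index is filled but whose dataset is empty still passes the intersection.

```python
import numpy as np

# Hypothetical sketch, not EXtra-data's actual implementation: a
# require_all-style selection keeps only the trains that every
# source's INDEX section claims to have.
def require_all_trains(index_tids_per_source):
    common = index_tids_per_source[0]
    for tids in index_tids_per_source[1:]:
        common = np.intersect1d(common, tids)
    return common

# Both indexes claim trains 100..104, so require_all keeps all of them.
idx_a = np.arange(100, 105)
idx_b = np.arange(100, 105)
common = require_all_trains([idx_a, idx_b])
assert len(common) == 5

# But in a corrupted file, source B's dataset may actually be empty:
data_b = np.empty(0)  # INDEX says 5 entries, the dataset holds none
assert len(data_b) == 0  # reading B's arrays for the "common" trains yields nothing
```

So even after require_all, arrays read from the corrupted source can come back missing some or all of the trains the intersection promised.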

Review threads (outdated, resolved): extra_data/keydata.py (×2), extra_data/reader.py, extra_data/tests/conftest.py
@tmichela (Member, Author) commented Jun 9, 2021

I wonder if any of these places should warn that there's missing data where the index says there should be data.

Are there any other places where this can be problematic? I could also add a check in select and exclude these sources there directly.
Currently data_counts also only checks the index.

@philsmt (Contributor) commented Jun 9, 2021

Thanks for working on this!

However, I'm not fully comfortable with "just" treating it silently and returning no data; at a minimum we should raise a warning, if not an exception with an optional flag to continue reading.

The situation of an empty dataset but supposed records in INDEX represents an abnormal situation, as opposed to a pipeline source actually sending no data. I think we should go with the assumption that this was not intended (of course, not sending data may be unintended too, but that is beyond our control).
Let me illustrate it with the example Thomas gave with .select(..., require_all=True): for a properly data-less source (no records in INDEX, no data in INSTRUMENT), this will return an empty DataCollection. For a corrupted data-less source (records in INDEX, no data in INSTRUMENT), this would return a filled DataCollection. I'd argue that we should also not simply return an empty DataCollection in the second case without any additional notice. Users would then rightfully complain about EXtra-data not properly reading the data, even though lsxfel shows its presence.

Given the ubiquity of these files, we should be able to handle it once access goes beyond the INDEX section and to INSTRUMENT, but definitely make the user aware of the underlying corrupt structure.

@tmichela (Member, Author) commented Jun 9, 2021

Hmm, I don't think I agree here.

Yes, this is a problem and the file is wrongly written. But the problem is identified, and effectively the result of having an empty dataset (even with a filled index) is the same as a data source not sending data. And lsxfel will display it identically in both cases.

I don't think the user should care about these differences; raising an exception when there's a mismatch between the dataset array and the index, but not raising when both index and dataset are empty, would be confusing.
I think we should properly report that there's no data for this source in this run, and handle the details of why something is wrong in the validation code.

For a corrupted data-less source (records in INDEX, no data in INSTRUMENT), this would return a filled DataCollection

That is updated now; it would return an empty DataCollection.

@philsmt (Contributor) commented Jun 9, 2021

Yes, this is a problem and the file is wrongly written. But this problem is identified and effectively the result if having an empty dataset (even with filled index) is the same as a data source not sending data. And lsxfel will display it identically for both cases.

Are you sure about lsxfel? I glanced at the diff and did not see changes to how it reports what's written in the INDEX, i.e. it will show different results for actually-empty data and corrupted-empty data.

I don't think the user should care for these differences and raising an exception because there's a mismatch between the dataset array and the index, but not raising when both index and dataset are empty will be confusing.
I think we should properly return that there's no data for this source in this run. And handle the details of why something is wrong in the validation code.

I don't think it will be confusing. Let me try to explain the situation from an operator's perspective:
When a pipeline source is really not sending data, it's because the source is actually in some state where it's not producing data. In the simplest case, it has some rate property that is zero. There will be no data preview in Karabo.
When the DAQ writes these corrupt files, data was actually sent (else there would be no record in INDEX), but it was dropped. This is almost impossible for an operator to detect, as the source is producing data and will probably report so. They may even have a preview updating in karabo-gui.

I understand your point that the difference is academic from the file's point of view, but it's not from that of the scientist who tried to record the data they saw on the screen. And raising a warning that the data is absent due to some misconfiguration beyond the experimentalist's control is a way to allow fixing this in time. If we were 100% sure it is always caught by automatic validation (or even better, by the DAQ 😠), I'd be fine with just handling it implicitly. But we're far away from that.

@takluyver (Member)

Would it be an acceptable compromise to drop it with a warning from methods which retrieve data from multiple sources (like .train_from_id()), but leave it raising errors for methods which deal with a specific source & key? If you're specifically accessing a key which has been affected by this, the chances are that it's a problem, and we shouldn't try to hide that.

@philsmt (Contributor) commented Jun 9, 2021

That sounds like a good idea to me. I don't want to restrict the functionality when we could handle it, but for KeyData there's nothing to do anyway.

@tmichela (Member, Author) commented Jun 9, 2021

Are you sure with lsxfel? I glanced at the diff and did not see changes to directly reporting what's written in the INDEX, i.e. it will show different results for actually-empty-data and corrupted-empty-data.

Ah, I ran into this problem only for slow data, and in that case I don't think you'd see a difference with lsxfel.

And raising a warning that the data is absent due to some misconfiguration beyond the experimentalist's control is a way to allow fixing this in time. If we were 100% sure it is always caught by automatic validation (or even better, the DAQ 😠), I'd be fine with just handling it implicitly. But we're far away from that.

I agree this is important, but I still don't think this is the right place for it. You see it as a beamline operator. But more generally, users don't have a clue what the DAQ or Karabo is, or what can cause this or that issue, and I am not sure we should make their code deal with the various caveats of our files. I see DataCollection as an interface to data in files: if there's no data for a key, it returns nothing, whatever the reason. If we want to understand why there is no data, we should use other tools.
extra-data-validate returns errors for this kind of problem; lsxfel may also be a good place for warnings.

Also note that this kind of issue is seen in old files (2019); I don't think any new file has this kind of problem.

@philsmt (Contributor) commented Jun 10, 2021

Ah, maybe we set our focus on different kinds of data in this discussion. I think the changes will affect both, though.

For control data, quite likely this problem only occurs with older files; I've never seen it myself. Honestly, the problem is even worse for control data, as it can be assumed to be present for every train whenever a source is present in the files. The state of "source-existing-but-no-control-data" is an exceptional state that should not occur at all.

For fast data, the problem is still there, alive and kicking. It's still comparably easy to trigger, even if most instruments have burned their hands by now and are somewhat aware to watch out for it. And lsxfel will happily report the presence of trains, while a DataCollection would suddenly return None. We actually had a discussion in the past that lsxfel is supposed to be fast, and inspecting whether there's actually data is not within its scope.

I made a poor choice of words with "operator". I was actually referring to the "user operators", i.e. those users present at or watching the experiment remotely, looking at data previews to see whether they satisfy their needs, and triggering runs. Those roles are combined during commissioning, but in a user beamtime it's not the instrument staff deciding which data to take. You're right that the far-off PhD student tasked with analyzing the data may not see a difference. I'm not advocating sabotaging EXtra-data here, but pointing out the exceptional state. The user may well ignore it, but there's a chance it helps in diagnosing this problem, which is really hard for a non-expert to understand.

We should not intentionally assume ignorance on our users' part.

@tmichela (Member, Author)

The state of "source-existing-but-no-control-data" is an exceptional state that should not occur at all.

That is bad, but there's actually even worse: slow data updates not being seen by the DAQ, so values in the files are wrong. And that is still happening (I saw it on my last DOC shift) :(

And lsxfel will happily report the presence of trains, while a DataCollection would suddenly return None

Ah, I'm actually an idiot. So, with this merge request, lsxfel would correctly report that there is no data for that source (as info uses data_counts).

I feel this discussion goes beyond the scope of this PR. I merely want to handle another case of malformed files that breaks the API... without changing the API's behavior.
Currently, if there's no data, we skip it, regardless of the cause. If I start introducing errors or warnings, that introduces an inconsistency in the API, as the information given to the user will vary according to the reason for the absence of data:

  • empty datasets (with index) -> warning
  • device not recording any data (ex: device off) -> no warning
  • invalid trainId -> no warning
  • any other problem -> no warning

The same cause (no data for a source at a given train) leads to different results, and as a user I would be confused by that.

We can discuss a modification of the API for that, but that is a larger change and would require consistency. I can imagine, for example, raising an error if a selection returns an empty collection (a more abstract case that includes this specific issue).

@takluyver (Member)

I'm somewhat unsure about this now.

The case where the DAQ doesn't record any data, but still makes index entries as though it did is a DAQ bug - whereas simple missing data is not. We shouldn't prevent people retrieving other data that was correctly recorded, but I think this has gone too far in the direction of hiding a serious problem. If users specifically try to get data which is affected by this problem, I think they should get an error, so it's clear that the files are broken.

So I'm currently thinking:

  • Interfaces which retrieve data from a single source & key (KeyData & similar) should continue to error like they do in master (except if we want to catch those errors and re-raise them with some more detailed message).
  • Interfaces which get data from several keys (.trains(), .train_from_id(), .train_from_index()) should handle this case like missing data, so they can still return other data which is not affected.
  • Inspection interfaces like lsxfel... I'm not sure. Do we have any idea how much slower checking for this would be? ITDM have just worked on writing the indexes at the beginning of the file, which should be a nice speedup for this kind of inspection; it would be a shame if we then went and made it slower again.
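A rough sketch of that split, with hypothetical helper and exception names (this is not the actual EXtra-data code): single-key access keeps raising, while multi-source access warns and skips the broken source.

```python
import warnings
import numpy as np

# Hypothetical sketch of the proposed behaviour split; the function and
# exception names are illustrative, not EXtra-data's real API.
class CorruptedIndexError(Exception):
    """INDEX claims entries that the dataset does not contain."""

def read_key(index_count, dataset):
    # Single source & key access: keep raising, as in master.
    if index_count > 0 and len(dataset) == 0:
        raise CorruptedIndexError("index has entries but dataset is empty")
    return dataset

def train_data(sources):
    # Multi-source access: warn and skip the broken source, return the rest.
    out = {}
    for name, (index_count, dataset) in sources.items():
        if index_count > 0 and len(dataset) == 0:
            warnings.warn(f"{name}: index has entries but dataset is empty")
            continue
        out[name] = dataset
    return out

sources = {
    "good_source": (3, np.arange(3)),
    "corrupt_source": (3, np.empty(0)),  # DAQ bug: index filled, no data
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    data = train_data(sources)
assert list(data) == ["good_source"] and len(caught) == 1
```

The point of the split is that a user iterating over trains still gets the healthy sources, while someone who explicitly asks for the broken key sees an error naming the corruption.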


# Remove any trains previously selected, for which this
# selected source and key group has no data.
train_ids = np.intersect1d(train_ids, source_tids)
train_ids = np.intersect1d(train_ids, key_tids)
A reviewer (Member) commented on this code:

This code now needs to:

  • Find all keys for control sources (which are often more numerous than instrument keys, IIRC), where the selection keeps all keys.
  • Do one intersection per key, rather than one per source (/source group). I found in Speed up _check_source_conflicts() #183 that NumPy set operations are not that efficient for working with many small sets. (I appreciate that it avoids a set union per file, though).

So I am concerned that this could be markedly slower in some circumstances (probably when selecting sources with many keys) than the code it replaces.
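For illustration only (hypothetical shapes, not the real reader code), the shift is roughly from one intersection per source group to one per key, so the number of np.intersect1d calls grows with the key count:

```python
import numpy as np

# Hypothetical sketch of the two strategies being compared; real control
# sources can have hundreds of keys, so the per-key version multiplies
# the number of NumPy set-operation calls (each with its own overhead).
def intersect_per_source(train_ids, source_tids):
    # Old behaviour: one intersection per source (group).
    return np.intersect1d(train_ids, source_tids)

def intersect_per_key(train_ids, key_tids_list):
    # New behaviour: one intersection per key.
    for key_tids in key_tids_list:
        train_ids = np.intersect1d(train_ids, key_tids)
    return train_ids

train_ids = np.arange(1000)
keys = [np.arange(1000)] * 200  # e.g. a control source with 200 keys
assert len(intersect_per_source(train_ids, np.arange(1000))) == 1000
assert len(intersect_per_key(train_ids, keys)) == 1000  # same result, 200 calls
```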

@tmichela (Member, Author) replied:

It is slower, but I don't think that's a problem here. Selections are not an operation you'd do very often, and you'd also not select many sources.
When testing with this branch, a selection always took less than 1 ms.

@takluyver (Member)

I also realised while looking into something related that data_counts uses pandas. I've tried to avoid using pandas for basic operations in EXtra-data (particularly things called by lsxfel), because it can be fairly slow to load. I might try to write something similar which can work without pandas.

@tmichela (Member, Author)

OK, I'm still not convinced this is right, but I seem to be in the minority. So be it.
I'll update this branch with the suggestions.

@takluyver (Member)

I'm not entirely convinced of what I'm saying either - I see the argument that missing data is missing data and it should appear the same regardless of how it came to be missing. But if we can't reach a consensus, I lean towards keeping more things as they are. We might decide to make more changes later - whereas if we change now and then want to change back, that's extra confusion.

@takluyver (Member)

@tmichela @philsmt: I don't suppose either of you have shifted your thinking on this? 🙂 I'm stuck in the middle: I can see the argument that users don't really care why missing data is missing, but I also don't want to let serious DAQ bugs pass silently.

Maybe warnings are the answer? Indicate that there's a problem with the files, but give results back as if it was regular missing data?

Development

Successfully merging this pull request may close these issues.

Ignoring empty datasets
3 participants