
Do not check for monotonic increase of timestamps in validation by default #503

Open · wants to merge 3 commits into master
Conversation

@steffenhauf (Member) commented Mar 21, 2024

While train ids should always be monotonically increasing, there is no such requirement for timestamps. A device is free to choose any timestamp for a given train id - the two are technically not coupled.

The evolved DAQ in fact introduces non-monotonically increasing timestamps by design whenever "late" data is applied, i.e. data for a train that has already left its finite buffer. Such data is applied on a best-effort basis from the earliest train still in the buffer onwards, and that train may already carry data with a newer timestamp than the late update. As a result, a train that left the buffer earlier can hold a newer timestamp than a later train that received the late data.

Consider the following scenario, where "train id" is the train id of the data and "reference train id" is the train id at which it was registered by the DAQ. The buffer length of the DAQ is 5:

| idx | train id | reference train id | p1 | p2 | t1 | t2 |
|-----|----------|--------------------|----|----|----|----|
| 0 | 42 | 42 | 1 | a | 100 | 100 |
| 1 | 46 | 47 | 3 | no update | 500 | 600 |
| 2 | 47 | 51 | 4 | no update | 600 | no update |
| 3 | 48 | 53 | 5 | c | 700 | 700 |
| 4 | 45 | 56 | 2 | b | 400 | 400 |
| 5 | 49 | 53 | 5 | e | 700 | 800 |

Here, the data with idx:=4 arrived "late". The DAQ applies it best effort, i.e. to the first train still within the buffer, which in this case is the train with idx:=2, since the train with idx:=1 has already left the buffer at this point. The file data will thus look like this:

| idx | train id | reference train id | p1 | p2 | t1 | t2 |
|-----|----------|--------------------|----|----|----|----|
| 0 | 42 | 42 | 1 | a | 100 | 100 |
| 1 | 46 | 47 | 3 | a | 500 | 600 |
| 2 | 47 | 51 | 4 | b | 600 | 400 |
| 3 | 48 | 53 | 5 | c | 700 | 700 |
| 4 | 49 | 54 | 5 | e | 700 | 800 |

Note how the timestamps now do not monotonically increase.
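
For illustration, here is a minimal sketch of that best-effort rule in Python. This is not the actual DAQ implementation; the buffer representation and the carry-forward/supersede logic are assumptions made for the sake of the example.

```python
def apply_best_effort(buffer, train_id, prop, value, timestamp):
    """Apply one update to `buffer`, a dict {train_id: {prop: (value, timestamp)}}
    of the trains currently held by the DAQ (assumed non-empty).

    A "late" update, i.e. one for a train that has already left the buffer,
    falls back to the earliest train still buffered. From there the value is
    carried forward until a train already holds a newer value for the same
    property - which is exactly where non-monotonic timestamps can appear.
    """
    oldest_buffered = min(buffer)
    target = max(train_id, oldest_buffered)
    for tid in sorted(buffer):
        if tid < target:
            continue
        current = buffer[tid].get(prop)
        if current is not None and current[1] > timestamp:
            break  # a newer update already supersedes this one
        buffer[tid][prop] = (value, timestamp)
```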

@steffenhauf changed the title from "Do not check for monitonic increase of timestamps in validation by default" to "Do not check for monotonic increase of timestamps in validation by default" on Mar 21, 2024
@steffenhauf (Member, Author)

I'm not sure what to do regarding the failing coverage check - as far as I can tell, coverage decreased because of line wrapping.

@takluyver (Member)

I understand that we can get this while the DAQ itself is behaving as expected. But do we still consider that the data recording system as a whole has malfunctioned? It seems like either the stored value at idx 1 is wrong because a valid update didn't make it to the DAQ in time, or the timestamp associated with idx 2 is wrong because something incorrectly backdated a change that really happened later.

So it seems reasonable to check for this and mark it as a problem, even if it's not a bug in the DAQ. If there are contexts where this occurs a lot and we want to check for other issues, perhaps the check should be opt-out rather than opt-in?
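
For reference, the kind of check under discussion could look roughly like the sketch below. This is not the existing extra-data-validate code, just a plain numpy illustration.

```python
import numpy as np

def non_monotonic_indices(timestamps):
    """Return the indices at which a timestamp is smaller than its predecessor."""
    ts = np.asarray(timestamps, dtype=np.int64)
    return np.flatnonzero(np.diff(ts) < 0) + 1

# t2 column from the file table above: the drop from 600 to 400 is flagged.
print(non_monotonic_indices([100, 600, 400, 700, 800]))  # -> [2]
```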

@takluyver (Member)

(And don't worry about the coverage measurement - it's useful to have the measurements and the reminder, but we don't generally insist that every changed line must be covered, and codecov can give odd results on PRs in any case)

@steffenhauf (Member, Author)

If the data is late, then the data recording system as a whole is misbehaving. But I would argue we should track this e.g. through the new ERRATA section, whose purpose is to inform that data could not be stored fully, or normally, or both. In my view there is no implied coupling between monotonically increasing train ids and timestamps (at least for Karabo).

Consider for instance a scenario where a device connects to hardware which has an internal buffer, to which it writes its measurement data non-sequentially and at large intervals. The device pulls all entries in the buffer every so often, and they are returned in a readout order that is not necessarily the write order. It sends out one buffer entry per train, and because the developer knows that the data might be quite old (by DAQ standards), they attach the train id of when the data was read out to the train, and keep the timestamp as recorded by the hardware. In my view, while hypothetical (I'm not aware of such a device existing), this is a completely valid readout approach:

  • the developer took care that data can be digested by a DAQ with a finite train buffer length
  • the developer preserved as much of the time information as possible, i.e. the timestamps
  • the DAQ will ingest the data and also preserve the information given to it.

I don't see why this would be an error. Timestamps and train ids are not coupled in Karabo exactly for this reason.

@takluyver (Member)

> because the developer knows that the data might be quite old (by DAQ standards), they attach the train id of when the data was read out to the train, and keep the timestamp as recorded by the hardware...

You're obviously the expert here, but this sounds wrong to me? Or if that kind of thing is really expected, we might need to substantially rethink how people access data 😞. We assume that data saved with a train ID represents values physically measured at the time of that train, not when one piece of hardware/software sends it to another. This assumption is fundamental to correlating data from different sources; it's baked into the design of EXtra-data and into what we've taught people over the last few years. If it's not reliable, figuring out what the value really was for a given train becomes much more complicated.

So what I'd expect to happen in that scenario is that something (™️) takes the hardware timestamps and translates them to train IDs, so the DAQ can at least make an effort to write the correct values for the trains they correspond with.
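
As an illustration of that translation step, a sketch is below. The function name and the availability of a per-train timestamp axis are assumptions, not an existing Karabo or DAQ API.

```python
import numpy as np

def timestamps_to_train_ids(hw_timestamps, train_ids, train_timestamps):
    """Map each hardware timestamp onto the last train whose timestamp is not
    later than it, so values can be recorded against the trains they belong to.
    `train_timestamps` must be sorted and aligned with `train_ids`.
    """
    idx = np.searchsorted(train_timestamps, hw_timestamps, side="right") - 1
    return np.asarray(train_ids)[np.clip(idx, 0, len(train_ids) - 1)]
```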

Even with that, though, I don't think it's viable for the update interval to be arbitrarily long, because that means we can miss an arbitrary amount of data at the end of a run. And worse, I don't think we can even represent missing CONTROL data at present, so there would just be wrong data.

@steffenhauf (Member, Author) commented Mar 21, 2024

Don't worry, you don't have to scrap all (or any) assumptions. The something (™️) here is the Karabo devices, and we take great care that the train id matches the point in time when the data is valid. I chose this hypothetical example to illustrate why there is no enforced coupling of timestamps and train ids on the control system side. Any real-world device would either have had reason to treat the train ids at the time of readout as valid for the data, e.g. because the developer had reason to believe the timestamps are inconsistent, or would have aligned the train ids with the timestamps.

The unique time correlator is the train id, and a device developer will take care that data assigned one train id will correlate in time with data with the same train id. The timestamps usually will as well, but are "just" metadata, and there's nothing that technically enforces a correlation. In other words, one should still be able to correlate data by train id, even if no timestamp was assigned to it.

You are correct that the acceptance interval of the DAQ is finite in length, so there is always a chance that data generated after the interval, but with a train id inside a run's train id range, was not accepted. You would indeed need an infinite interval to handle all cases.

In practice, with a combination of ERRATA, best-effort application of late values as they come, and InfluxDB, we are pretty well set up though.

The whole motivation of this suggestion is not to question our data correlation model or cast it in doubt. In my view it covers all cases that are coverable with reasonable (and even more) effort, and we go to quite some lengths to at least log when the DAQ technically had to resort to fallbacks.

What I want to motivate is that there are better indicators of late or disordered data than non-monotonically increasing timestamps, such as evaluating the ERRATA section. Thinking about it, maybe one should strive for a combination of both indicators: a non-monotonic timestamp that cannot be explained by ERRATA logs is a validation error, while one that can be explained warrants more of an informational output?
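
A rough sketch of that combined heuristic is below; the ERRATA layout is assumed here to be a simple set of train IDs for which the DAQ logged a fallback write, which may well differ from the real section.

```python
import numpy as np

def classify_timestamp_drops(train_ids, timestamps, errata_train_ids):
    """Yield (train_id, severity) for each non-monotonic timestamp:
    'info' if an ERRATA entry explains it, otherwise 'error'."""
    ts = np.asarray(timestamps, dtype=np.int64)
    for i in np.flatnonzero(np.diff(ts) < 0) + 1:
        tid = train_ids[i]
        yield tid, ("info" if tid in errata_train_ids else "error")
```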

@takluyver (Member)

Thanks! So in the scenario you initially described, with an out-of-order update arriving too late to be fully backdated, there would be some information in ERRATA about this? What would that look like?

For extra-data-validate, I'd still be inclined to say that it should flag this scenario as a problem by default - it's a problem in the saved data that we can detect, even if it's not a DAQ bug.

@philsmt (Contributor) commented Mar 25, 2024

I can definitely see both sides of the argument. Even if we accept that there is no fixed correlation between train ID and timestamp, as illustrated by Steffen (and thus do not consider it an error), we're left with the fact that it's still unexpected or out of the ordinary, and thus extra-data-validate would be a tool to draw attention to it.

Does it make sense now to introduce two tiers of validation problems, something akin to an error (critical to properly accessing the data) and a warning? It seems like extra-data-validate is increasingly used in a CI sense, requiring a boolean result on whether we consider a file acceptable, as well as in the historical sense of running it manually to ask "is something up in the files?". The check for monotonic timestamps would naturally fall into the latter category.
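
A sketch of what two tiers could look like; the class and field names are hypothetical, not extra-data's current API.

```python
from dataclasses import dataclass

@dataclass
class ValidationProblem:
    check_id: str   # e.g. "control_timestamps_order"
    message: str
    severity: str   # "error" or "warning"

def exit_code(problems):
    """CI-style boolean result: fail only on errors, report warnings without failing."""
    return 1 if any(p.severity == "error" for p in problems) else 0
```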

@takluyver (Member)

Thanks! That may well make sense, although you could argue that problems which produce valid files with potentially incorrect data are more severe than problems which produce files we can't read. They're certainly different kinds of problem, though, so it makes sense to allow separating them.

Maybe, like code linter tools, we should have IDs for the different checks we do, so we can have a standard CLI for enabling/disabling each check, or enabling a check but not counting it as a failure for the exit code. Or maybe that's overkill. 🤔

@takluyver (Member)

From our discussion a few weeks ago, I've had a go in #522 at making specific checks skippable with an option like:

```
extra-data-validate --skip control_timestamps_order
```
