Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add concept of legacy source names #527

Merged
merged 10 commits into from
Jun 24, 2024
Merged

Add concept of legacy source names #527

merged 10 commits into from
Jun 24, 2024

Conversation

philsmt
Copy link
Contributor

@philsmt philsmt commented Jun 12, 2024

The calibration team wants to discontinue the practice of writing corrected data under the same source name as raw data, but instead switch the type part of the source name from DET to CORR. For limited backwards compatibility, we aim to insert soft links under the old source name in corrected files. While this would work transparently in EXtra-data, some more support for this concept seems prudent, which I'd like to start with this MR.

It introduces the concept of legacy sources throughout the FileAccess, DataCollection and SourceData APIs. For now I limited this solely to tracking, i.e. it has no impact on any business logic when accessing files or data:

  • When reading sources, FileAccess will probe the source's leaf object for being a soft link and record its target. The source is counted as a regular source otherwise. For performance reasons, I limited this to INSTRUMENT sources for now (simple benchmarking suggests tens of ms for this operation when cold)
  • Legacy sources are part of the run_files_map and will invalidate any existing cache.
  • SourceData objects know whether they're a legacy source through a non-None value of SourceData.legacy, DataCollection has a legacy_sources: dict property.
  • When accessing the SourceData object of a legacy source via a DataCollection, a DeprecationWarning is emitted.
  • DataCollection.info() tracks legacy sources separately in their own section alongside their target.

For now it does not yet touch data='all', which will still kick out the raw data. I would add tests once we're happy with the design.

As part of some earlier tests, I also added support for multiple XTDF detectors in a single DataCollection. I'm happy to remove this, but it seems useful to have for the future (e.g. AGIPD1M and AGIPD4M in a single run).

@takluyver @tmichela @dgoeries

@takluyver
Copy link
Member

Just to note, as part of the name change, we'll also want #468 (or something like it) so that the components layer can pick between corrected & raw data when both are 'visible' under different names.

extra_data/read_machinery.py Outdated Show resolved Hide resolved
extra_data/reader.py Outdated Show resolved Hide resolved
extra_data/sourcedata.py Outdated Show resolved Hide resolved

if sd.is_legacy:
warn(f"{source} is a legacy name for {self.legacy_sources[source]}. "
f"Access via this name will be removed at a future data.",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicking comment

Suggested change
f"Access via this name will be removed at a future data.",
"Access via this name will be removed at a future data.",

Also, I assume here you don't mean at a future date but for future data.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did originally mean date, but actually referencing future data is a good idea. We're not going to remove the legacy name from existing files, just stop adding it. Thanks!

@philsmt philsmt marked this pull request as ready for review June 16, 2024 16:13
@philsmt
Copy link
Contributor Author

philsmt commented Jun 16, 2024

This PR is now fully ready for review, I added tests for the new APIs.

For this purpose, extra_data.mockdata.detectors.DetectorModule can now optionally inject a legacy name and I added a new fixture mock_spb_modern_proc_run containing this. This could easily be extended to any device, but I would use this legacy pattern sparingly so we can hold off for now.

Comment on lines +158 to +160
# Get FileAccess for first module.
fa = sorted(RunDirectory(mock_modern_spb_proc_run).files,
key=lambda fa: fa.filename)[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min() also takes a key= argument to do things like this. I'm not particularly requesting a change, I just remembered a neat feature.

@takluyver
Copy link
Member

Thank-you, LGTM

@philsmt
Copy link
Contributor Author

philsmt commented Jun 20, 2024

Just to make sure, we all agree to invalidate existing run files maps?

@takluyver
Copy link
Member

I'm OK with that. I don't want to invalidate that cache too frequently - that defeats the point of caching - but once in a while I don't think it's a big deal.

If we wanted to get clever, we could say we invalidate it only for proc data, since it doesn't seem likely that the raw data will ever have links like this. But I suspect it's not worth the extra complexity to manage that, as the lower levels of EXtra-data don't have a raw/proc distinction so far.

@philsmt
Copy link
Contributor Author

philsmt commented Jun 21, 2024

For the record, I did some benchmarking picking random files across all proposals to avoid caching. Looping over all INSTRUMENT sources, probing a single source whether it's a soft link or not takes between 500 us and 1 ms, 3.5 ms on average per file.

@philsmt philsmt merged commit 58a0da2 into master Jun 24, 2024
8 checks passed
@takluyver takluyver added this to the 1.18 milestone Sep 20, 2024
@tmichela tmichela deleted the feat/corr-source-names branch October 1, 2024 10:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants