Add concept of legacy source names #527

philsmt · 2024-06-12T15:08:37Z

The calibration team wants to stop writing corrected data under the same source name as raw data and instead switch the type part of the source name from DET to CORR. For limited backwards compatibility, we aim to insert soft links under the old source name in corrected files. While this would work transparently in EXtra-data, some more explicit support for this concept seems prudent, which I'd like to start with this MR.

It introduces the concept of legacy sources throughout the FileAccess, DataCollection and SourceData APIs. For now I limited this solely to tracking, i.e. it has no impact on any business logic when accessing files or data:

- When reading sources, FileAccess probes each source's leaf object to check whether it is a soft link and, if so, records its target. Otherwise the source is counted as a regular source. For performance reasons, I limited this to INSTRUMENT sources for now (simple benchmarking suggests tens of ms for this operation when cold).
- Legacy sources are part of the run_files_map, so this change will invalidate any existing cache.
- SourceData objects know whether they're a legacy source through a non-None value of SourceData.legacy; DataCollection has a legacy_sources: dict property.
- When accessing the SourceData object of a legacy source via a DataCollection, a DeprecationWarning is emitted.
- DataCollection.info() lists legacy sources in their own section, each alongside its target.
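The tracking described in this list can be pictured with a small standalone sketch. It is not EXtra-data's implementation: `MiniCollection` is a hypothetical stand-in, the source names are made up for illustration, and only the behaviour described here (a `legacy_sources` mapping plus a `DeprecationWarning` on access by legacy name) is modelled.

```python
import warnings


class MiniCollection:
    """Hypothetical stand-in for a collection that tracks legacy source names."""

    def __init__(self, sources, legacy_sources):
        self._sources = dict(sources)               # canonical name -> data
        self.legacy_sources = dict(legacy_sources)  # legacy name -> canonical name

    def __getitem__(self, source):
        canonical = self.legacy_sources.get(source)
        if canonical is not None:
            # Access by legacy name still works, but the caller is warned.
            warnings.warn(
                f"{source} is a legacy name for {canonical}",
                DeprecationWarning,
                stacklevel=2,
            )
            source = canonical
        return self._sources[source]


# Illustrative names only; real source names depend on the instrument.
run = MiniCollection(
    sources={"SPB_DET_AGIPD1M-1/CORR/0CH0:output": "corrected data"},
    legacy_sources={
        "SPB_DET_AGIPD1M-1/DET/0CH0:output": "SPB_DET_AGIPD1M-1/CORR/0CH0:output"
    },
)
```

In this sketch, both names resolve to the same data; only the legacy name triggers the warning.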

For now it does not yet touch data='all', which will still kick out the raw data. I would add tests once we're happy with the design.

As part of some earlier tests, I also added support for multiple XTDF detectors in a single DataCollection. I'm happy to remove this, but it seems useful to have for the future (e.g. AGIPD1M and AGIPD4M in a single run).

@takluyver @tmichela @dgoeries

takluyver · 2024-06-13T08:41:10Z

Just to note, as part of the name change, we'll also want #468 (or something like it) so that the components layer can pick between corrected & raw data when both are 'visible' under different names.

kakhahmed · 2024-06-14T14:07:48Z

extra_data/reader.py

+
+        if sd.is_legacy:
+            warn(f"{source} is a legacy name for {self.legacy_sources[source]}. "
+                 f"Access via this name will be removed at a future data.",


nitpicking comment

Suggested change

-                 f"Access via this name will be removed at a future data.",
+                 "Access via this name will be removed at a future data.",

Also, I assume here you don't mean "at a future date" but "for future data"?

philsmt · 2024-06-16

I did originally mean date, but actually referencing future data is a good idea. We're not going to remove the legacy name from existing files, just stop adding it. Thanks!

philsmt · 2024-06-16T16:15:39Z

This PR is now fully ready for review; I added tests for the new APIs.

For this purpose, extra_data.mockdata.detectors.DetectorModule can now optionally inject a legacy name, and I added a new fixture mock_modern_spb_proc_run containing this. This could easily be extended to any device, but I would use this legacy pattern sparingly, so we can hold off for now.

takluyver · 2024-06-17T13:23:39Z

extra_data/tests/test_file_access.py

+    # Get FileAccess for first module.
+    fa = sorted(RunDirectory(mock_modern_spb_proc_run).files,
+                key=lambda fa: fa.filename)[0]


min() also takes a key= argument to do things like this. I'm not particularly requesting a change, I just remembered a neat feature.
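As a quick illustration of that remark: `min()` with `key=` returns the same element as sorting and taking the first item, without building the sorted list. `FakeFile` and the file names below are hypothetical stand-ins for objects with a `filename` attribute.

```python
from collections import namedtuple

# Hypothetical stand-in for objects that expose a `filename` attribute.
FakeFile = namedtuple("FakeFile", "filename")

files = [
    FakeFile("CORR-R0001-AGIPD01-S00000.h5"),
    FakeFile("CORR-R0001-AGIPD00-S00000.h5"),
    FakeFile("CORR-R0001-AGIPD02-S00000.h5"),
]

# Sorting the whole list just to take the first element...
first_sorted = sorted(files, key=lambda f: f.filename)[0]

# ...picks the same element as a single pass with min().
first_min = min(files, key=lambda f: f.filename)
```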

takluyver · 2024-06-17T13:25:37Z

Thank you, LGTM

philsmt · 2024-06-20T07:03:08Z

Just to make sure, we all agree to invalidate existing run files maps?

takluyver · 2024-06-20T12:32:40Z

I'm OK with that. I don't want to invalidate that cache too frequently - that defeats the point of caching - but once in a while I don't think it's a big deal.

If we wanted to get clever, we could say we invalidate it only for proc data, since it doesn't seem likely that the raw data will ever have links like this. But I suspect it's not worth the extra complexity to manage that, as the lower levels of EXtra-data don't have a raw/proc distinction so far.
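For readers unfamiliar with this kind of invalidation, here is a minimal sketch of one common approach: treat cached entries that lack a newly required field as stale. The entry layout, the key names and `usable_entries` are assumptions made for illustration and are not taken from EXtra-data's cache format.

```python
# Hypothetical set of fields a cache entry must have to be trusted.
REQUIRED_KEYS = {"filename", "instrument_sources", "legacy_sources"}


def usable_entries(cached_entries):
    """Keep only entries that contain every required key."""
    return [e for e in cached_entries if REQUIRED_KEYS <= e.keys()]


old_entry = {"filename": "a.h5", "instrument_sources": ["X"]}
new_entry = {
    "filename": "b.h5",
    "instrument_sources": ["Y"],
    "legacy_sources": {},
}

# The entry written before the new field existed is dropped and would be rebuilt.
kept = usable_entries([old_entry, new_entry])
```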

philsmt · 2024-06-21T11:49:53Z

For the record, I did some benchmarking, picking random files across all proposals to avoid caching. Looping over all INSTRUMENT sources, probing whether a single source is a soft link takes between 500 µs and 1 ms, or 3.5 ms per file on average.
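For context, a probe of this kind can be written with h5py's `Group.get(..., getlink=True)`, which returns the link object instead of following it. The sketch below is an assumption about what such a check might look like, not the code from this PR; `soft_link_target`, the file name and the `DEV/...` source names are hypothetical.

```python
import h5py


def soft_link_target(h5file, source):
    """Return the link target if the source's leaf is a soft link, else None."""
    parent, _, leaf = f"INSTRUMENT/{source}".rpartition("/")
    link = h5file[parent].get(leaf, getlink=True)
    return link.path if isinstance(link, h5py.SoftLink) else None


# Build a tiny in-memory file with one real source and one soft-linked name.
with h5py.File("legacy-demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_group("INSTRUMENT/DEV/CORR/0CH0:xtdf")
    f.create_group("INSTRUMENT/DEV/DET")
    f["INSTRUMENT/DEV/DET/0CH0:xtdf"] = h5py.SoftLink(
        "/INSTRUMENT/DEV/CORR/0CH0:xtdf"
    )

    targets = {
        source: soft_link_target(f, source)
        for source in ("DEV/CORR/0CH0:xtdf", "DEV/DET/0CH0:xtdf")
    }
```

Only the link metadata is read here, so the target does not need to be opened to decide whether a name is legacy.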

takluyver reviewed Jun 13, 2024

kakhahmed reviewed Jun 14, 2024

philsmt added 9 commits June 16, 2024 16:18

Support multiple XTDF detectors per collection in info()

bdbc5b5

Track soft links to sources as legacy source names in FileAccess API

c7f5d9c

Add warnings for legacy sources and annotate as such in info()

082f28b

(fixup) make DETECTOR_SOURCE_RE a bit more stringent again

60e08db

(fixup) rename SourceData.legacy to SourceData.canonical_name

b8acf14

(fixup) condense XTDF legacy sources in info

c62d48c

Optimize DataCollection._check_data_missing() to skip legacy sources

dedba8b

Enable mock XTDF sources to have a legacy name and add such tests for…

14e685b

… FileAccess

(fixup) add legacy source tests for DataCollection and SourceData

2682eba

philsmt force-pushed the feat/corr-source-names branch from 50fc57d to 2682eba Compare June 16, 2024 15:57

(fixup) clarify legacy source warning

f106f7d

philsmt marked this pull request as ready for review June 16, 2024 16:13

takluyver reviewed Jun 17, 2024

philsmt merged commit 58a0da2 into master Jun 24, 2024
8 checks passed

takluyver added this to the 1.18 milestone Sep 20, 2024

tmichela deleted the feat/corr-source-names branch October 1, 2024 10:22
