European-XFEL · philsmt · Nov 30, 2022 · Dec 1, 2022 · Dec 1, 2022 · Dec 1, 2022
diff --git a/docs/data_format.rst b/docs/data_format.rst
@@ -1,119 +1,282 @@
+
 Data files format
 =================
 
-The main unit of data this tool works with is a *run*. A run is data collected
-in a specific period, and each research proposal given beantime at European XFEL
-may collect hundreds of runs.
+Scientific data at European XFEL is saved as structured HDF5 files in format
+called EXDF. Each file contains data for one or more *sources* having multiple
+*keys* that carry values for certain *trains* [1]_. Most sources, in particular for
+raw data, correspond to an entity in the Karabo control system called a *device*,
+which may manage physical hardware but also purely perform software functions.
+
+In most cases at European XFEL, such files are encountered with raw data recorded
+during an experiment by its data acquisition system (DAQ) or with automatically
+processed data created afterwards. Here, sources and trains are generally split
+across multiple files with sources grouped by an *aggregator* and trains by
+enumerated *sequences*. Sources with a very large data volume often have their own
+aggregator (e.g. very fast digitizers) or are even be spread across multiple of
+them (e.g. multi-module detectors). It is not guaranteed that the same sequence
+for different aggregators will cover the same set of trains.
+
+A continuous DAQ recording over a period of time is a *run* and includes all the
+structured HDF5 in a single directory. Here these files follow a naming pattern::
+
+    RAW-R0348-AGIPD04-S00002.h5
+
+which denotes an HDF5 file for the ``RAW`` data class of aggregator ``AGIPD04``
+of sequence ``2`` in run ``348``. Within a run the grouping of sources
+into  aggregators does not change. Each *proposal* can collect any number of runs
+during their granted beamtime.
+
+This document describes the most recent version **1.3** of this file format. While
+earlier version are used for data written at the time, their use is discouraged
+for any new files. The appendix lists the differences between each versions.
+
 
-A run is stored as a directory containing HDF5 data files from different
-sources. These fall into two important categories:
+Data sources
+------------
 
-1. Detector data, from the main X-ray detectors in the various experiments.
+Sources differ in the semantics of data generation and validity by either being
+*control data* (also called slow or broker data) or *instrument data*
+(also called fast, pipeline or XTDF data). In terms of the Karabo control system,
+control sources represent device properties while instrument sources are the data
+sent by device output channels. As such, it is recommended to follow the Karabo
+device naming convention [2]_ for source names.
 
-   - Each detector module writes separate files, e.g. ``RAW-R0348-AGIPD00-S00000.h5``.
-     The number in the third part of the filename identifies the module (0 in
-     this example).
-   - The detectors in use as of April 2018 are *LPD* and *AGIPD* in the file
-     names. Each has 16 modules numbered 0–15.
+* Control data represents a steady state that retains its value until changed again.
+  Examples for this are motor positions or detector configuration like the frame rate.
+  In general control data has a single value for every train in a file, as any train
+  without it changing carries the value of the previous train. As such each value is
+  accompanied by the timestamp this value became current. Based on the Karabo device
+  naming convention, control sources should follow the pattern ``DOMAIN/TYPE/MEMBER``.
+  Their data is saved in the ``CONTROL`` and ``RUN`` top-level groups described
+  further below.
 
-2. All the other data, such as motor positions, beam measurements, etc., are
-   recorded through a *data aggregator*, and stored in a file with the letters
-   *DA* in the name, e.g. ``RAW-R0450-DA01-S00000.h5``.
+* Instrument sources represent momentary data that is only valid for a single train
+  or pulse. This covers most scientific detectors such as digitizers, cameras and
+  more. These sources are never guaranteed to have data for every train, but may
+  also have multiple and varying entries per train. Their names should follow the
+  pattern ``DOMAIN/TYPE/MEMBER:PIPELINE/GROUP``. The data is saved in the ``INSTRUMENT``
+  top-level groups described further below.
 
-The last part of the file name (e.g. ``S00000``) is a sequence number. The
-data within a run may be broken into a number of sequences. So
-``RAW-R0450-DA01-S00000.h5`` and ``RAW-R0450-DA01-S00001.h5`` will contain data
-from the same set of devices, with sequence 1 continuing just after the end of
-sequence 0. Though all data within a run may be broken into sequences, different
-data sets do not necessarily break at the same point, so the various 'sequence 0'
-data files in a run do not have corresponding data.
+  The last component ``GROUP`` or index group is generally considered part of the key
+  rather than the source itself. In a Karabo perspective it is equivalent to the
+  top-level key in pipeline data. In files however, an instrument source may have
+  a different number of entries per train for each of its index groups and it is
+  thus treated differently than keys further down in the hierarchy.
 
 
 HDF5 file structure
 -------------------
 
+Every HDF5 file must contain the top-level groups ``/METADATA`` and ``/INDEX``.
+Depending on the included sources, there may additionally be the groups
+``/CONTROL``, ``/RUN`` and ``/INSTRUMENT``.
+
+
 METADATA
 ~~~~~~~~
 
-The ``METADATA`` group in an HDF5 file contains three datasets, each of which
-is a 1D array of strings:
+The ``METADATA`` group in an HDF5 file contains auxiliary information as individual
+datasets, most of which are constant across a run and or even proposal. Even when only
+containing a single entry, all these datasets are 1D with a length of 1 or more.
+
+For any given collections, not all of these datasets may be present depending on how
+it was created. The following datasets however are considered mandatory to allow
+proper interpretation of a file's structure:
+
+* ``dataFormatVersion [str]`` Data file format version of this file.
+
+* ``dataSources`` describes the sources in this file in three different representations.
+
+  * ``dataSources/root [str]`` lists the top-level group a source is found in, ``CONTROL``
+    or ``INSTRUMENT``.
+
+  * ``dataSources/deviceId [str]`` lists the source names itself. For instrument sources,
+    this includes the top-level key called index group, and the same source may thus be listed
+    multiple times for each of its index groups.
 
-* ``METADATA/dataSourceId`` lists data groups in the file. The values are either:
+  * ``dataSources/dataSourceId [str]`` lists the combination of the prior two, i.e. the
+    full path to each source's index group.
 
-  * ``CONTROL/`` followed by a Karabo device name, e.g.
-    ``CONTROL/SA1_XTD2_XGM/DOOCS/MAIN``.
-  * ``INSTRUMENT/`` followed by a Karabo device name, a colon, the name of the
-    output channel, a slash, and the name of a data group (?), e.g.
-    ``INSTRUMENT/SA1_XTD2_XGM/DOOCS/MAIN:output/data``
+For scientific data, it is recommended to include the following datasets to describe their
+origin:
 
-* ``METADATA/deviceId`` lists the part of each *dataSourceId* after the first
-  slash.
-* ``METADATA/root`` lists the parts before the first slash, so
-  ``concat(root, "/", deviceId) == dataSourceId``.
+* ``creationDate [str]`` [what was this time again?]
 
-These three data sets always have the same number of values. They may be padded
-with empty strings, so empty entries are ignored.
+* ``updateDate [str]``  [probably last change to this file?]
+
+* ``proposalNumber [uint32]`` Proposal number this file belongs to.
+
+* ``runNumber [uint32]``  Run number this file belongs to.
+
+* ``sequenceNumber [uint32]``  Sequence number this file has for the aggregator it belongs to.
+
+Raw data recorded with the EuXFEL DAQ software will contain the datasets to indicate the
+software versions used in this process:
+
+* ``daqLibrary [str]`` EuXFEL DAQ software version used to write this file
+
+* ``karaboFramework [str]`` Karabo framework version the DAQ software ran in
 
 INDEX
 ~~~~~
 
-``INDEX/trainId`` is a 1D array of uint64, listing the pulse trains which the
-file holds data for. This is crucial, since all other data has to be matched up
-according to train IDs.
+The ``INDEX`` group contains information about the *trains* contained in the file and how
+the actual data rows in ``CONTROL`` and ``INSTRUMENT`` relate to them. All datasets in this group
+are 1D and have a length identical to the number of trains in the file.
+
+There are three datasets at the top-level of this group:
 
-For each entry in ``METADATA/deviceId``, the ``INDEX`` group contains two
-datasets, both uint64 data with the same length as the train IDs:
+* ``trainId [uint64]`` lists the global train ID for this train entry.
 
-* ``INDEX/{ deviceId }/count``: for each train ID, how many data samples did
-  this device record. This may be 0 if no data was recorded for this train.
-* ``INDEX/{ deviceId }/first``: for each train ID, the index at which the
-  corresponding data starts in the arrays for this device.
+* ``timestamp [uint64]`` lists the number of nanseconds since the Epoch for this train entry.
 
-Thus, to find the data for a given train ID, we could do::
+* ``flag [int32]`` lists ``1`` for safe train entries and ``0`` for train entries where the timing
+  may be unreliable, e.g. because it is attributed to the wrong train ID. For DAQ recordings up
+  to version **1.2**, this is only the case when a source different than the timeserver sent the first
+  data entry for a given train.
 
-    train_index = trainIds.index(train_id)
-    first = device_firsts[train_index]
-    count = device_counts[train_index]
-    train_data = data[first : first+count]
+* ``origin [int32]`` lists the actual source index into ``METADATA/dataSources`` that sent that first
+  entry for each given train entry, or ``-1`` if it is the timeserver. For DAQ recordings up to
+  version **1.2**, every entry with a non-negative ``origin`` will have a ``flag`` of ``0``.
 
-Control data is always (?) recorded once per train, so *count* is 1 and *first*
-counts up from 0 to the number of trains. Instrument data is more variable.
+For each source in ``METADATA/dataSources/deviceId``, the ``INDEX`` group then also contains two
+datasets that map the train entries in the top-level datasets above to each source's data rows
+in ``CONTROL`` or ``INSTRUMENT``:
+
+* ``INDEX/{deviceId}/count [uint64]`` counts how many data samples did
+  this source record for each train. This may be 0 if no data was recorded.
+* ``INDEX/{deviceId}/first [uint64]`` contains the index at which the
+  corresponding data for each train starts in the arrays for this device.
+
+Thus, to find the data for a given train ID::
+
+    train_index = list(file['INDEX/trainId']).index(train_id)
+    first = file[f'INDEX/{device_id}/first'][train_index]
+    count = file[f'INDEX/{device_id}/count'][train_index]
+    train_data = file[f'INSTRUMENT/{device_id}/{key}][first:first+count]
 
-Some older files use a different index format with first/last/status instead of
-first/count. In this case, a status of 0 means that no data was recorded
-for that train.
 
 CONTROL and RUN
 ~~~~~~~~~~~~~~~
 
-For each *CONTROL* entry in ``METADATA/dataSourceId``, there is a group with
-that name in the file. This may have further arbitrarily nested subgroups
-representing different properties of that device, e.g.
-``/CONTROL/SA1_XTD2_XGM/DOOCS/MAIN/current/bottom/output``.
+For each *CONTROL* entry in ``METADATA/dataSources``, there is a group with
+that name in the file with further arbitrarily nested subgroups representing different
+keys of that source, e.g. ``CONTROL/SA1_XTD2_XGM/DOOCS/MAIN/current/bottom/output``
+for the key ``current.bottom.output`` of source ``SA1_XTD2_XG/DOOCS/MAIN``. Note that
+while the key hierarchy is expressed using groups in files, a dot is commonly used
+to separate the components.
 
 The leaves of this tree are pairs of datasets called ``timestamp`` and ``value``.
 Each dataset has one entry per train, and the ``timestamp`` record when the
-value was updated, which is typically less than once per train. The ``value``
-dataset may have extra dimensions, but in most cases it is 1D.
+current value was updated, which is typically less than once per train and thus
+likely in the past.
 
-(Does timestamp update if value is re-read but doesn't change?)
+The key groups themselves may have one or more HDF attributes attached with
+additional metadata:
+
+* ``displayedName [str]`` may denote a more exhaustive name for this key, e.g.
+  ``Complete Target Burst duration`` for ``totBurstDuration``.
+* ``alias [str]`` may specify an alternative name depending on context, e.g.
+  a hardware-specific designation for the value of a key.
+* ``description [str]`` may contain a full text explaining this key.
+* ``metricPrefixSymbol [str]`` may specify the metric prefix symbol for the unit
+  this key's values are expressed in, e.g. ``G``, ``k`` or ``n``.
+* ``unitSymbol [str]`` may specify the unit symbol this key's values are expressed
+  in, e.g. ``A``, ``Hz`` or ``eV``. Enumerations may use the symbol ``#`` and ratios
+  the symbol ``%``.
+
+EuXFEL DAQ recording often contain further attributes corresponding to attributes in
+the Karabo control system.
 
 ``RUN`` holds a complete duplicate of the ``CONTROL`` hierarchy, but each pair
-of ``timestamp`` and ``value`` contain only one entry, taken at the start of
-the run. There is still a dimension for this, so 2D value datasets in CONTROL
-have corresponding 2D datasets in RUN, but the first dimension has length 1.
+of ``timestamp`` and ``value`` contain only one entry taken at the start of
+the run. All datasets continue to be vectors, so even for scalar values the
+first dimension has length 1. It may also contain additional keys not present in
+``CONTROL``, e.g. whose values either do not change or is not relevant across trains.
 
-(Is RUN exactly duplicated in subsequent sequence files?)
 
 INSTRUMENT
 ~~~~~~~~~~
 
-For each *INSTRUMENT* entry in ``METADATA/dataSourceId``, there is a group with
-that name in the file. Each such group holds a 1D ``trainId`` dataset, and a
-number of other datasets (possibly nested in subgroups). All these datasets have
-the same length in the first dimension: this represents the successive readings
-taken. The slices defined by the corresponding datasets in *INDEX* work on
-this dimension.
+For each *INSTRUMENT* entry in ``METADATA/dataSources``, there is a group with
+that name in the file with further arbitrarily nested subgroups representing different
+keys of that source, e.g. ``INSTRUMENT/SPB_DET_AGIPD1M-1/DET/0CH0:xtdf/image/data``
+for the key ``image.data`` of source ``SPB_DET_AGIPD1M-1/DET/0CH:xtdf``. Unlike for
+*CONTROL* sources, the top-level part of the key called index group (in this example,
+``image``) is part of the entry in ``METADATA/dataSources`` to allow a variable number
+of data entries per train for each of these index groups. Note that while the key
+hierarchy is expressed using groups in files, a dot is commonly used to separate
+the components.
+
+The leafs of this tree directly contain the datasets holding the key values. Those
+datasets of the same index group of a given source have the same length in the first
+dimension, with each row representing a successive reading. The index group's ``INDEX``
+records can be used to connect them to the respective trains.
+
+As with *CONTROL* sources, the keys of *INSTRUMENT* sources may have the same HDF
+attributes attached with additional metadata.
+
+
+Format versions
+---------------
+
+1.3
+~~~
+
+The EuXFEL DAQ software is using this format version since January 2023.
+
+This section only lists the differences to past format versions.
+
+1.2
+~~~
+
+* There are no metadata attributes for keys in ``CONTROL``, ``RUN``  and ``INSTRUMENT``.
+
+The EuXFEL DAQ software used this format version between July 2021 and Februrary 2023.
+
+1.1
+~~~
+
+* ``INDEX/flag`` dataset is similar to ``INDEX/origin`` in later versions, listing the index into ``METADATA/dataSources`` of the source that sent the first entry for a given train. Unlike ``INDEX/origin`` however, the time server itself is a virtual source with index ``0`` rather than ``-1``.
+
+  **Warning:** This flips the meaning compared to earlier versions with ``0`` indicating a *safe* train and a positive number for unreliable timing.
+* ``METADATA/dataSources`` contains a static virtual source ``Karabo_TimeServer`` with an empty entry in ``METADATA/dataSources/root``.
+
+The EuXFEL DAQ software used this format version only briefly around July 2021.
+
+1.0
+~~~
+
+* ``INDEX`` group contains only the top-level datasets ``trainId``, ``timestamp``, ``flag``.
+
+The EuXFEL DAQ software used this format version between February 2020 and September 2021.
+
+0.5
+~~~
+
+**Warning:** This file format version is lacking the ``METADATA/dataFormatVersion`` dataset and can thus only be inferred from its structure.
+
+* ``INDEX`` group contains only the top-level dataset ``trainId``.
+* ``METADATA`` group is identical to ``METADATA/dataSources`` in later versions,
+  i.e. directly contains the datasets ``root``, ``deviceId`` and ``dataSourceId``.
+
+The EuXFEL DAQ software used this format version between February 2018 and April 2020.
+
+0.1
+~~~
+
+**Warning:** This file format version is lacking the ``METADATA/dataFormatVersion`` dataset and can thus only be inferred from its structure.
+
+Same as 0.5 in addition to:
+
+* ``INDEX/{deviceId}`` group specifies the mapping from trains to data rows of each source via ``first``/``last`` datasets with ``last = first + count - 1`` denoting the last row index belonging to a particular train.
+
+The EuXFEL DAQ software used this format version until April 2018.
+
+
+References
+----------
 
-The ``trainId`` dataset for each instrument group thus appears to be redundant
-with the information in INDEX.
+.. [1] Decking et al: *A MHz-repetition-rate hard X-ray free-electron laser driven by a superconducting linear accelerator*, Nature Photonics 391-397, 2020
+.. [2] European XFEL DAQ and Control systems naming convention: https://docs.xfel.eu/share/s/dDHQtDIkRUiXPr9DM6WQ-Q