Dataset directory structure #215

Gautzilla · 2024-10-22T08:29:14Z

The current dataset directory structure suffers from some flaws. For example, running an analysis that differs from a previous one only in its time period requires overwriting the previous analysis.

In this issue, I try to expose these flaws by using an example dataset in which I run 4 analyses that differ by the audio parameters (time duration, sample rate) and/or by the fft parameters (in that case no reshaping of the audio files is needed).
I'll first describe the analyses and the original dataset, then show the code snippets matching each analysis, and then the directory structure that results from these analyses.

Finally, I've added 2 draft directory structures:

- a first one that is built on top of the existing structure and simply adds the layers the current structure doesn't consider
- a second structure that implies more changes, where directories are analysis-based

What do you, as OSEkit users, think of these draft structures?

Example

An original dataset, from which 4 analyses are run:

| Analysis | Description |
|---|---|
| A1 | Different audio length and different start/end times than the original |
| A2 | Same audio parameters as A1: no reshaping needed. Only the fft parameters change |
| B | Different start/end times than A1 and A2: reshaping needed |
| C | Different audio parameters than A1, A2 and B: reshaping needed |

Original Dataset :

audio_file_length = 3_600
sampling_frequency = 128_000
# ISO dates (year-month-day): day-first strings would be parsed month-first by pandas
t_start = Timestamp("2023-01-01 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")

Analyses :

Analysis A1

# Different time period than original
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters than original
audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Analysis A2

# Same time period and audio parameters as A1: audio files don't need to be reshaped.
t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

# Only fft parameters differ from analysis A1

nfft = 1_024
window_size = 2_048
overlap = 50
zoom_level = 5
scale = 'log'

Analysis B

# Different time period: reshape needed.

t_start = Timestamp("2023-01-03 00:00:00")
t_stop = Timestamp("2023-01-03 12:00:00")

audio_length = 1_800
sampling_frequency = 128_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'

Analysis C

t_start = Timestamp("2023-01-02 00:00:00")
t_stop = Timestamp("2023-01-02 12:00:00")

# Different audio parameters: reshape needed.

audio_length = 900
sampling_frequency = 64_000

nfft = 1_024
window_size = 4_096
overlap = 20
zoom_level = 0
scale = 'linear'
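
The reshaping rule stated in the comments above can be summed up in a few lines. This is only a minimal sketch, not OSEkit code: `AudioRequest` and `needs_reshape` are hypothetical names introduced for illustration, and the dates are the ones of analyses A1, A2, B and C.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AudioRequest:
    """Hypothetical container for the parameters that drive reshaping."""

    t_start: datetime
    t_stop: datetime
    audio_length: int        # seconds
    sampling_frequency: int  # Hz


def needs_reshape(new: AudioRequest, existing: AudioRequest) -> bool:
    """Reshaping is needed unless time period and audio parameters all match."""
    return new != existing


a1 = AudioRequest(datetime(2023, 1, 2), datetime(2023, 1, 2, 12), 1_800, 128_000)
a2 = AudioRequest(datetime(2023, 1, 2), datetime(2023, 1, 2, 12), 1_800, 128_000)
b = AudioRequest(datetime(2023, 1, 3), datetime(2023, 1, 3, 12), 1_800, 128_000)
c = AudioRequest(datetime(2023, 1, 2), datetime(2023, 1, 2, 12), 900, 64_000)

for name, request in {"A2": a2, "B": b, "C": c}.items():
    print(name, "needs reshape vs A1:", needs_reshape(request, a1))
```

The fft parameters are deliberately left out of `AudioRequest`, since they only affect the spectrogram outputs and not the audio files.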

Current directory structure

dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ audio_1a.wav
    │   │   │   ├╴ audio_2a.wav
    │   │   │   ├╴ ...
    │   │   │   ├╴ metadata.csv
    │   │   │   └╴ timestamp.csv
    │   │   ├╴ 900_64000
    │   │   │   └╴ ...
    │   │   └╴ 3600_128000
    │   │       ├╴ audio_1.wav
    │   │       ├╴ audio_2.wav
    │   │       ├╴ ...
    │   │       ├╴ file_metadata.csv
    │   │       ├╴ metadata.csv
    │   │       └╴ timestamp.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ log
    └╴ processed
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrogram
            ├╴ 1800_128000
            │   ├╴ 1024_4096_20_linear
            │   │   ├╴ image
            │   │   │   ├╴ spectro_A1_1.png
            │   │   │   ├╴ spectro_A1_2.png
            │   │   │   └╴ ...
            │   │   ├╴ matrix
            │   │   └╴ metadata.csv
            │   └╴ 1024_2048_50_linear
            │       └╴ ...
            └╴ 900_64000
                └╴ 1024_4096_20_linear
                    └╴ ...

Problems:

No support for t_start and t_stop
- Initializing Analysis B implies overwriting Analysis A1 (same audio parameters, different time period)
Not all spectrogram parameters are supported (e.g. zoom level, y-axis method)
Some names could be more explicit:
- processed: could be replaced by output.
- spectrogram and matrix could be grouped under a higher-level spectrum folder.
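
To make the missing time-period layer concrete, here is a small sketch of the collision, assuming the audio folder name is built only from the audio length and sample rate as in the tree above. `current_audio_dir` is a hypothetical helper, not an OSEkit function.

```python
def current_audio_dir(audio_length: int, sampling_frequency: int) -> str:
    """Folder name as in the tree above: <audio_length>_<sampling_frequency>."""
    return f"{audio_length}_{sampling_frequency}"


# (audio_length, sampling_frequency, (t_start, t_stop)) for analyses A1 and B
a1 = (1_800, 128_000, ("2023-01-02 00:00:00", "2023-01-02 12:00:00"))
b = (1_800, 128_000, ("2023-01-03 00:00:00", "2023-01-03 12:00:00"))

dir_a1 = current_audio_dir(a1[0], a1[1])
dir_b = current_audio_dir(b[0], b[1])

# Same folder for two different time periods: nothing keeps the two sets
# of reshaped files apart.
print(dir_a1, dir_b, dir_a1 == dir_b and a1[2] != b[2])
```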

Draft modifications of existing structure

Add subfolders for the start and end dates of the analysis.
Add the missing spectrogram parameters to the folder names.
Rename some folders (e.g. processed -> output).

dataset
    ├╴ data
    │   ├╴ audio
    │   │   ├╴ 1800_128000
    │   │   │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │   │   ├╴ audio_1a.wav
    │   │   │   │   ├╴ audio_2a.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   ├╴ analysis_metadata.csv
    │   │   │   │   └╴ file_metadata.csv
    │   │   │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │   │   │       └╴ ...
    │   │   ├╴ 900_64000
    │   │   │   └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   │       └╴ ...
    │   │   └╴ 3600_128000_original
    │   │       └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │   │           ├╴ audio_1.wav
    │   │           ├╴ audio_2.wav
    │   │           ├╴ ...
    │   │           ├╴ analysis_metadata.csv
    │   │           └╴ file_metadata.csv
    │   └╴ auxiliary
    ├╴ other
    ├╴ logs
    └╴ output
        ├╴ adjustment_spectros
        │   ├╴ spectro_a1.png
        │   ├╴ spectro_a2.png
        │   └╴ adjust_metadata.csv
        └╴ spectrum
            ├╴ 1800_128000
            │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            │   │   ├╴ 1024_4096_20_0_linear
            │   │   │   ├╴ spectrogram
            │   │   │   │   ├╴ spectro_A1_1.png
            │   │   │   │   ├╴ spectro_A1_2.png
            │   │   │   │   └╴ ...
            │   │   │   ├╴ matrix
            │   │   │   └╴ spectrum_metadata.csv
            │   │   └╴ 1024_2048_50_5_log
            │   │       └╴ ...
            │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
            │       └╴ 1024_4096_20_0_linear
            │           └╴ ...
            └╴ 900_64000
                └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
                    └╴ 1024_4096_20_0_linear
                        └╴ ...
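
The folder names in this draft can be derived mechanically from the analysis parameters. The sketch below shows one way to do it; all function names are hypothetical, and the parameter order in the spectrum folder name (nfft, window size, overlap, zoom level, scale) is read off the tree above.

```python
from datetime import datetime
from pathlib import PurePosixPath


def audio_dir(audio_length: int, sampling_frequency: int) -> str:
    return f"{audio_length}_{sampling_frequency}"


def period_dir(t_start: datetime, t_stop: datetime) -> str:
    fmt = "%Y-%m-%d_%H-%M-%S"
    return f"{t_start.strftime(fmt)}__{t_stop.strftime(fmt)}"


def spectrum_dir(nfft: int, window_size: int, overlap: int, zoom_level: int, scale: str) -> str:
    return f"{nfft}_{window_size}_{overlap}_{zoom_level}_{scale}"


# Analysis A2 from the example
path = PurePosixPath(
    "dataset",
    "output",
    "spectrum",
    audio_dir(1_800, 128_000),
    period_dir(datetime(2023, 1, 2, 0, 0, 0), datetime(2023, 1, 2, 12, 0, 0)),
    spectrum_dir(1_024, 2_048, 50, 5, "log"),
)
print(path)
```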

Remarks

There are still some flaws in this structure:

Should LTAS be put in special directories, or just in a spectrum directory with timestamps / file duration / sample rate that match the original data?
Should all adjustment spectrograms be put in the same folder?
The metadata.csv name is used several times for different purposes
- In the original data folder, file_metadata.csv and timestamp.csv contain redundant information: keep only file_metadata.csv?
- Replace xxx_metadata.csv files with xxx.json files that could be used for serializing python classes? (e.g., an analysis_dataset.json file in each analysis folder that can be parsed to a Dataset object in OSEkit).
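
As an illustration of the JSON idea, here is a minimal round-trip sketch using only the standard library. `AnalysisConfig` and its fields are hypothetical placeholders, not an existing OSEkit class; the real object would carry whatever the analysis needs.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnalysisConfig:
    """Hypothetical stand-in for the object an analysis JSON file would describe."""

    t_start: str  # ISO 8601 string, so the file stays human-readable
    t_stop: str
    audio_length: int
    sampling_frequency: int

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "AnalysisConfig":
        return cls(**json.loads(text))


config = AnalysisConfig("2023-01-02T00:00:00", "2023-01-02T12:00:00", 1_800, 128_000)
restored = AnalysisConfig.from_json(config.to_json())
print(restored == config)
```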

Draft new structure

Takes the remarks above into account
One directory per analysis (dataset\audiolength_samplerate\tstart_tend\: corresponds to one call to the reshaper module).
- These directories include both the data and output folders.
Makes the original dataset explicit (it could also hold analyses with output etc.)

dataset
    ├╴ 3600_128000_original
    │   └╴ 2023-01-01_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ log
    │       └╴ output
    ├╴ 1800_128000
    │   ├╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
    │   │   ├╴ analysis.json
    │   │   ├╴ data
    │   │   │   ├╴ audio
    │   │   │   │   ├╴ audio_1.wav
    │   │   │   │   ├╴ audio_2.wav
    │   │   │   │   ├╴ ...
    │   │   │   │   └╴ audio.json
    │   │   │   └╴ auxiliary
    │   │   ├╴ output
    │   │   │   ├╴ 1024_4096_20_0_linear
    │   │   │   │   ├╴ spectrogram
    │   │   │   │   │   ├╴ spectrogram_1.png
    │   │   │   │   │   ├╴ spectrogram_2.png
    │   │   │   │   │   └╴ ...
    │   │   │   │   ├╴ matrix
    │   │   │   │   └╴ spectrum.json
    │   │   │   └╴ 1024_2048_50_5_log
    │   │   │       ├╴ spectrogram
    │   │   │       │   ├╴ spectrogram_1.png
    │   │   │       │   └╴ ...
    │   │   │       ├╴ matrix
    │   │   │       └╴ spectrum.json
    │   │   └╴ log
    │   └╴ 2023-01-03_00-00-00__2023-01-03_12-00-00
    │       ├╴ analysis.json
    │       ├╴ data
    │       │   ├╴ audio
    │       │   │   ├╴ audio_1.wav
    │       │   │   ├╴ audio_2.wav
    │       │   │   ├╴ ...
    │       │   │   └╴ audio.json
    │       │   └╴ auxiliary
    │       ├╴ output
    │       │   └╴ 1024_4096_20_0_linear
    │       │       ├╴ spectrogram
    │       │       │   ├╴ spectrogram_1.png
    │       │       │   ├╴ spectrogram_2.png
    │       │       │   └╴ ...
    │       │       ├╴ matrix
    │       │       └╴ spectrum.json
    │       └╴ log
    └╴ 900_64000
        └╴ 2023-01-02_00-00-00__2023-01-02_12-00-00
            ├╴ analysis.json
            ├╴ data
            │   ├╴ audio
            │   │   ├╴ audio_1.wav
            │   │   ├╴ audio_2.wav
            │   │   ├╴ ...
            │   │   └╴ audio.json
            │   └╴ auxiliary
            ├╴ output
            │   └╴ 1024_4096_20_0_linear
            │       ├╴ spectrogram
            │       │   ├╴ spectrogram_1.png
            │       │   ├╴ spectrogram_2.png
            │       │   └╴ ...
            │       ├╴ matrix
            │       └╴ spectrum.json
            └╴ log
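
With one self-contained folder per analysis, listing the analyses of a dataset becomes a simple directory scan. The sketch below assumes the layout above (an analysis.json two levels under the dataset root); `find_analyses` is a hypothetical helper and the demo builds a throwaway tree in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_analyses(dataset_root: Path) -> list[str]:
    """Return 'audio_params/period' for every folder holding an analysis.json."""
    return sorted(
        json_file.parent.relative_to(dataset_root).as_posix()
        for json_file in dataset_root.glob("*/*/analysis.json")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    for rel in (
        "1800_128000/2023-01-02_00-00-00__2023-01-02_12-00-00",
        "1800_128000/2023-01-03_00-00-00__2023-01-03_12-00-00",
        "900_64000/2023-01-02_00-00-00__2023-01-02_12-00-00",
    ):
        folder = root / rel
        folder.mkdir(parents=True)
        (folder / "analysis.json").write_text("{}")
    found = find_analyses(root)

print(found)
```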


MaelleTtrt · 2024-10-24T08:14:53Z

I like the proposed new structure, I think it solves most of our issues with the folder names. I just have a few questions:

I don't understand where the LTAS data is in this structure?
You replaced timestamp.csv and file_metadata.csv with audio.json, and metadata.csv with spectrum.json?

For me, all adjustment spectrograms can be put into the same folder, and they can also be deleted once the whole spectrogram generation is launched.

Gautzilla · 2024-10-24T14:25:54Z

@MaelleTtrt

Easier question first:

You replaced timestamp.csv and file_metadata.csv with audio.json, and metadata.csv with spectrum.json?

The names are placeholders atm, but the idea is to make deserializable files: analysis.json is a file that contains the (human-readable and editable) information on the analysis, and that OSEkit can parse to create objects (such as a DatasetAnalysis, which you could recover later on).

It may not be a good idea for the file_metadata.csv because we want to keep a simple dictionary for tracking audio timestamps, or maybe we will want to create something like an AudioSet class that contains the audio metadata, plus methods for filtering them by timestamp or whatever. I guess I'll clarify all that as I progress in reformatting the package!

I don't understand where the LTAS data is in this structure?

This simple, quick question led to a complicated, long discussion, which in turn led to another draft structure, involving even more drastic changes to OSEkit 👺

Basically, here are the changes we discussed:

Time period is moved above audio parameters in the structure:

This might help keep track of which time regions of the dataset have already been analyzed

dataset
├── 2023-01-01_00-00-00__2023-01-03_12-00-00
│   └── 3600_128000_original
│       └── ...
├── 2023-01-02_00-00-00__2023-01-02_12-00-00
│   ├── 1800_128000
│   │   └── ... # A1 and A2
│   └── 900_64000
│       └── ... # C
└── 2023-01-03_00-00-00__2023-01-03_12-00-00
    └── 1800_128000
        └── ... # B
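
With the time period at the top level, the analyzed regions can be recovered from the folder names alone. A minimal sketch, assuming the `start__stop` naming used in the trees of this issue and the periods of analyses A1/A2 and B; `parse_period_dir` and `is_covered` are hypothetical helpers.

```python
from datetime import datetime

FMT = "%Y-%m-%d_%H-%M-%S"


def parse_period_dir(name: str) -> tuple[datetime, datetime]:
    """Split a '<start>__<stop>' folder name into two datetimes."""
    start, stop = name.split("__")
    return datetime.strptime(start, FMT), datetime.strptime(stop, FMT)


def is_covered(t: datetime, period_dirs: list[str]) -> bool:
    """True if t falls inside at least one analyzed period (stop excluded)."""
    return any(start <= t < stop for start, stop in map(parse_period_dir, period_dirs))


period_dirs = [
    "2023-01-02_00-00-00__2023-01-02_12-00-00",  # A1, A2, C
    "2023-01-03_00-00-00__2023-01-03_12-00-00",  # B
]
print(is_covered(datetime(2023, 1, 2, 6), period_dirs))
print(is_covered(datetime(2023, 1, 2, 18), period_dirs))
```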

Store LTAS in the time period root

LTAS would be stored as an analysis, but with LTAS specified in place of the audio duration (which would implicitly be something like (t2-t1)/Timedelta(seconds = 1)).

For example, I want to generate a LTAS with a sr of 128 Hz over the whole example dataset time period, a time resolution of 30 minutes (that is, 128 * 1800 = 230400-sample-wide temporal windows), and nfft=256. Moreover, I want to generate a LTAS with the same parameters, only on the period covered by Analyses A1 & A2. This would lead to the following structure:

dataset
├── 2023-01-01_00-00-00__2023-01-03_12-00-00
│   ├── 3600_128000_original
│   │   ├── analysis.json
│   │   ├── data
│   │   ├── log
│   │   └── output      
│   └── LTAS_128
│       ├── analysis.json
│       ├── log
│       └── output
│           └── 1800_256
│               ├── spectrogram
│               ├── matrix
│               └── spectrum.json
└── 2023-01-02_00-00-00__2023-01-02_12-00-00
    └── LTAS_128
        ├── analysis.json
        ├── log
        └── output
            └── 1800_256
                ├── spectrogram
                ├── matrix
                └── spectrum.json
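
The numbers behind this LTAS example can be checked with a few lines of arithmetic. This sketch assumes the 128 Hz sample rate implied by the LTAS_128 folder name (which matches the 230400-sample window) and the time period of the original dataset; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

sample_rate = 128                            # Hz, as in the LTAS_128 folder name
temporal_resolution = timedelta(minutes=30)  # the 1800 in the 1800_256 folder name
t_start = datetime(2023, 1, 1, 0, 0, 0)
t_stop = datetime(2023, 1, 3, 12, 0, 0)

# Width of one temporal window, in samples
samples_per_window = int(sample_rate * temporal_resolution.total_seconds())
# Number of 30-minute windows over the whole period
n_windows = int((t_stop - t_start) / temporal_resolution)
# Implicit "audio duration" of the LTAS analysis, in seconds
implicit_audio_duration = (t_stop - t_start) / timedelta(seconds=1)

print(samples_per_window, n_windows, implicit_audio_duration)
```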

Replace `nfft` and `window_size` with `frequency_resolution` and `temporal_resolution`

As a time resolution of 20 ms is more obvious than a window size of 3840 samples at a sampling rate of 192 kHz, we might use these metrics primarily for creating the analyses?

This would imply some behind-the-scenes checks: for example, we should pick nffts that are powers of 2 whatever the given frequency resolution:

    nfft = int(sample_rate // frequency_resolution)
    # Largest power of 2 that does not exceed nfft. Warn the user if
    # frequency_resolution is changed so that nfft matches optimal_nfft.
    optimal_nfft = 1 << (nfft.bit_length() - 1)

or consider overlap in the computing of the time window sizes if a given temporal resolution leads to very small window sizes or whatever.
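
Here is a slightly fuller sketch of what such checks could look like, under the same rounding-down rule as the snippet above. `window_size_for` and `nfft_for` are hypothetical names, overlap is not handled, and the warning text is only an example.

```python
import warnings


def window_size_for(sample_rate: float, temporal_resolution: float) -> int:
    """Number of samples spanned by one time bin (temporal_resolution in seconds)."""
    return round(sample_rate * temporal_resolution)


def nfft_for(sample_rate: float, frequency_resolution: float) -> int:
    """Largest power-of-two nfft not exceeding sample_rate / frequency_resolution."""
    nfft = int(sample_rate // frequency_resolution)
    if nfft < 1:
        raise ValueError("frequency_resolution must not exceed sample_rate")
    optimal_nfft = 1 << (nfft.bit_length() - 1)
    if optimal_nfft != nfft:
        warnings.warn(
            f"nfft rounded from {nfft} to {optimal_nfft}: effective frequency "
            f"resolution is {sample_rate / optimal_nfft:g} Hz"
        )
    return optimal_nfft


print(window_size_for(192_000, 0.020))  # the 20 ms example above
print(nfft_for(192_000, 100))           # emits a warning: 1920 is not a power of 2
```

Rounding down keeps nfft below the requested value, so the effective frequency resolution becomes coarser than requested; rounding up would be the other option and is a design choice to discuss.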

If these metrics appear to make more sense than the previous ones, we should still discuss how to include them in the directory structure, as the resolutions might be floating-point values: an LTAS could work with ~hours-long temporal resolutions, and an analysis looking for dolphin clicks with temporal resolutions on the order of a millisecond. Would we risk adding dots in the folder names (🤢)? Should we note the resolution in milliseconds (and in millihertz for the frequency resolutions of campaigns that study whales??)
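
To feed that discussion, here is what the integer-unit option could look like. It is purely a sketch of one possible convention, not a proposal for the final names: `resolution_dir_name` and the `ms` / `mHz` suffixes are assumptions made for illustration.

```python
def resolution_dir_name(temporal_resolution: float, frequency_resolution: float) -> str:
    """Encode resolutions (seconds, hertz) as integers: milliseconds and millihertz."""
    time_ms = round(temporal_resolution * 1_000)
    freq_mhz = round(frequency_resolution * 1_000)
    return f"{time_ms}ms_{freq_mhz}mHz"


ltas_like = resolution_dir_name(1_800, 0.5)     # 30-minute bins, 0.5 Hz resolution
click_like = resolution_dir_name(0.001, 187.5)  # 1 ms bins, 187.5 Hz resolution
print(ltas_like, click_like)
```

The names stay dot-free, at the cost of long numbers for hour-scale resolutions.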

Gautzilla · 2024-11-21T10:16:37Z

As discussed with @mathieudpnt and @PaulCarvaillo, we might keep features that risk breaking backward compatibility for later, in a brighter future when OSEkit is reformatted and easier to maintain! ☀️

Gautzilla self-assigned this Oct 22, 2024

Gautzilla added data format Work related to spectrogram/audio format and how to process it and removed data format Work related to spectrogram/audio format and how to process it labels Oct 22, 2024

ElodieENSTA added the APLOSE related The changes are impacted APLOSE behavior label Oct 22, 2024

Gautzilla added the long term Issue that takes a long time to be corrected label Nov 21, 2024
