Merge branch 'main' into set-probe
CodyCBakerPhD authored Nov 28, 2023
2 parents 6f5f9ba + 554e07b commit 9afd9a9
Showing 42 changed files with 1,179 additions and 380 deletions.
32 changes: 8 additions & 24 deletions .github/workflows/add-to-dashboard.yml
@@ -1,35 +1,19 @@
-name: Add Issue or PR to Dashboard
+name: Add Issue or Pull Request to Dashboard

on:
  issues:
-    types: opened
+    types:
+      - opened
  pull_request:
    types:
      - opened

jobs:
-  issue_opened:
-    name: Add Issue to Dashboard
-    runs-on: ubuntu-latest
-    if: github.event_name == 'issues'
-    steps:
-      - name: Add Issue to Dashboard
-        uses: leonsteinhaeuser/[email protected]
-        with:
-          gh_token: ${{ secrets.MY_GITHUB_TOKEN }}
-          organization: catalystneuro
-          project_id: 3
-          resource_node_id: ${{ github.event.issue.node_id }}
-  pr_opened:
-    name: Add PR to Dashboard
+  add-to-project:
+    name: Add issue or pull request to project
    runs-on: ubuntu-latest
-    if: github.event_name == 'pull_request' && github.event.action == 'opened'
    steps:
-      - name: Add PR to Dashboard
-        uses: leonsteinhaeuser/[email protected]
+      - uses: actions/[email protected]
        with:
-          gh_token: ${{ secrets.MY_GITHUB_TOKEN }}
-          organization: catalystneuro
-          project_id: 3
-          resource_node_id: ${{ github.event.pull_request.node_id }}
+          project-url: https://github.com/orgs/catalystneuro/projects/3
+          github-token: ${{ secrets.PROJECT_TOKEN }}
2 changes: 1 addition & 1 deletion .github/workflows/dev-testing.yml
@@ -18,7 +18,7 @@ env:

jobs:
run:
-name: Dev Branch Testing with Python 3.9 and ubuntu-latest
+name: Ubuntu tests with Python ${{ matrix.python-version }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
2 changes: 1 addition & 1 deletion .github/workflows/doctests.yml
@@ -4,7 +4,7 @@ on:

jobs:
run:
-name: Doctests on ${{ matrix.os }} with Python ${{ matrix.python-version }}
+name: ${{ matrix.os }} Python ${{ matrix.python-version }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
2 changes: 1 addition & 1 deletion .github/workflows/formatwise-installation-testing.yml
@@ -6,7 +6,7 @@ on:

jobs:
run:
-name: Formatwise gallery tests for ${{ format.type }}:${{ format.name }} on ${{ matrix.os }} with Python ${{ matrix.python-version }}
+name: ${{ format.type }}:${{ format.name }} on ${{ matrix.os }} with Python ${{ matrix.python-version }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
2 changes: 1 addition & 1 deletion .github/workflows/live-service-testing.yml
@@ -12,7 +12,7 @@ env:

jobs:
run:
-name: Live service testing on ${{ matrix.os }} with Python ${{ matrix.python-version }}
+name: ${{ matrix.os }} Python ${{ matrix.python-version }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
2 changes: 1 addition & 1 deletion .github/workflows/testing.yml
@@ -14,7 +14,7 @@ on:

jobs:
run:
-name: Minimal and full tests on ${{ matrix.os }} with Python ${{ matrix.python-version }}
+name: ${{ matrix.os }} Python ${{ matrix.python-version }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -5,13 +5,21 @@
* Changed the metadata schema for `Fluorescence` and `DfOverF` where the traces metadata can be provided as a dict instead of a list of dicts.
The name of the plane segmentation is used to determine which traces to add to the `Fluorescence` and `DfOverF` containers. [PR #632](https://github.com/catalystneuro/neuroconv/pull/632)
* Modify the filtering of traces to also filter out traces with empty values. [PR #649](https://github.com/catalystneuro/neuroconv/pull/649)
* Added tool function `get_default_dataset_configurations` for identifying and collecting all fields of an in-memory `NWBFile` that could become datasets on disk; and return instances of the Pydantic dataset models filled with default values for chunking/buffering/compression. [PR #569](https://github.com/catalystneuro/neuroconv/pull/569)
* Added `set_probe()` method to `BaseRecordingExtractorInterface`. [PR #639](https://github.com/catalystneuro/neuroconv/pull/639)

### Fixes
* Fixed GenericDataChunkIterator (in hdmf.py) in the case where the number of dimensions is 1 and the size in bytes is greater than the threshold of 1 GB. [PR #638](https://github.com/catalystneuro/neuroconv/pull/638)
* Changed `np.floor` and `np.prod` usage to `math.floor` and `math.prod` in various files. [PR #638](https://github.com/catalystneuro/neuroconv/pull/638)
* Updated minimal required version of DANDI CLI; updated `run_conversion_from_yaml` API function and tests to be compatible with naming changes. [PR #664](https://github.com/catalystneuro/neuroconv/pull/664)

# v0.4.5
### Improvements
* Change metadata extraction library from `fparse` to `parse`. [PR #654](https://github.com/catalystneuro/neuroconv/pull/654)
* The `dandi` CLI/API is now an optional dependency; it is still required to use the `tool` function for automated upload as well as the YAML-based NeuroConv CLI. [PR #655](https://github.com/catalystneuro/neuroconv/pull/655)



# v0.4.5 (November 6, 2023)

### Back-compatibility break
* The `CEDRecordingInterface` has now been removed; use the `Spike2RecordingInterface` instead. [PR #602](https://github.com/catalystneuro/neuroconv/pull/602)
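The new `set_probe()` method noted in the CHANGELOG entry above (PR #639) is not itself shown in this diff, so its exact signature is unknown here. The sketch below is a hypothetical usage that assumes the method accepts a probeinterface.Probe, mirroring the SpikeInterface `recording.set_probe()` API; the interface class and file path are placeholders.

from probeinterface import generate_linear_probe

from neuroconv.datainterfaces import SpikeGLXRecordingInterface  # any BaseRecordingExtractorInterface subclass

# Build a toy probe and map its contacts to the recorded channels.
probe = generate_linear_probe(num_elec=384, ypitch=20.0)
probe.set_device_channel_indices(list(range(384)))

# Hypothetical call: the argument type is assumed, not confirmed by this diff (see PR #639).
interface = SpikeGLXRecordingInterface(file_path="path/to/recording_g0_t0.imec0.ap.bin")  # placeholder path
interface.set_probe(probe)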
6 changes: 5 additions & 1 deletion docs/developer_guide/testing_suite.rst
@@ -20,7 +20,7 @@ Then install all required and optional dependencies in a fresh environment.

.. code:: bash
-pip install -e . neuroconv[test,full]
+pip install -e .[test,full]
Then simply run all tests with pytest
@@ -29,6 +29,10 @@ Then simply run all tests with pytest
pytest
.. note::

You will likely observe many failed tests if the test data is not available. See the section 'Testing on Example Data' for instructions on how to download the test data.


Minimal
-------
5 changes: 3 additions & 2 deletions requirements-minimal.txt
@@ -6,8 +6,9 @@ h5py>=3.9.0
hdmf>=3.11.0
hdmf_zarr>=0.4.0
pynwb>=2.3.2;python_version>='3.8'
nwbinspector>=0.4.31
pydantic>=1.10.13,<2.0.0
psutil>=5.8.0
tqdm>=4.60.0
dandi>=0.57.0
pandas
fparse
parse>=1.20.0
3 changes: 1 addition & 2 deletions requirements-testing.txt
@@ -2,6 +2,5 @@ pytest
pytest-cov
ndx-events>=0.2.0 # for special tests to ensure load_namespaces is set to allow NWBFile load at all times
parameterized>=0.8.1
-scikit-learn # For SI Waveform tests
-numba; python_version <= '3.10' # For SI Waveform tests
ndx-miniscope
+spikeinterface[qualitymetrics]>=0.99.1
5 changes: 4 additions & 1 deletion setup.py
@@ -18,6 +18,9 @@
testing_suite_dependencies = f.readlines()

extras_require = defaultdict(list)
extras_require["dandi"].append("dandi>=0.58.1")
extras_require["full"].extend(extras_require["dandi"])

extras_require.update(test=testing_suite_dependencies, docs=documentation_dependencies)
for modality in ["ophys", "ecephys", "icephys", "behavior", "text"]:
modality_path = root / "src" / "neuroconv" / "datainterfaces" / modality
@@ -75,7 +78,7 @@
extras_require=extras_require,
entry_points={
"console_scripts": [
"neuroconv = neuroconv.tools.yaml_conversion_specification.yaml_conversion_specification:run_conversion_from_yaml_cli",
"neuroconv = neuroconv.tools.yaml_conversion_specification._yaml_conversion_specification:run_conversion_from_yaml_cli",
],
},
license="BSD-3-Clause",
2 changes: 1 addition & 1 deletion src/neuroconv/datainterfaces/ecephys/requirements.txt
@@ -1,2 +1,2 @@
-spikeinterface>=0.98.2
+spikeinterface>=0.99.1
packaging<22.0
1 change: 1 addition & 0 deletions src/neuroconv/tools/__init__.py
@@ -1,3 +1,4 @@
"""Collection of all helper functions that require at least one external dependency (some being optional as well)."""
from .importing import get_package
from .nwb_helpers import get_module
from .path_expansion import LocalPathExpander
Expand Down
5 changes: 5 additions & 0 deletions src/neuroconv/tools/data_transfers/__init__.py
@@ -0,0 +1,5 @@
"""Collection of helper functions for assessing and performing automated data transfers."""
from ._aws import estimate_s3_conversion_cost
from ._dandi import automatic_dandi_upload
from ._globus import get_globus_dataset_content_sizes, transfer_globus_content
from ._helpers import estimate_total_conversion_runtime
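Because the new data_transfers subpackage re-exports these helpers, they can be imported directly from its namespace. A minimal illustration based only on the __init__ above:

from neuroconv.tools.data_transfers import (
    automatic_dandi_upload,
    estimate_s3_conversion_cost,
    estimate_total_conversion_runtime,
    get_globus_dataset_content_sizes,
    transfer_globus_content,
)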
34 changes: 34 additions & 0 deletions src/neuroconv/tools/data_transfers/_aws.py
@@ -0,0 +1,34 @@
"""Collection of helper functions for assessing and performing automated data transfers related to AWS."""


def estimate_s3_conversion_cost(
    total_mb: float,
    transfer_rate_mb: float = 20.0,
    conversion_rate_mb: float = 17.0,
    upload_rate_mb: float = 40.0,
    compression_ratio: float = 1.7,
):
    """
    Estimate potential cost of performing an entire conversion on S3 using full automation.

    Parameters
    ----------
    total_mb: float
        The total amount of data (in MB) that will be transferred, converted, and uploaded to dandi.
    transfer_rate_mb : float, default: 20.0
        Estimate of the transfer rate for the data.
    conversion_rate_mb : float, default: 17.0
        Estimate of the conversion rate for the data. Can vary widely depending on conversion options and type of data.
        Figure of 17MB/s is based on extensive compression of high-volume, high-resolution ecephys.
    upload_rate_mb : float, default: 40.0
        Estimate of the upload rate of a single file to the DANDI Archive.
    compression_ratio : float, default: 1.7
        Estimate of the final average compression ratio for datasets in the file. Can vary widely.
    """
    c = 1 / compression_ratio  # compressed_size = total_size * c
    total_mb_s = (
        total_mb**2 / 2 * (1 / transfer_rate_mb + (2 * c + 1) / conversion_rate_mb + 2 * c**2 / upload_rate_mb)
    )
    cost_gb_m = 0.08 / 1e3  # $0.08 / GB Month
    cost_mb_s = cost_gb_m / (1e3 * 2.628e6)  # assuming 30 day month; unsure how amazon weights shorter months?
    return cost_mb_s * total_mb_s
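A minimal usage sketch of the helper above; the 1 TB figure is purely illustrative. Note that the estimate grows quadratically with total_mb in the expression above.

from neuroconv.tools.data_transfers import estimate_s3_conversion_cost

# Illustrative only: estimate the S3 cost for roughly 1 TB (1,000,000 MB) of source data
# using the default transfer, conversion, upload, and compression assumptions above.
estimated_cost_in_usd = estimate_s3_conversion_cost(total_mb=1e6)
print(f"Estimated S3 conversion cost: ${estimated_cost_in_usd:.2f}")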
118 changes: 118 additions & 0 deletions src/neuroconv/tools/data_transfers/_dandi.py
@@ -0,0 +1,118 @@
"""Collection of helper functions for assessing and performing automated data transfers for the DANDI archive."""
import os
from pathlib import Path
from shutil import rmtree
from tempfile import mkdtemp
from typing import Union
from warnings import warn

from pynwb import NWBHDF5IO

from ...utils import FolderPathType, OptionalFolderPathType


def automatic_dandi_upload(
    dandiset_id: str,
    nwb_folder_path: FolderPathType,
    dandiset_folder_path: OptionalFolderPathType = None,
    version: str = "draft",
    staging: bool = False,
    cleanup: bool = False,
    number_of_jobs: Union[int, None] = None,
    number_of_threads: Union[int, None] = None,
):
    """
    Fully automated upload of NWBFiles to a DANDISet.

    Requires an API token set as an environment variable named DANDI_API_KEY.
    To set this in your bash terminal in Linux or MacOS, run
        export DANDI_API_KEY=...
    or in Windows
        set DANDI_API_KEY=...
    DO NOT STORE THIS IN ANY PUBLICLY SHARED CODE.

    Parameters
    ----------
    dandiset_id : str
        Six-digit string identifier for the DANDISet the NWBFiles will be uploaded to.
    nwb_folder_path : folder path
        Folder containing the NWBFiles to be uploaded.
    dandiset_folder_path : folder path, optional
        A separate folder location within which to download the dandiset.
        Used in cases where you do not have write permissions for the parent of the 'nwb_folder_path' directory.
        Default behavior downloads the DANDISet to a folder adjacent to the 'nwb_folder_path'.
    version : {None, "draft", "version"}
        The default is "draft".
    staging : bool, default: False
        Is the DANDISet hosted on the staging server? This is mostly for testing purposes.
        The default is False.
    cleanup : bool, default: False
        Whether to remove the dandiset folder path and nwb_folder_path.
        Defaults to False.
    number_of_jobs : int, optional
        The number of jobs to use in the DANDI upload process.
    number_of_threads : int, optional
        The number of threads to use in the DANDI upload process.
    """
    from dandi.download import download as dandi_download
    from dandi.organize import organize as dandi_organize
    from dandi.upload import upload as dandi_upload

    assert os.getenv("DANDI_API_KEY"), (
        "Unable to find environment variable 'DANDI_API_KEY'. "
        "Please retrieve your token from DANDI and set this environment variable."
    )

    dandiset_folder_path = (
        Path(mkdtemp(dir=nwb_folder_path.parent)) if dandiset_folder_path is None else dandiset_folder_path
    )
    dandiset_path = dandiset_folder_path / dandiset_id
    # Odd bit of logic upstream: https://github.com/dandi/dandi-cli/blob/master/dandi/cli/cmd_upload.py#L92-L96
    if number_of_threads is not None and number_of_threads > 1 and number_of_jobs is None:
        number_of_jobs = -1

    url_base = "https://gui-staging.dandiarchive.org" if staging else "https://dandiarchive.org"
    dandiset_url = f"{url_base}/dandiset/{dandiset_id}/{version}"
    dandi_download(urls=dandiset_url, output_dir=str(dandiset_folder_path), get_metadata=True, get_assets=False)
    assert dandiset_path.exists(), "DANDI download failed!"

    # TODO: need PR on DANDI to expose number of jobs
    dandi_organize(
        paths=str(nwb_folder_path), dandiset_path=str(dandiset_path), devel_debug=True if number_of_jobs == 1 else False
    )
    organized_nwbfiles = dandiset_path.rglob("*.nwb")

    # DANDI has yet to implement forcing of session_id inclusion in organize step
    # This manually enforces it when only a single session per subject is organized
    for organized_nwbfile in organized_nwbfiles:
        if "ses" not in organized_nwbfile.stem:
            with NWBHDF5IO(path=organized_nwbfile, mode="r") as io:
                nwbfile = io.read()
                session_id = nwbfile.session_id
            dandi_stem = organized_nwbfile.stem
            dandi_stem_split = dandi_stem.split("_")
            dandi_stem_split.insert(1, f"ses-{session_id}")
            corrected_name = "_".join(dandi_stem_split) + ".nwb"
            organized_nwbfile.rename(organized_nwbfile.parent / corrected_name)
    organized_nwbfiles = dandiset_path.rglob("*.nwb")
    # The above block can be removed once they add the feature

    assert len(list(dandiset_path.iterdir())) > 1, "DANDI organize failed!"

    dandi_instance = "dandi-staging" if staging else "dandi"  # Test
    dandi_upload(
        paths=[str(x) for x in organized_nwbfiles],
        dandi_instance=dandi_instance,
        jobs=number_of_jobs,
        jobs_per_file=number_of_threads,
    )

    # Cleanup should be confirmed manually; Windows especially can complain
    if cleanup:
        try:
            rmtree(path=dandiset_folder_path)
            rmtree(path=nwb_folder_path)
        except PermissionError:  # pragma: no cover
            warn("Unable to clean up source files and dandiset! Please manually delete them.")
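A hedged usage sketch of automatic_dandi_upload(): the dandiset ID and folder path below are placeholders, the call targets the staging server, and DANDI_API_KEY must already be set in the environment as the docstring above describes.

from pathlib import Path

from neuroconv.tools.data_transfers import automatic_dandi_upload

# Placeholder dandiset ID and folder; a Path is passed because the helper above
# uses nwb_folder_path.parent when no dandiset_folder_path is supplied.
automatic_dandi_upload(
    dandiset_id="200560",
    nwb_folder_path=Path("path/to/nwb_files"),
    staging=True,
)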