Commit

Merge pull request #61 from ArcanaFramework/py.typed
Typing support
tclose authored Sep 4, 2024
2 parents e73dd6b + e68db9c commit a9a3715
Showing 78 changed files with 2,312 additions and 1,161 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci-cd.yml
@@ -17,7 +17,7 @@ jobs:
strategy:
matrix:
os: [macos-latest, ubuntu-latest, windows-latest]
-python-version: ["3.8", "3.12"]
+python-version: ["3.8", "3.9", "3.12"]
fail-fast: false
runs-on: ${{ matrix.os }}
defaults:
33 changes: 14 additions & 19 deletions README.md
@@ -8,23 +8,18 @@

<img src="./docs/source/_static/images/logo_small.png" alt="Logo Small" style="float: right; width: 100mm">

-*Fileformats* provides a library of file-format types implemented as Python classes.
-The file-format types were designed to be used in type validation and data movement
-during the construction and execution of data workflows. However, they can also be
-used for some basic data handling methods (e.g. loading data to dictionaries) and format
-conversions between some equivalent types via methods defined in the associated
-[fileformats-extras](https://pypi.org/project/fileformats-extras/) package.
-
-File-format types are typically identified by a combination of file extension
-and "magic numbers" where applicable. However, unlike many other file-type Python packages,
-*FileFormats* supports multi-file data formats ("file sets") often found in scientific
-workflows, e.g. with separate header/data files. *FileFormats* also provides a flexible
-framework to add custom identification routines for exotic file formats, e.g.
-formats that require inspection of headers to locate data files, directories containing
-certain file types, or to peek at metadata fields to define specific sub-types
-(e.g. functional MRI DICOM file set). It is in the handling of multi-file formats that
-fileformats comes into its own, since it keeps track of auxiliary files when moving/copying
-to different file-system locations and calculating hashes.
+*Fileformats* provides a library of file-format types implemented as Python classes for
+validation, detection and typing, with hooks for extra functionality and format
+conversions. Formats are typically validated/identified by a combination of file extension
+and "magic numbers" where applicable. Unlike other file-type packages, *FileFormats*
+supports multi-file data formats ("file sets"), which are often found in scientific
+workflows, e.g. with separate header/data files.
+
+*FileFormats* provides a flexible extension framework to add custom identification
+routines for exotic file formats, e.g. formats that require inspection of headers to
+locate data files, directories containing certain file types, or to peek at metadata
+fields to define specific sub-types (e.g. functional MRI DICOM file set). These file-sets
+with auxiliary files can be moved, copied and hashed as if they were a single file object.

See the [extension template](https://github.com/ArcanaFramework/fileformats-extension-template)
for instructions on how to design *FileFormats* extensions modules to augment the
@@ -41,12 +41,12 @@
tested on real data and so should be treated with some caution. If you encounter any issues with an implemented file
type, please raise an issue in the [GitHub tracker](https://github.com/ArcanaFramework/fileformats/issues).

-Adding support for vendor formats will be relatively straightforward and is planned for v1.0.
+Adding support for vendor formats is planned for v1.0.


## Installation

-*FileFormats* can be installed for Python >= 3.7 from PyPI with
+*FileFormats* can be installed for Python >= 3.8 from PyPI with

```console
$ python3 -m pip install fileformats
14 changes: 7 additions & 7 deletions conftest.py
@@ -33,24 +33,24 @@
# break at it
if os.getenv("_PYTEST_RAISE", "0") != "0":

-    @pytest.hookimpl(tryfirst=True)
-    def pytest_exception_interact(call):
+    @pytest.hookimpl(tryfirst=True)  # type: ignore
+    def pytest_exception_interact(call: ty.Any) -> None:
         raise call.excinfo.value

-    @pytest.hookimpl(tryfirst=True)
-    def pytest_internalerror(excinfo):
+    @pytest.hookimpl(tryfirst=True)  # type: ignore
+    def pytest_internalerror(excinfo: ty.Any) -> None:
         raise excinfo.value


 @pytest.fixture
-def work_dir():
+def work_dir() -> Path:
     work_dir = tempfile.mkdtemp()
     return Path(work_dir)


 def write_test_file(
-    fpath: Path, contents: ty.Union[str, bytes] = "some contents", binary=False
-):
+    fpath: Path, contents: ty.Union[str, bytes] = "some contents", binary: bool = False
+) -> Path:
     fpath.parent.mkdir(exist_ok=True, parents=True)
     with open(fpath, "wb" if binary else "w") as f:
         f.write(contents)
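For illustration, the `write_test_file` helper above can be exercised standalone. This sketch re-declares the helper so it runs on its own; the `return fpath` at the end is an assumption, since the tail of the function is collapsed in the diff above:

```python
import tempfile
import typing as ty
from pathlib import Path


def write_test_file(
    fpath: Path, contents: ty.Union[str, bytes] = "some contents", binary: bool = False
) -> Path:
    # Create parent directories as needed, then write text or binary contents
    fpath.parent.mkdir(exist_ok=True, parents=True)
    with open(fpath, "wb" if binary else "w") as f:
        f.write(contents)
    return fpath  # assumed: the collapsed tail of the function returns the path


work_dir = Path(tempfile.mkdtemp())
text_file = write_test_file(work_dir / "nested" / "file.txt")
binary_file = write_test_file(work_dir / "file.bin", b"\x00\x01", binary=True)
```

Both calls create any missing parent directories before writing, which is why the fixtures above can hand out bare temporary directories.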
13 changes: 7 additions & 6 deletions docs/source/conf.py
@@ -12,18 +12,19 @@
# All configuration values have a default; values that are commented out
# serve to show the default.
from __future__ import print_function
import typing as ty
from pathlib import Path
import re
import datetime

from fileformats.core import __version__ # noqa

 with open(Path(__file__).parent / ".." / ".." / "AUTHORS") as f:
-    authors = [
-        re.match(r"([a-zA-Z\-\. ]+) <([a-zA-Z\-\._@]+)>", ln).groups()
-        for ln in f.read().split("\n")
-        if ln
-    ]
+    authors = []
+    for ln in f.read().splitlines():
+        match = re.match(r"([a-zA-Z\-\. ]+) <([a-zA-Z\-\._@]+)>", ln)
+        assert match
+        authors.append(match.groups())

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
@@ -97,7 +98,7 @@

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
-exclude_patterns = []
+exclude_patterns: ty.List[str] = []

# The reST default role (used for this markup: `text`) to use for all
# documents.
54 changes: 38 additions & 16 deletions docs/source/detection.rst
@@ -10,28 +10,31 @@ Validation
----------

In the basic case, *FileFormats* can be used for checking the format of files and
-directories against known types. Typically, there are two layers of checks, ones
-performed on the file-system paths alone,
+directories against known types. Typically this will involve checking the file extension
+and magic number if applicable

 .. code-block:: python

     from fileformats.image import Jpeg

     jpeg_file = Jpeg("/path/to/image.jpg")  # PASSES
-    jpeg_file = Jpeg("/path/to/image.png")  # FAILS!
+    Jpeg("/path/to/image.png")  # FAILS!
+
+    fake_fspath = "/path/to/fake-image.jpg"
+    with open(fake_fspath, "w") as f:
+        f.write("this is not a valid JPEG file")
+
+    Jpeg(fake_fspath)  # FAILS!

-The second layer of checks, which typically require reading the file and peeking at its
-contents for magic numbers and the like
-
-.. code-block:: python
-
-    fspath = "/path/to/fake-image.jpg"
-
-    with open(fspath, "w") as f:
-        f.write("this is not a valid JPEG file")
-
-    jpeg_file = Jpeg(fspath)  # FAILS!
+To check whether a format matches without attempting to initialise the object use the
+:meth:`FileSet.matches()` method
+
+.. code-block:: python
+
+    if Jpeg.matches("/path/to/image.jpg"):
+        ...
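The two validation layers described here (a cheap check on the file-system path, then a peek at the contents) can be sketched in plain Python. This is an illustrative standalone helper, not the *FileFormats* API; the magic number used is the standard `FF D8 FF` prefix that JPEG files start with:

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files begin with these bytes


def looks_like_jpeg(fspath: str) -> bool:
    path = Path(fspath)
    # Layer 1: check the file-system path alone (extension)
    if path.suffix.lower() not in (".jpg", ".jpeg"):
        return False
    # Layer 2: peek at the file contents for the magic number
    with open(path, "rb") as f:
        return f.read(len(JPEG_MAGIC)) == JPEG_MAGIC
```

A file named `fake-image.jpg` containing plain text passes the first layer but fails the second, which is exactly the failure mode the `Jpeg(fake_fspath)` example above demonstrates.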
Directories are classified by the contents of the files within them, via the
@@ -70,7 +73,7 @@ despite the presence of the ``.DS_Store`` directory and the ``catalog.xml`` file
In addition to statically defining `Directory` formats such as the Dicom example above,
dynamic directory types can be created on the fly by providing the content types as
-arguments to the `DirectoryOf[]` method,
+"classifier" arguments to the `DirectoryOf[]` class (see :ref:`Classifiers`),
e.g.

.. code-block:: python
@@ -82,9 +85,6 @@ e.g.
def my_task(image_dir: DirectoryOf[Png]) -> Csv:
... task implementation ...
-.. _Pydra: https://pydra.readthedocs.io
-.. _Fastr: https://gitlab.com/radiology/infrastructure/fastr

Identification
--------------
@@ -94,7 +94,7 @@

The ``find_matching`` function can be used to list the formats that match a given file
.. code-block::
>>> from fileformats.core import find_matching
->>> find_matching("/path/to/word.doc")
+>>> find_matching(["/path/to/word.doc"])
[<class 'fileformats.application.Msword'>]
.. warning::
@@ -103,4 +103,26 @@
If you are only interested in formats covered in the main fileformats package then
you should use the ``standard_only`` flag

For loosely defined formats without many constraints, ``find_matching`` may return multiple
formats that are not plausible for the given use case, in which case the ``candidates``
argument can be passed to restrict the possible formats that can be returned

.. code-block::
>>> from fileformats.datascience import MatFile, RData, Hdf5
>>> find_matching(["/path/to/text/matrix/file.mat"])
[fileformats.datascience.data.TextMatrix]
>>> find_matching(["/path/to/matlab/file.mat"])
[fileformats.datascience.data.TextMatrix, fileformats.datascience.data.MatFile]
>>> find_matching(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5])
[fileformats.datascience.data.MatFile]
``from_paths`` can be used to return an initialised object instead of a list of matching
formats; however, since you need to be confident that there is only one possible format,
it is advisable to also provide a list of candidate formats

.. code-block::
>>> from fileformats.core import from_paths
>>> repr(from_paths(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5]))
fileformats.datascience.data.MatFile({"/path/to/matlab/file.mat"})
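The effect of the ``candidates`` argument can be mimicked with a toy matcher. The registry and function name here are hypothetical, not the *FileFormats* implementation, and matching is by extension only for brevity; the point is that every known format is tried unless the caller narrows the search space:

```python
from typing import Callable, Dict, List, Optional, Sequence

# Toy registry mapping format names to "does this path match?" predicates
FORMATS: Dict[str, Callable[[str], bool]] = {
    "MatFile": lambda p: p.endswith(".mat"),
    "TextMatrix": lambda p: p.endswith(".mat") or p.endswith(".txt"),
    "Hdf5": lambda p: p.endswith(".h5"),
}


def find_matching_formats(
    fspath: str, candidates: Optional[Sequence[str]] = None
) -> List[str]:
    # Restrict the search space when the caller knows which formats are plausible
    names = candidates if candidates is not None else list(FORMATS)
    return [name for name in names if FORMATS[name](fspath)]
```

Here ``find_matching_formats("data.mat")`` returns both ``MatFile`` and ``TextMatrix``, while passing ``candidates=["MatFile", "Hdf5"]`` narrows the result to ``MatFile`` alone, mirroring the ``candidates=[MatFile, RData, Hdf5]`` call above.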
19 changes: 6 additions & 13 deletions docs/source/developer.rst
@@ -131,12 +131,12 @@ Custom format patterns
----------------------

While the standard mixin classes should cover 90% of all formats, in the wild-west of
-scientific data formats you might need to write custom validators using the
-``@fileformats.core.mark.required`` and ``@fileformats.core.mark.check`` decorators.
+scientific data formats you might need to write custom validators. This is simply done
+by adding a new property to the class using the ``@property`` decorator.

Take for example the `GIS shapefile structure <https://www.earthdatascience.org/courses/earth-analytics/spatial-data-r/shapefile-structure/>`_,
it is a file-set consisting of 3 to 6 files differentiated by their extensions. To
-implement this class we use the ``required`` decorator. We inherit from the ``WithAdjacentFiles``
+implement this class we use the ``@property`` decorator. We inherit from the ``WithAdjacentFiles``
mixin so that neighbouring files (i.e. files with the same stem but different extension)
are included when the class is instantiated with just the primary ".shp" file.

@@ -183,46 +183,39 @@
    ext = ".shp"  # the main file that will be mapped to fspath

-    @mark.required
    @property
    def index_file(self):
        return GisShapeIndex(self.select_by_ext(GisShapeIndex))

-    @mark.required
    @property
    def features_file(self):
        return GisShapeFeatures(self.select_by_ext(GisShapeFeatures))

-    @mark.required
    @property
    def project_file(self):
        return WellKnownText(self.select_by_ext(WellKnownText), allow_none=True)

-    @mark.required
    @property
    def spatial_index_n_file(self):
        return GisShapeSpatialIndexN(
            self.select_by_ext(GisShapeSpatialIndexN), allow_none=True
        )

-    @mark.required
    @property
    def spatial_index_b_file(self):
        return GisShapeSpatialIndexB(
            self.select_by_ext(GisShapeSpatialIndexB), allow_none=True
        )

-    @mark.required
    @property
    def geospatial_metadata_file(self):
        return GisShapeGeoSpatialMetadata(
            self.select_by_ext(GisShapeGeoSpatialMetadata), allow_none=True
        )
-Marking the properties as required means that they need to be able to return a
-value without raising a ``FormatsMismatchError`` for the class to be initiated. Required
-properties that appear in the ``fspaths`` attribute of the object are considered to be
-"required paths", and are copied alongside the main path in the ``copy_to`` method.
+Properties that appear in the ``fspaths`` attribute of the object are considered to be
+"required paths", and are copied alongside the main path in the ``copy_to`` method
+even when the ``trim`` argument is set to True.

After the required properties have been checked, deeper checks can be performed using the
``check`` decorator. Take the ``fileformats.image.Tiff`` class
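The required-property pattern can be illustrated with a stripped-down stand-in. These are hypothetical classes, not the actual *FileFormats* base classes: each property locates an adjacent file by extension, raising when a mandatory companion is missing, so that constructing the set fails for incomplete file-sets:

```python
from pathlib import Path
from typing import Optional


class FormatMismatchError(Exception):
    """Raised when a required companion file cannot be found"""


class ShapeFileSet:
    """Toy file-set: a primary .shp file plus adjacent files sharing its stem"""

    def __init__(self, fspath: str) -> None:
        self.primary = Path(fspath)
        # Collect neighbouring files with the same stem (mimics WithAdjacentFiles)
        self.fspaths = set(self.primary.parent.glob(self.primary.stem + ".*"))

    def _select_by_ext(self, ext: str, allow_none: bool = False) -> Optional[Path]:
        matches = [p for p in self.fspaths if p.suffix == ext]
        if not matches:
            if allow_none:
                return None
            raise FormatMismatchError(f"no adjacent {ext} file for {self.primary}")
        return matches[0]

    @property
    def index_file(self) -> Path:
        # Required companion: accessing this raises for incomplete file-sets
        return self._select_by_ext(".shx")

    @property
    def project_file(self) -> Optional[Path]:
        # Optional companion file
        return self._select_by_ext(".prj", allow_none=True)
```

Because the properties walk the collected ``fspaths``, copying or hashing the whole set only needs that one collection, which is the same idea behind treating a file-set with auxiliary files as a single object.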
