Commit

Merge pull request #61 from ArcanaFramework/py.typed
Typing support
tclose authored Sep 4, 2024
2 parents e73dd6b + e68db9c commit a9a3715
Showing 78 changed files with 2,312 additions and 1,161 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci-cd.yml
@@ -17,7 +17,7 @@ jobs:
strategy:
matrix:
os: [macos-latest, ubuntu-latest, windows-latest]
-python-version: ["3.8", "3.12"]
+python-version: ["3.8", "3.9", "3.12"]
fail-fast: false
runs-on: ${{ matrix.os }}
defaults:
33 changes: 14 additions & 19 deletions README.md
@@ -8,23 +8,18 @@

<img src="./docs/source/_static/images/logo_small.png" alt="Logo Small" style="float: right; width: 100mm">

-*Fileformats* provides a library of file-format types implemented as Python classes.
-The file-format types were designed to be used in type validation and data movement
-during the construction and execution of data workflows. However, they can also be
-used for some basic data handling methods (e.g. loading data to dictionaries) and format
-conversions between some equivalent types via methods defined in the associated
-[fileformats-extras](https://pypi.org/project/fileformats-extras/) package.
-
-File-format types are typically identified by a combination of file extension
-and "magic numbers" where applicable. However, unlike many other file-type Python packages,
-*FileFormats* supports multi-file data formats ("file sets") often found in scientific
-workflows, e.g. with separate header/data files. *FileFormats* also provides a flexible
-framework to add custom identification routines for exotic file formats, e.g.
-formats that require inspection of headers to locate data files, directories containing
-certain file types, or to peek at metadata fields to define specific sub-types
-(e.g. functional MRI DICOM file set). It is in the handling of multi-file formats that
-fileformats comes into its own, since it keeps track of auxiliary files when moving/copying
-to different file-system locations and calculating hashes.
+*Fileformats* provides a library of file-format types implemented as Python classes for
+validation, detection and typing, with hooks for extra functionality and format
+conversions. Formats are typically validated/identified by a combination of file extension
+and "magic numbers" where applicable. Unlike other file-type packages, *FileFormats*
+supports multi-file data formats ("file sets"), which are often found in scientific
+workflows, e.g. with separate header/data files.
+
+*FileFormats* provides a flexible extension framework to add custom identification
+routines for exotic file formats, e.g. formats that require inspection of headers to
+locate data files, directories containing certain file types, or to peek at metadata
+fields to define specific sub-types (e.g. functional MRI DICOM file set). These file-sets
+with auxiliary files can be moved, copied and hashed as if they were a single file object.

See the [extension template](https://github.com/ArcanaFramework/fileformats-extension-template)
for instructions on how to design *FileFormats* extensions modules to augment the
@@ -41,12 +41,12 @@
tested on real data and so should be treated with some caution. If you encounter any issues with an implemented file
type, please raise an issue in the [GitHub tracker](https://github.com/ArcanaFramework/fileformats/issues).

-Adding support for vendor formats will be relatively straightforward and is planned for v1.0.
+Adding support for vendor formats is planned for v1.0.


## Installation

-*FileFormats* can be installed for Python >= 3.7 from PyPI with
+*FileFormats* can be installed for Python >= 3.8 from PyPI with

```console
$ python3 -m pip install fileformats
14 changes: 7 additions & 7 deletions conftest.py
@@ -33,24 +33,24 @@
# break at it
if os.getenv("_PYTEST_RAISE", "0") != "0":

-    @pytest.hookimpl(tryfirst=True)
-    def pytest_exception_interact(call):
+    @pytest.hookimpl(tryfirst=True)  # type: ignore
+    def pytest_exception_interact(call: ty.Any) -> None:
         raise call.excinfo.value

-    @pytest.hookimpl(tryfirst=True)
-    def pytest_internalerror(excinfo):
+    @pytest.hookimpl(tryfirst=True)  # type: ignore
+    def pytest_internalerror(excinfo: ty.Any) -> None:
         raise excinfo.value


 @pytest.fixture
-def work_dir():
+def work_dir() -> Path:
     work_dir = tempfile.mkdtemp()
     return Path(work_dir)


 def write_test_file(
-    fpath: Path, contents: ty.Union[str, bytes] = "some contents", binary=False
-):
+    fpath: Path, contents: ty.Union[str, bytes] = "some contents", binary: bool = False
+) -> Path:
     fpath.parent.mkdir(exist_ok=True, parents=True)
     with open(fpath, "wb" if binary else "w") as f:
         f.write(contents)
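For illustration, the `write_test_file` helper above can be exercised standalone. This sketch re-declares the helper so it runs on its own; the `return fpath` at the end is an assumption, since the tail of the function is collapsed in the diff above:

```python
import tempfile
import typing as ty
from pathlib import Path


def write_test_file(
    fpath: Path, contents: ty.Union[str, bytes] = "some contents", binary: bool = False
) -> Path:
    # Create parent directories as needed, then write text or binary contents
    fpath.parent.mkdir(exist_ok=True, parents=True)
    with open(fpath, "wb" if binary else "w") as f:
        f.write(contents)
    return fpath  # assumed: the collapsed tail of the function returns the path


work_dir = Path(tempfile.mkdtemp())
text_file = write_test_file(work_dir / "nested" / "file.txt")
binary_file = write_test_file(work_dir / "file.bin", b"\x00\x01", binary=True)
```

Both calls create any missing parent directories before writing, which is why the fixtures above can hand out bare temporary directories.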
13 changes: 7 additions & 6 deletions docs/source/conf.py
@@ -12,18 +12,19 @@
# All configuration values have a default; values that are commented out
# serve to show the default.
from __future__ import print_function
import typing as ty
from pathlib import Path
import re
import datetime

from fileformats.core import __version__ # noqa

 with open(Path(__file__).parent / ".." / ".." / "AUTHORS") as f:
-    authors = [
-        re.match(r"([a-zA-Z\-\. ]+) <([a-zA-Z\-\._@]+)>", ln).groups()
-        for ln in f.read().split("\n")
-        if ln
-    ]
+    authors = []
+    for ln in f.read().splitlines():
+        match = re.match(r"([a-zA-Z\-\. ]+) <([a-zA-Z\-\._@]+)>", ln)
+        assert match
+        authors.append(match.groups())

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
@@ -97,7 +98,7 @@

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
-exclude_patterns = []
+exclude_patterns: ty.List[str] = []

# The reST default role (used for this markup: `text`) to use for all
# documents.
54 changes: 38 additions & 16 deletions docs/source/detection.rst
@@ -10,28 +10,31 @@ Validation
----------

In the basic case, *FileFormats* can be used for checking the format of files and
-directories against known types. Typically, there are two layers of checks, ones
-performed on the file-system paths alone,
+directories against known types. Typically this will involve checking the file extension
+and magic number if applicable

 .. code-block:: python

     from fileformats.image import Jpeg

     jpeg_file = Jpeg("/path/to/image.jpg")  # PASSES
-    jpeg_file = Jpeg("/path/to/image.png")  # FAILS!
+    Jpeg("/path/to/image.png")  # FAILS!
+
+    fake_fspath = "/path/to/fake-image.jpg"
+    with open(fake_fspath, "w") as f:
+        f.write("this is not a valid JPEG file")
+
+    Jpeg(fake_fspath)  # FAILS!

-The second layer of checks, which typically require reading the file and peeking at its
-contents for magic numbers and the like
-
-.. code-block:: python
-
-    fspath = "/path/to/fake-image.jpg"
-
-    with open(fspath, "w") as f:
-        f.write("this is not a valid JPEG file")
-
-    jpeg_file = Jpeg(fspath)  # FAILS!
+To check whether a format matches without attempting to initialise the object use the
+:meth:`FileSet.matches()` method
+
+.. code-block:: python
+
+    if Jpeg.matches("/path/to/image.jpg"):
+        ...
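The two validation layers described here (a cheap check on the file-system path, then a peek at the contents) can be sketched in plain Python. This is an illustrative standalone helper, not the *FileFormats* API; the magic number used is the standard `FF D8 FF` prefix that JPEG files start with:

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files begin with these bytes


def looks_like_jpeg(fspath: str) -> bool:
    path = Path(fspath)
    # Layer 1: check the file-system path alone (extension)
    if path.suffix.lower() not in (".jpg", ".jpeg"):
        return False
    # Layer 2: peek at the file contents for the magic number
    with open(path, "rb") as f:
        return f.read(len(JPEG_MAGIC)) == JPEG_MAGIC
```

A file named `fake-image.jpg` containing plain text passes the first layer but fails the second, which is exactly the failure mode the `Jpeg(fake_fspath)` example above demonstrates.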
Directories are classified by the contents of the files within them, via the
@@ -70,7 +73,7 @@ despite the presence of the ``.DS_Store`` directory and the ``catalog.xml`` file
In addition to statically defining `Directory` formats such as the Dicom example above,
dynamic directory types can be created on the fly by providing the content types as
-arguments to the `DirectoryOf[]` method,
+"classifier" arguments to the `DirectoryOf[]` class (see :ref:`Classifiers`),
e.g.

.. code-block:: python
@@ -82,9 +85,6 @@ e.g.
def my_task(image_dir: DirectoryOf[Png]) -> Csv:
... task implementation ...
-.. _Pydra: https://pydra.readthedocs.io
-.. _Fastr: https://gitlab.com/radiology/infrastructure/fastr

Identification
--------------
@@ -94,7 +94,7 @@

The ``find_matching`` function can be used to list the formats that match a given file
.. code-block::
>>> from fileformats.core import find_matching
->>> find_matching("/path/to/word.doc")
+>>> find_matching(["/path/to/word.doc"])
[<class 'fileformats.application.Msword'>]
.. warning::
@@ -103,4 +103,26 @@
If you are only interested in formats covered in the main fileformats package then
you should use the ``standard_only`` flag

For loosely defined formats without many constraints, ``find_matching`` may return multiple
formats that are not plausible for the given use case, in which case the ``candidates``
argument can be passed to restrict the possible formats that can be returned

.. code-block::
>>> from fileformats.datascience import MatFile, RData, Hdf5
>>> find_matching(["/path/to/text/matrix/file.mat"])
[fileformats.datascience.data.TextMatrix]
>>> find_matching(["/path/to/matlab/file.mat"])
[fileformats.datascience.data.TextMatrix, fileformats.datascience.data.MatFile]
>>> find_matching(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5])
[fileformats.datascience.data.MatFile]
``from_paths`` can be used to return an initialised object instead of a list of matching
formats; however, since you need to be confident that there is only one possible format,
it is advisable to also provide a list of candidate formats

.. code-block::
>>> from fileformats.core import from_paths
>>> repr(from_paths(["/path/to/matlab/file.mat"], candidates=[MatFile, RData, Hdf5]))
fileformats.datascience.data.MatFile({"/path/to/matlab/file.mat"})
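The effect of the ``candidates`` argument can be mimicked with a toy matcher. The registry and function name here are hypothetical, not the *FileFormats* implementation, and matching is by extension only for brevity; the point is that every known format is tried unless the caller narrows the search space:

```python
from typing import Callable, Dict, List, Optional, Sequence

# Toy registry mapping format names to "does this path match?" predicates
FORMATS: Dict[str, Callable[[str], bool]] = {
    "MatFile": lambda p: p.endswith(".mat"),
    "TextMatrix": lambda p: p.endswith(".mat") or p.endswith(".txt"),
    "Hdf5": lambda p: p.endswith(".h5"),
}


def find_matching_formats(
    fspath: str, candidates: Optional[Sequence[str]] = None
) -> List[str]:
    # Restrict the search space when the caller knows which formats are plausible
    names = candidates if candidates is not None else list(FORMATS)
    return [name for name in names if FORMATS[name](fspath)]
```

Here ``find_matching_formats("data.mat")`` returns both ``MatFile`` and ``TextMatrix``, while passing ``candidates=["MatFile", "Hdf5"]`` narrows the result to ``MatFile`` alone, mirroring the ``candidates=[MatFile, RData, Hdf5]`` call above.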
19 changes: 6 additions & 13 deletions docs/source/developer.rst
@@ -131,12 +131,12 @@ Custom format patterns
----------------------

While the standard mixin classes should cover 90% of all formats, in the wild-west of
-scientific data formats you might need to write custom validators using the
-``@fileformats.core.mark.required`` and ``@fileformats.core.mark.check`` decorators.
+scientific data formats you might need to write custom validators. This is simply done
+by adding a new property to the class using the ``@property`` decorator.

Take for example the `GIS shapefile structure <https://www.earthdatascience.org/courses/earth-analytics/spatial-data-r/shapefile-structure/>`_,
it is a file-set consisting of 3 to 6 files differentiated by their extensions. To
-implement this class we use the ``required`` decorator. We inherit from the ``WithAdjacentFiles``
+implement this class we use the ``@property`` decorator. We inherit from the ``WithAdjacentFiles``
mixin so that neighbouring files (i.e. files with the same stem but different extension)
are included when the class is instantiated with just the primary ".shp" file.

@@ -183,46 +183,39 @@
    ext = ".shp"  # the main file that will be mapped to fspath

-    @mark.required
    @property
    def index_file(self):
        return GisShapeIndex(self.select_by_ext(GisShapeIndex))

-    @mark.required
    @property
    def features_file(self):
        return GisShapeFeatures(self.select_by_ext(GisShapeFeatures))

-    @mark.required
    @property
    def project_file(self):
        return WellKnownText(self.select_by_ext(WellKnownText), allow_none=True)

-    @mark.required
    @property
    def spatial_index_n_file(self):
        return GisShapeSpatialIndexN(
            self.select_by_ext(GisShapeSpatialIndexN), allow_none=True
        )

-    @mark.required
    @property
    def spatial_index_b_file(self):
        return GisShapeSpatialIndexB(
            self.select_by_ext(GisShapeSpatialIndexB), allow_none=True
        )

-    @mark.required
    @property
    def geospatial_metadata_file(self):
        return GisShapeGeoSpatialMetadata(
            self.select_by_ext(GisShapeGeoSpatialMetadata), allow_none=True
        )
-Marking the properties as required means that they need to be able to return a
-value without raising a ``FormatsMismatchError`` for the class to be initiated. Required
-properties that appear in the ``fspaths`` attribute of the object are considered to be
-"required paths", and are copied alongside the main path in the ``copy_to`` method.
+Properties that appear in the ``fspaths`` attribute of the object are considered to be
+"required paths", and are copied alongside the main path in the ``copy_to`` method
+even when the ``trim`` argument is set to True.

After the required properties have been checked, deeper checks can be performed using the
``check`` decorator. Take the ``fileformats.image.Tiff`` class
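The required-property pattern can be illustrated with a stripped-down stand-in. These are hypothetical classes, not the actual *FileFormats* base classes: each property locates an adjacent file by extension, raising when a mandatory companion is missing, so that constructing the set fails for incomplete file-sets:

```python
from pathlib import Path
from typing import Optional


class FormatMismatchError(Exception):
    """Raised when a required companion file cannot be found"""


class ShapeFileSet:
    """Toy file-set: a primary .shp file plus adjacent files sharing its stem"""

    def __init__(self, fspath: str) -> None:
        self.primary = Path(fspath)
        # Collect neighbouring files with the same stem (mimics WithAdjacentFiles)
        self.fspaths = set(self.primary.parent.glob(self.primary.stem + ".*"))

    def _select_by_ext(self, ext: str, allow_none: bool = False) -> Optional[Path]:
        matches = [p for p in self.fspaths if p.suffix == ext]
        if not matches:
            if allow_none:
                return None
            raise FormatMismatchError(f"no adjacent {ext} file for {self.primary}")
        return matches[0]

    @property
    def index_file(self) -> Path:
        # Required companion: accessing this raises for incomplete file-sets
        return self._select_by_ext(".shx")

    @property
    def project_file(self) -> Optional[Path]:
        # Optional companion file
        return self._select_by_ext(".prj", allow_none=True)
```

Because the properties walk the collected ``fspaths``, copying or hashing the whole set only needs that one collection, which is the same idea behind treating a file-set with auxiliary files as a single object.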
