Refactored documentation #70

Merged
merged 8 commits into from
Aug 29, 2024
22 changes: 18 additions & 4 deletions README.rst
Expand Up @@ -14,6 +14,8 @@ FileFormats
:target: https://arcanaframework.github.io/fileformats/
:alt: Documentation Status

.. image:: ./docs/source/_static/images/logo_small.png
:alt: Logo

*Fileformats* provides a library of file-format types implemented as Python classes.
The file-format types were designed to be used in type validation and data movement
Expand Down Expand Up @@ -110,19 +112,31 @@ There are 2 main functions that can be used for format identification
``from_mime``
~~~~~~~~~~~~~

As the name suggests, this function is used to return the FileFormats class corresponding to a given `MIME <https://www.iana.org/assignments/media-types/media-types.xhtml>`__ string. All non-vendor official MIME-types are supported. Non-official types can be loaded using the `application/x-name-of-type`
form as long as the name of the type is unique amongst all installed format types. To avoid name clashes between different extension types, the "MIME-like" string can be used instead, where informal registries corresponding to the fileformats extension namespace are used instead, e.g. `medimage/nifti-gz` or `datascience/hdf5`.
As the name suggests, this function is used to return the FileFormats class corresponding
to a given `MIME <https://www.iana.org/assignments/media-types/media-types.xhtml>`__ string.
All non-vendor official MIME-types are supported. Non-official types can be loaded using
the `application/x-name-of-type` form as long as the name of the type is unique amongst
all installed format types. To avoid name clashes between different extension types, a
"MIME-like" string can be used instead, in which informal registries corresponding to the
fileformats extension namespaces take the place of the official MIME registries,
e.g. `medimage/nifti-gz` or `datascience/hdf5`.
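
The two lookup forms described above can be illustrated with a toy resolver. This is only a sketch of the idea: the ``REGISTRY`` dictionary, the ``resolve`` function and the class-name strings are hypothetical stand-ins, not the ``from_mime`` implementation.

```python
# Toy illustration only: REGISTRY and resolve() are hypothetical and are not
# part of the fileformats package.
REGISTRY = {
    # (namespace, type name) -> stand-in for a format class
    ("medimage", "nifti-gz"): "NiftiGz",
    ("datascience", "hdf5"): "Hdf5",
    ("image", "png"): "Png",
}


def resolve(mime_like: str) -> str:
    """Look up a stand-in class name from a MIME or "MIME-like" string."""
    namespace, _, name = mime_like.partition("/")
    if namespace == "application" and name.startswith("x-"):
        # Informal form: search every namespace, so the name must be unique
        bare = name[len("x-"):]
        hits = [cls for (_, n), cls in REGISTRY.items() if n == bare]
        if len(hits) != 1:
            raise LookupError(f"{bare!r} matched {len(hits)} types, expected 1")
        return hits[0]
    # "MIME-like" form: the namespace disambiguates the lookup
    return REGISTRY[(namespace, name)]


by_namespace = resolve("medimage/nifti-gz")
by_unique_name = resolve("application/x-hdf5")
```

The namespaced form stays unambiguous even if two extension packages define a type with the same name, which is the motivation given above for preferring it.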

``find_matching``
~~~~~~~~~~~~~~~~~

Given a set of file-system paths, by default, ``find_matching`` will iterate through all installed fileformats classes and return all that validate successfully (formats without any specific constraints are excluded by default). The potential candidate classes can be restricted by using the `candidates` keyword argument.
Given a set of file-system paths, by default, ``find_matching`` will iterate through all
installed fileformats classes and return all that validate successfully (formats without
any specific constraints are excluded by default). The potential candidate classes can be
restricted by using the `candidates` keyword argument.


Format Conversion
-----------------

While not implemented in the main File-formats itself, file-formats provides hooks for other packages to implement extra behaviour such as format conversion. The `fileformats-extras <https://github.com/ArcanaFramework/fileformats-extras>`__ implements a number of converters between standard file-format types, e.g. archive types to/from generic file/directories, which if installed can be called using the `convert()` method.
While not implemented in the main *FileFormats* package itself, hooks are provided for
other packages to implement extra behaviour such as format conversion.
The `fileformats-extras <https://github.com/ArcanaFramework/fileformats-extras>`__
package implements a number of converters between standard file-format types, e.g. archive
types to/from generic files/directories, which, if installed, can be called using the
`convert()` method.

.. code-block:: python

Expand Down
Binary file added docs/logo_dev/logo.webp
Binary file not shown.
Binary file added docs/logo_dev/snake-around-folder.webp
Binary file not shown.
Binary file added docs/logo_dev/snake-transparent.psd
Binary file not shown.
Binary file added docs/logo_dev/snake-transparent.webp
Binary file not shown.
Binary file added docs/logo_dev/snake-trimmed-mod.psd
Binary file not shown.
Binary file added docs/logo_dev/snake-trimmed.psd
Binary file not shown.
Binary file added docs/source/_static/images/logo.png
Binary file added docs/source/_static/images/logo_small.png
67 changes: 67 additions & 0 deletions docs/source/api.rst
@@ -0,0 +1,67 @@
Public API
==========

Functions
~~~~~~~~~

.. autofunction:: fileformats.core.to_mime

.. autofunction:: fileformats.core.from_mime

.. autofunction:: fileformats.core.find_matching

.. autofunction:: fileformats.core.from_paths


Core
~~~~

.. autoclass:: fileformats.core.FileSet
:members: mime_type, mime_like, from_mime, strext, unconstrained, possible_exts, metadata, select_metadata, select_by_ext, matching_exts, convert, get_converter, register_converter, all_formats, standard_formats, hash, hash_files, mock, sample, decomposed_fspaths, from_paths, copy, move

.. autoclass:: fileformats.core.Field
:members: mime_like, from_mime, to_primitive, from_primitive


Generic
~~~~~~~

.. autoclass:: fileformats.generic.FsObject

.. autoclass:: fileformats.generic.File

.. autoclass:: fileformats.generic.Directory

.. autoclass:: fileformats.generic.DirectoryOf

.. autoclass:: fileformats.generic.SetOf


Field
~~~~~

.. autoclass:: fileformats.field.Text

.. autoclass:: fileformats.field.Integer

.. autoclass:: fileformats.field.Decimal

.. autoclass:: fileformats.field.Boolean

.. autoclass:: fileformats.field.Array


Mixins
~~~~~~

.. autoclass:: fileformats.core.mixin.WithMagicNumber

.. autoclass:: fileformats.core.mixin.WithMagicVersion

.. autoclass:: fileformats.core.mixin.WithAdjacentFiles

.. autoclass:: fileformats.core.mixin.WithSeparateHeader

.. autoclass:: fileformats.core.mixin.WithSideCars

.. autoclass:: fileformats.core.mixin.WithClassifiers
2 changes: 1 addition & 1 deletion docs/source/conf.py
Expand Up @@ -160,7 +160,7 @@

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
# html_logo = "_static/images/logo_small.png"
html_logo = "_static/images/logo_small.png"

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
Expand Down
106 changes: 106 additions & 0 deletions docs/source/detection.rst
@@ -0,0 +1,106 @@

Detection
=========

*FileFormats* has been designed to detect whether a set of files matches a given
format specification. This can be used either to validate file types in workflows
or to identify the format in which user input files have been provided.

Validation
----------

In the basic case, *FileFormats* can be used for checking the format of files and
directories against known types. Typically, there are two layers of checks. The first
layer is performed on the file-system paths alone:

.. code-block:: python

from fileformats.image import Jpeg

jpeg_file = Jpeg("/path/to/image.jpg") # PASSES
jpeg_file = Jpeg("/path/to/image.png") # FAILS!


The second layer of checks typically requires reading the file and peeking at its
contents for magic numbers and the like:

.. code-block:: python

    fspath = "/path/to/fake-image.jpg"

    # Write plain text to a file that has a ".jpg" extension
    with open(fspath, "w") as f:
        f.write("this is not a valid JPEG file")

    jpeg_file = Jpeg(fspath)  # FAILS!
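
To make the idea of a content check concrete, the following standard-library sketch compares the leading bytes of a file with the JPEG signature. The helper ``has_magic_number`` is a hypothetical name used for illustration and is not the check that *FileFormats* itself runs.

```python
import os
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # the first three bytes of a JPEG file


def has_magic_number(path, magic, offset=0):
    """Return True if the bytes at ``offset`` in ``path`` equal ``magic``."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(len(magic)) == magic


with tempfile.TemporaryDirectory() as tmp:
    # Right extension, wrong contents
    fake = os.path.join(tmp, "fake-image.jpg")
    with open(fake, "w") as f:
        f.write("this is not a valid JPEG file")

    # Only a JPEG-like header, not a complete image
    real = os.path.join(tmp, "real-image.jpg")
    with open(real, "wb") as f:
        f.write(JPEG_MAGIC + b"\xe0" + b"\x00" * 16)

    fake_ok = has_magic_number(fake, JPEG_MAGIC)
    real_ok = has_magic_number(real, JPEG_MAGIC)
```

Both files pass a path-only check because they end in ``.jpg``; only the content check tells them apart.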


Directories are classified by the contents of the files within them, via the
``content_types`` class attribute, e.g.

.. code-block:: python

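    from fileformats.core.mixin import WithMagicNumber
    from fileformats.generic import File, Directory

    class Dicom(WithMagicNumber, File):
        magic_number = b"DICM"  # marker bytes expected in the file
        magic_number_offset = 128  # byte offset at which the marker is expected

    class DicomDir(Directory):
        content_types = (Dicom,)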


Note that only one file within the directory needs to match the specified content type
for it to be considered a match, and additional files will be ignored. For example,
the ``DicomDir`` type would be considered valid on the following directory structure
despite the presence of the ``.DS_Store`` directory and the ``catalog.xml`` file.

.. code-block::

dicom-directory
├── .DS_Store
│ ├── deleted-file1.txt
│ ├── deleted-file2.txt
│ └── ...
├── 1.dcm
├── 2.dcm
├── 3.dcm
├── ...
├── 1024.dcm
└── catalog.xml
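
The "at least one matching file" rule can be sketched with the standard library alone. The helpers ``is_dicom`` and ``directory_matches`` are hypothetical names for illustration; they mimic the rule described above rather than reproduce the *FileFormats* implementation.

```python
import os
import tempfile


def is_dicom(path):
    """Check for the b"DICM" marker at byte offset 128."""
    with open(path, "rb") as f:
        f.seek(128)
        return f.read(4) == b"DICM"


def directory_matches(dirpath, content_check):
    """True if at least one regular file directly inside ``dirpath`` passes."""
    for entry in os.scandir(dirpath):
        if entry.is_file() and content_check(entry.path):
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    # One file with the marker, plus unrelated extras that are ignored
    with open(os.path.join(tmp, "1.dcm"), "wb") as f:
        f.write(b"\x00" * 128 + b"DICM")
    with open(os.path.join(tmp, "catalog.xml"), "w") as f:
        f.write("<catalog/>")
    os.mkdir(os.path.join(tmp, ".DS_Store"))
    matched = directory_matches(tmp, is_dicom)

with tempfile.TemporaryDirectory() as tmp:
    # No file carries the marker, so the directory does not match
    with open(os.path.join(tmp, "catalog.xml"), "w") as f:
        f.write("<catalog/>")
    unmatched = directory_matches(tmp, is_dicom)
```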

In addition to statically defining ``Directory`` formats such as the ``DicomDir`` example
above, dynamic directory types can be created on the fly by providing the content types in
square brackets to ``DirectoryOf``,
e.g.

.. code-block:: python

    from fileformats.generic import DirectoryOf
    from fileformats.image import Png
    from fileformats.text import Csv

    def my_task(image_dir: DirectoryOf[Png]) -> Csv:
        ...  # task implementation

.. _Pydra: https://pydra.readthedocs.io
.. _Fastr: https://gitlab.com/radiology/infrastructure/fastr


Identification
--------------

The ``find_matching`` function can be used to list the formats that match a given file:

.. code-block::

>>> from fileformats.core import find_matching
>>> find_matching("/path/to/word.doc")
[<class 'fileformats.application.Msword'>]
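
Conceptually, identification runs every candidate check against the file and collects those that pass, so calling code should be prepared for zero, one or several matches. The sketch below shows this with two signature checks; ``CANDIDATES``, ``find_matching_sketch`` and ``identify`` are hypothetical names and do not reflect how ``find_matching`` is implemented.

```python
import os
import tempfile


def looks_like_png(path):
    with open(path, "rb") as f:
        return f.read(8) == b"\x89PNG\r\n\x1a\n"


def looks_like_jpeg(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xff\xd8\xff"


# Hypothetical candidate table: format name -> validation function
CANDIDATES = {"Png": looks_like_png, "Jpeg": looks_like_jpeg}


def find_matching_sketch(path, candidates=CANDIDATES):
    """Return the names of all candidates whose check passes."""
    return [name for name, check in candidates.items() if check(path)]


def identify(path):
    """Insist on exactly one match instead of silently taking the first."""
    matches = find_matching_sketch(path)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, got {matches}")
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    png_path = os.path.join(tmp, "image.png")
    with open(png_path, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
    txt_path = os.path.join(tmp, "notes.txt")
    with open(txt_path, "w") as f:
        f.write("plain text")

    png_matches = find_matching_sketch(png_path)
    txt_matches = find_matching_sketch(txt_path)
    identified = identify(png_path)
```

Checking the number of matches explicitly, as ``identify`` does, keeps calling code from breaking silently when an additional format starts matching the same file.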

.. warning::
The installation of extension packages may cause detection code to break if one of
the newly added formats also matches the file and your code doesn't handle this case.
If you are only interested in formats covered in the main fileformats package then
you should use the ``standard_only`` flag.

Alter
6 changes: 3 additions & 3 deletions docs/source/developer.rst
Expand Up @@ -260,8 +260,8 @@ files and another one for little endian files. Therefore we can't just use the
``fileformats.core.mark.check``.


Converters
----------
Implementing converters
-----------------------

Converters between two equivalent formats are defined using Pydra_ dataflow engine
`tasks <https://pydra.readthedocs.io/en/latest/components.html>`_. There are two types
Expand Down Expand Up @@ -409,7 +409,7 @@ a warning if the import fails, when get_converter is called on a format in that
namespace.


.. note::
.. warning::
If the converters aren't imported successfully, then you will receive a
``FormatConversionError`` error saying there are no converters between FormatA and
FormatB.
Expand Down
93 changes: 93 additions & 0 deletions docs/source/extras.rst
@@ -0,0 +1,93 @@

Read, write and convert
=======================

In addition to the basic features of validation and path handling, it is possible to
implement methods to interact with the data of file format objects via "extras hooks".
Such features are added to selected format classes on an as-needed basis (pull requests
welcome 😊, see :ref:`Developer Guide`), so are by no means comprehensive, and
are provided "as-is".

Since these features typically rely on a range of external libraries, they are kept in
separate *extras* packages (e.g.
`fileformats-extras <https://pypi.org/project/fileformats-extras/>`__,
`fileformats-medimage-extras <https://pypi.org/project/fileformats-medimage-extras/>`__),
which need to be installed separately.


Metadata
--------

If there has been an extras overload registered for the ``read_metadata`` method,
then metadata associated with the fileset can be accessed via the ``metadata`` property,
e.g.

.. code-block:: python

>>> dicom.metadata["SeriesDescription"]
"localizer"

Formats that use the ``WithSeparateHeader`` and ``WithSideCars`` mixin classes will attempt
to read the side car if a metadata reader is implemented for it (e.g. JSON) and merge that
with any header information read from the primary file.


Reading and writing
-------------------

Several classes in the base fileformats package implement ``load`` and ``save`` methods.
An advantage of implementing them in the format class is that objects instantiated from
them can then be duck-typed in calling functions/methods. For example, the ``Yaml`` and
``Json`` formats (both inherit from the ``DataSerialization`` type) implement the
``load`` method, which returns a dictionary:

.. code-block:: python

from fileformats.application import DataSerialization # i.e. JSON or YAML

    def read_serialisation(serialized: DataSerialization) -> dict:
        return serialized.load()
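
The duck-typing benefit can be shown without the library. In this sketch ``Loadable``, ``ToyJson`` and ``ToyKeyValue`` are hypothetical stand-ins for format classes that share a ``load`` method; they are not *FileFormats* types.

```python
import json
from typing import Protocol


class Loadable(Protocol):
    """Anything that exposes a ``load`` method returning a dictionary."""

    def load(self) -> dict: ...


class ToyJson:
    def __init__(self, text: str):
        self.text = text

    def load(self) -> dict:
        return json.loads(self.text)


class ToyKeyValue:
    """Parses ``key=value`` lines; values are kept as strings."""

    def __init__(self, text: str):
        self.text = text

    def load(self) -> dict:
        return dict(line.split("=", 1) for line in self.text.splitlines() if line)


def read_serialisation(serialized: Loadable) -> dict:
    # The caller does not need to know which concrete format it was given
    return serialized.load()


from_json = read_serialisation(ToyJson('{"a": 1}'))
from_kv = read_serialisation(ToyKeyValue("a=1\nb=2"))
```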


Converters
----------

Several conversion methods are available between equivalent file-formats in the standard
classes. For example, archive types such as ``Zip`` can be converted to and from generic
files/directories using the ``convert`` classmethod of the target format:

.. code-block:: python

from fileformats.application import Zip
from fileformats.generic import Directory

# Example round trip from directory to zip file
zip_file = Zip.convert(Directory("/path/to/a/directory"))
extracted = Directory.convert(zip_file)
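
For intuition about what such a round trip involves, the same directory-to-zip-to-directory sequence can be written with the standard library's ``shutil`` module. This is only an analogy and makes no claim about how the ``fileformats-extras`` converters are implemented.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a small source directory
    src = os.path.join(tmp, "a_directory")
    os.mkdir(src)
    with open(os.path.join(src, "notes.txt"), "w") as f:
        f.write("hello")

    # Directory -> zip archive (make_archive appends the ".zip" suffix)
    zip_path = shutil.make_archive(os.path.join(tmp, "archive"), "zip", root_dir=src)

    # Zip archive -> directory
    extracted = os.path.join(tmp, "extracted")
    shutil.unpack_archive(zip_path, extracted)

    with open(os.path.join(extracted, "notes.txt")) as f:
        round_tripped = f.read()
    zip_name = os.path.basename(zip_path)
```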

The converters are implemented in the Pydra_ dataflow framework and can be linked into
wider Pydra_ workflows by accessing the underlying converter task with the ``get_converter``
classmethod:

.. code-block:: python

import pydra
from pydra.tasks.mypackage import MyTask
from fileformats.image import Gif, Png

wf = pydra.Workflow(name="a_workflow", input_spec=["in_gif"])
wf.add(
Png.get_converter(Gif, name="gif2png", in_file=wf.lzin.in_gif)
)
wf.add(
MyTask(
name="my_task",
in_file=wf.gif2png.lzout.out_file,
)
)
...



.. _Pydra: https://pydra.readthedocs.io
.. _Analyze: https://en.wikipedia.org/wiki/Analyze_(imaging_software)