awkward-1.0

Development of Awkward 1.0, to replace scikit-hep/awkward-array in 2020.

Motivation for a new Awkward Array

Awkward Array has proven to be a useful way to analyze variable-length and tree-like data in Python, by extending NumPy's idioms from rectilinear arrays to arrays of complex data structures. For over a year, physicists have been using Awkward Array both in and out of uproot; it is already one of the most popular Python packages in particle physics.

However, its pure-NumPy implementation is hard to extend (finding for-loop-free implementations of operations on nested data is hard) and hard to maintain (most bugs are NumPy corner cases). Also, the feedback users have given me through GitHub, StackOverflow, and in-person tutorials has pointed out some design mistakes. A backward-incompatible release will allow us to fix those design mistakes while providing freedom to make deep changes in the implementation.

The Awkward 1.0 project is a major investment, a six-month sprint from late August 2019 to late February 2020. The time spent on a clean, robust Awkward Array is justified by the widespread adoption of Awkward 0.x: its usefulness to the community has been demonstrated.

Main goals of Awkward 1.0

  • Full access to create and manipulate Awkward Arrays in C++ with no Python dependencies. This is so that C++ libraries can produce and share data with Python front-ends.
  • Easy installation with pip install and conda install for most users (Mac, Windows, and most Linux).
  • Imperative (for-loop-style) access to Awkward Arrays in Numba, a just-in-time compiler for Python. This is so that physicists can write critical loops in straightforward Python without a performance penalty.
  • A single awkward.Array class that hides the details of how columnar data is built, with a suite of operations that apply to all internal types.
  • Conformance to NumPy, where Awkward and NumPy overlap.
  • Better control over "behavioral mix-ins," such as LorentzVector (i.e. adding methods like pt() to arrays of records with px and py fields). In Awkward 0.x, this was achieved with multiple inheritance, but that was brittle.
  • Support for set operations and database-style joins, which can be put to use in a declarative analysis language, but require database-style accounting of an index (like a Pandas index).
  • Better interoperability with Pandas, NumExpr, and Dask, while maintaining support for ROOT, Arrow, and Parquet.
  • Ability to add GPU implementations of array operations in the future.
  • Better error messages and extensive documentation.
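
The "behavioral mix-ins" goal can be illustrated with a small plain-Python sketch. This is not Awkward Array's design or API: `behaviors`, `register`, `RecordColumns`, and `LorentzVectorMethods` are hypothetical names, and the registry approach is only one possible alternative to multiple inheritance. It shows a pt() method being attached to columnar records with px and py fields by looking up the record's name rather than by subclassing.

```python
import math

# Hypothetical sketch: attach behavior by record name through a registry
# instead of multiple inheritance. None of these names are Awkward's API.

behaviors = {}


def register(record_name):
    """Class decorator that files a class of methods under a record name."""
    def decorator(cls):
        behaviors[record_name] = cls
        return cls
    return decorator


class RecordColumns:
    """Columnar records: one list per field, plus a name selecting behavior."""

    def __init__(self, name, fields):
        self.name = name
        self.fields = fields

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails; ask the registry.
        cls = behaviors.get(self.__dict__.get("name"))
        method = getattr(cls, attr, None) if cls is not None else None
        if method is None:
            raise AttributeError(attr)
        return lambda *args: method(self, *args)


@register("LorentzVector")
class LorentzVectorMethods:
    @staticmethod
    def pt(array):
        # Transverse momentum computed column-wise from px and py.
        px, py = array.fields["px"], array.fields["py"]
        return [math.hypot(x, y) for x, y in zip(px, py)]


vectors = RecordColumns("LorentzVector", {"px": [3.0, 0.0], "py": [4.0, 1.0]})
pts = vectors.pt()
print(pts)  # [5.0, 1.0]
```

Because the data class never inherits from the methods class, new behaviors can be registered or replaced without touching the array types themselves.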

Architecture of Awkward 1.0

To achieve these goals, Awkward 1.0 is separated into four layers:

  1. The user-facing Python layer with a single awkward.Array class, whose data is described by a datashape type.
  2. The columnar representation (i.e. nested ListArray, RecordArray, etc.) is accessible but hidden, and these are all C++ classes presented to Python through pybind11.
  3. Two object models for the columnar representation, one in C++11 (with only header-only dependencies) and the other as Numba extensions. This is the only layer in which array-allocation occurs.
  4. A suite of operations on arrays, computing new values but not allocating memory. The first implementation of this suite is in C++ with a pure-C interface; the second may be CUDA (or other GPU language). With one exception (FillableArray), iterations over arrays only occur at this level, so performance optimizations can focus on this layer.
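
The division of labor between layers 3 and 4 can be sketched in a few lines of Python. The names `kernel_list_counts` and `list_counts` are hypothetical and not part of Awkward Array; the actual kernels are C++ behind a pure-C interface. The sketch only illustrates the rule stated above: the kernel iterates and writes into a buffer it is handed, while the calling layer owns the allocation.

```python
# Hypothetical sketch of the layer 3 / layer 4 split; none of these names
# are part of Awkward Array's API.

def kernel_list_counts(out, starts, stops, length):
    """Layer-4 style kernel: loops over the data and writes into `out`.

    It never allocates; the caller must pass a buffer of size `length`.
    Returns None on success or an error message, C-style.
    """
    for i in range(length):
        if stops[i] < starts[i]:
            return "stops[{0}] < starts[{0}]".format(i)
        out[i] = stops[i] - starts[i]
    return None


def list_counts(starts, stops):
    """Layer-3 style wrapper: allocates the output, then calls the kernel."""
    if len(starts) != len(stops):
        raise ValueError("starts and stops must have the same length")
    out = [0] * len(starts)  # the only allocation happens here
    err = kernel_list_counts(out, starts, stops, len(starts))
    if err is not None:
        raise ValueError(err)
    return out


counts = list_counts([0, 3, 3], [3, 3, 5])
print(counts)  # [3, 0, 2]
```

Keeping the loop in a function that takes only flat buffers and a length is what would let a second implementation (for example, on a GPU) replace the kernel without changing the layer that calls it.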

The Awkward transition

Since Awkward 1.0 is not backward-compatible, existing users of Awkward 0.x will need to update their scripts or only use the new version on new scripts. Awkward 1.0 is already available to early adopters as awkward1 in pip (pip install awkward1 and import awkward1 in Python). When uproot is ready to use the new Awkward Array,

  • it will be released as uproot 4.0,
  • awkward1 will be renamed awkward, and
  • the old Awkward 0.x will be renamed awkward0.

The original Awkward 0.x will be available in perpetuity as awkward0, but only minor bugs will be fixed, and only through the end of 2020. This repository will replace scikit-hep/awkward-array on GitHub.

Normal installation

Normally, you would install Awkward from PyPI using pip

pip install awkward1

to get the latest release of Awkward 1.0 as a precompiled wheel. If a wheel does not exist for your combination of operating system and Python version, the above command attempts to compile from source, downloading any dependencies it needs to do that.

Manually installing from source

If you need to force an installation from source, get it from GitHub via

git clone --recursive https://github.com/scikit-hep/awkward-1.0.git

(note the recursive git-clone; it is required to get C++ dependencies) and compile+install with dependencies via

pip install .

or

pip install .[test,dev]

to perform tests ([test]) on all optional dependencies ([dev]).

Development workflow

If you are developing Awkward Array, manually installing from source will work, but it doesn't cache previously compiled code for rapid recompilation. Instead, get it from GitHub via

git clone --recursive https://github.com/scikit-hep/awkward-1.0.git

(note the recursive git-clone; it is required to get C++ dependencies) and compile+install with dependencies via

python localbuild.py --pytest tests

The optional --pytest tests argument runs the selected tests. See

python localbuild.py --help

for more information. The build is based on CMake; see localbuild.py if you need to run CMake directly.

Continuous integration (CI) and continuous deployment (CD) are hosted by Azure Pipelines:

buildtest-awkward (CI) and deploy-awkward (CD)

Building projects that depend on Awkward

Python projects can simply use awkward1 as a Python library.

C++ projects can link against either the shared libraries libawkward-cpu-kernels.so and libawkward.so or the static libraries, whose names end in -static. All four libraries, as well as their C++ header files, are shipped with the Python library. Even if you installed Awkward Array with pip, you'll have everything you need to build an Awkward C++ program.

If you also want to bind your C++ to Python and share Awkward Arrays between modules in Python, see the dependent-project example. This is a small CMake project bound to Python with pybind11 that can produce and consume Awkward Arrays in Python. Such projects depend on a specific version of Awkward Array, but we intend to stabilize the ABI for more flexibility.

Roadmap

The six-month sprint marathon:

  • September 2019: Set up CI/CD; define jagged array types in C++; pervasive infrastructure like database-style indexing.
  • October 2019: NumPy-compliant slicing; the Numba implementation. Feature parity will be maintained in Numba continuously.
  • November 2019: Fillable arrays to create columnar data; high-level type objects; all list and record types.
  • December 2019: The awkward.Array user interface; behavioral mix-ins, including the string type.
  • January 2020: NEP 13 and NEP 18; the rest of the array nodes: option and union types, indirection.
  • February 2020: The array operations: flattening, padding, concatenating, combinatorics, etc. and array types needed for Uproot and Arrow/Parquet (chunked, virtual, masked, etc.).

Updating dependent libraries:

Most users will see Awkward 1.0 for the first time when uproot 4.0 is released.


Progress is a little behind: operations and Arrow conversions are not done.


Checklist of features for the six-month sprint

Completed items are ☑check-marked. See closed PRs for more details. All remaining items have been assigned an issue and a milestone.

  • Cross-platform, cross-Python version build and deploy process. Regularly deploying 30 wheels after closing each PR.
  • Basic NumpyArray, ListArray, and ListOffsetArray with __getitem__ for int/slice and __iter__ in C++/pybind11 to establish structure and ensure proper reference counting.
  • Introduce Identity as a Pandas-style index to pass through __getitem__.
  • Reproduce all of the above as Numba extensions (make NumpyArray, ListArray, and ListOffsetArray usable in Numba-compiled functions).
  • Error messages with location-of-failure information if the array has an Identity (except in Numba).
  • Fully implement __getitem__ for int/slice/intarray/boolarray/tuple (placeholders for newaxis/ellipsis), with perfect agreement with NumPy basic/advanced indexing, to all levels of depth.
  • Appendable arrays (a distinct phase from readable arrays, when the type is still in flux) to implement awkward.fromiter in C++.
    • Implemented all types but records; tested all primitives and lists.
    • Expose appendable arrays to Numba.
    • Implement appendable records.
    • Test all (tested in mock studies/fillable.py).
  • JSON → Awkward via header-only RapidJSON and awkward.fromiter.
  • Extend __getitem__ to take jagged arrays of integers and booleans (same behavior as old; issue #67).
  • Full suite of array types:
    • EmptyArray: 1-dimensional array with length 0 and unknown type (result of UnknownFillable, compatible with all types of arrays).
    • RawArray: flat, 1-dimensional array type for pure C++ (header-only).
    • NumpyArray: rectilinear, N-dimensional array type without Python/pybind11 dependencies, but intended for NumPy.
    • ListArray: the new JaggedArray, based on starts and stops (i.e. fully general).
    • ListOffsetArray: the JaggedArray case with no unreachable data between reachable data (gaps).
    • RegularArray: for building rectilinear, N-dimensional arrays of arbitrary contents, e.g. putting jagged dimensions inside fixed dimensions.
    • RecordArray: the new Table without lazy-slicing.
    • IndexedArray and IndexedOptionArray: the old IndexedArray and IndexedMaskedArray; the latter has option type.
    • UnionArray (issue #54): same as the old version.
    • BitMaskedArray (issue #58): for nullable data with a bit mask (for Arrow).
    • UnmaskedArray (issue #59): for optional type without actually having a mask.
    • ChunkedArray (issue #56): same as the old version, except that the type is a union if chunks conflict, not an error, and knowledge of all chunk sizes is always required. Also, this will only be available on the top of a hierarchy (without nesting).
    • VirtualArray (issue #57): same as old VirtualArray.
  • Describe high-level types using datashape and possibly also an in-house schema. (Emit datashape strings from C++.)
  • Translation to and from Apache Arrow and Parquet in C++ (issue #68).
  • Layer 1 interface Array:
    • Pass through to the layout classes in Python and Numba.
    • Pass through NumPy ufuncs using NEP 13 (as before; issue #60).
    • Pass through other NumPy functions using NEP 18 (this would be new; issue #61).
    • RecordArray fields (not called "columns" anymore) through Layer 1 __getattr__ (issue #62).
    • Special Layer 1 Record type for RecordArray elements, supporting some methods and a visual representation based on Identity if available, all fields if recordtype == "tuple", or the first field otherwise.
    • Mechanism for adding user-defined Methods like LorentzVector, as before, but only on Layer 1.
      • High-level classes for characters and strings.
    • Inherit from Pandas so that all Layer 1 arrays can be DataFrame columns (issue #63).
  • Full suite of operations:
    • awkward.tolist: same as before.
    • awkward.fromiter: same as before.
    • awkward.typeof: reports the high-level type (accepting some non-awkward objects).
    • awkward.tonumpy (issue #65): to force conversion to NumPy, if possible. Neither Layer 1 nor Layer 2 will have an __array__ method; in the NumPy sense, they are not "array-like" or "array-compatible."
    • awkward.flatpandas (issue #80): flattening jaggedness into MultiIndex rows and nested records into MultiIndex columns. This is distinct from the arrays' inheritance from Pandas, which is what lets any one of them be used directly as a DataFrame column.
    • awkward.flatten: same as old with an axis parameter (issue #51).
    • Reducers, such as awkward.sum, awkward.max, etc., supporting an axis parameter (issue #69).
    • The non-reducers: awkward.moment, awkward.mean, awkward.var, awkward.std (addendum to issue #69).
    • awkward.argmin, awkward.argmax (issue #70): return values and None instead of singleton and empty lists.
    • awkward.argsort, and awkward.sort (issue #74): same as old.
    • awkward.where (issue #75): like numpy.where; old doesn't have this yet, but we'll need it.
    • awkward.concatenate (issue #76): same as old, but supporting axis at any depth.
    • awkward.zip (issue #77): makes jagged tables; this is a naive version of awkward.join below.
    • awkward.pad (issue #73): same as old, but without the clip option (use slicing instead).
    • awkward.fillna (issue #72): same as old.
    • awkward.cross (and awkward.argcross, issue #78): to make combinations by cross-joining multiple arrays; option to use Identity index.
    • awkward.choose (and awkward.argchoose, issue #79): to make combinations by choosing a fixed number from a single array; option to use Identity index and an option to include same-object combinations.
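
The difference between ListArray and ListOffsetArray in the checklist above can be made concrete with plain Python lists. This is an illustrative sketch, not Awkward Array code: `from_offsets` and `from_starts_stops` are hypothetical helpers, and the real classes are implemented in C++. It shows that an offsets array describes contiguous lists with no gaps between them, while independent starts and stops can skip, reorder, or overlap regions of the same content.

```python
# Plain-Python sketch of the two jagged layouts; these helper names are
# illustrative and are not Awkward Array's API.

content = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]

# ListOffsetArray style: list i spans content[offsets[i]:offsets[i + 1]],
# so every element between offsets[0] and offsets[-1] is reachable.
offsets = [0, 3, 3, 5]


def from_offsets(content, offsets):
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# ListArray style: independent starts and stops, so lists may skip
# elements, appear out of order, or overlap.
starts = [3, 0, 5]
stops = [5, 3, 5]


def from_starts_stops(content, starts, stops):
    return [content[a:b] for a, b in zip(starts, stops)]


jagged_offsets = from_offsets(content, offsets)
jagged_general = from_starts_stops(content, starts, stops)

# Any offsets array can be rewritten as starts/stops, but not the reverse.
same_as_offsets = from_starts_stops(content, offsets[:-1], offsets[1:])

print(jagged_offsets)  # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
print(jagged_general)  # [[4.4, 5.5], [1.1, 2.2, 3.3], []]
```

In this picture, reordering or filtering a ListArray only rewrites starts and stops and leaves content untouched, which is why the starts/stops form is the fully general one.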

Soon after the six-month sprint

  • Update hepvector to be Derived classes, replacing the TLorentzVectorArray in uproot-methods.
  • Update uproot (on a branch) to use Awkward 1.0.
  • Start the awkward → awkward0, awkward1 → awkward transition.

Thereafter

  • GPU implementations of the cpu-kernels in Layer 4, with the Layer 3 C++ passing a "device" variable at every level of the layout to indicate whether the data pointers refer to main memory or a particular GPU.
  • CPU-acceleration of the cpu-kernels using vectorization and other tricks.
  • Explicit interface with NumExpr.
  • Explicit interface with Dask.
  • Demonstrate Awkward 1.0 as a C++ wrapping library with FastJet.
  • Deferred array types:
    • ByteMaskedArray: for nullable data with a byte mask (for NumPy).
    • RedirectArray: an explicit weak-reference to another part of the structure (no hard-linked cycles). Often used with an IndexedArray.
    • SparseUnionArray: the additional UnionArray case found in Apache Arrow.
    • SparseArray: same as the old version, but now we need a good lookup mechanism.
    • RegularChunkedArray: like a ChunkedArray, but all chunks are known to have the same size.
    • AmorphousChunkedArray: a ChunkedArray without known chunk lengths (maybe not ever).
    • UnionArray in Numba. Have to somehow deal with heterogeneity.
    • Strings in Numba (issue #124).
  • Deferred operations:
    • awkward.join: performs an inner join of multiple arrays; requires Identity. Because the Identity is a surrogate index, this is effectively a per-event intersection, zipping all fields.
    • awkward.union: performs an outer join of multiple arrays; requires Identity. Because the Identity is a surrogate index, this is effectively a per-event union, zipping fields where possible.
  • Persistence to any medium that stores named binary blobs, as before, but accessible via C++ (especially for writing). The persistence format might differ slightly from the existing one (break backward compatibility, if needed).
  • Describe mid-level "persistence types" with no lengths, somewhat minimal JSON, optional dtypes/compression.
  • Describe low-level layouts independently of filled arrays (JSON or something)?
  • Universal array.get[...] as a softer form of array[...] that inserts None for non-existent indexes, rather than raising errors.
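
One possible reading of the "softer" indexing in the last item can be sketched on a plain Python list of lists. `soft_get` is a hypothetical helper, not an Awkward Array function, and the eventual array.get[...] interface may behave differently. The sketch picks one position from every inner list and fills in None where that position does not exist, instead of raising an error.

```python
# Hypothetical sketch of "soft" indexing on a plain Python list of lists.
# `soft_get` is an illustrative name, not an Awkward Array function.

def soft_get(lists, index):
    """Return sub[index] for each inner list, or None where it is missing."""
    out = []
    for sub in lists:
        if -len(sub) <= index < len(sub):
            out.append(sub[index])
        else:
            out.append(None)
    return out


events = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

first = soft_get(events, 0)
third = soft_get(events, 2)

print(first)  # [1.1, None, 4.4]
print(third)  # [3.3, None, None]
```

The output has the same length as the input, so it stays aligned with other per-event quantities even when some events have too few entries.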