- Introduction
- Features provided by an awkward-array library
- General properties of all array classes
- Properties and methods of all arrays
- Reducer methods on all arrays
- Global switches and types
- Jaggedness
- Product types
- Sum types
- Option types
- Indirection
- Opaque objects
- Non-contiguousness
- Laziness
- Row-wise to columnar conversion
- Serialization
Array programming is a general programming paradigm, similar to functional programming or object-oriented programming in its scope. Pioneered by APL in 1962, this paradigm has appeared primarily in interactive data processing languages aimed at data analysts and statisticians: S (1976), MATLAB (1984), S-PLUS (1988), R (1993), and Numpy (2005). Numpy, unlike its predecessors, is not a language in itself, but a foundational library for science and statistics in Python.
Array programming is mostly used to solve problems with regular, rectangular grids. It is possible to represent and operate on more complex data structures, but it is awkward to do so in current frameworks. Fortunately, this awkwardness may be hidden in a suite of higher-level array classes that present modest extensions of the array programming paradigm to the user.
awkward-array (awkward on PyPI) is an implementation of complex data structures and operations in Python, depending only on Numpy. Drop-in replacements optimized with compiled code or leveraging GPUs are foreseen, though they would present the same user interface as the Numpy-only implementation. Extensions for using awkward-array as Pandas columns, in Numba-compiled functions, and in distributed calculations with Dask are also foreseen. These arrays can already be used to view Arrow data and can be persisted and reconstituted in any data format supporting Numpy arrays, such as HDF5, or named binary blobs, such as ZIP or key-value stores.
As an array programming environment for general data structures, awkward-array represents arbitrary data in a columnar form (compatible with and a generalization of Arrow). Most operations are vectorized or vectorizable because they are applied to flat, homogeneous arrays in memory masquerading as the nested and variable-length data structures that are presented to the user. This allows for memory and computational efficiency that are hard to match with implementations on traditional “row-wise” data, as well as automatic porting to pure vector coprocessors, such as GPUs. The key is a random access, composable representation of jagged arrays in terms of unstructured Numpy arrays.
Despite awkward-array’s focus on Python and Numpy, the data model it represents is language-agnostic. This document describes awkward-array as a protocol for generalizing array programming to complex structures in any environment. It is a normative document in the sense that it describes what awkward-array in Python should do; discrepancies between this document and the Python implementation would usually be decided in favor of the document (though the document itself may have “bugs” to be fixed in a later version). It is also normative in the sense that other implementations, in C++ or MATLAB for instance, that claim to “adhere to the awkward-array specification” should also behave as described here.
The goal of this document is to describe an array programming interface that extends a basic array programming interface. The extended interface is called “awkward-array” and the basic interface is a well-known starting point, such as Numpy. For simplicity, we will use Python syntax and Numpy function names, but a library in a different language or based on a different array library may still adhere to the awkward-array specification by providing a translation of these idioms.
It is not a goal of this document to explain how the awkward-array features are implemented. Drop-in replacements for the Numpy-only awkward-array library and libraries in other environments are free to implement the specification differently. They will thus have different performance characteristics. This specification cannot provide any performance hints.
Documentation is not an explicit goal of this specification, though users may read it to learn how to use any compliant implementation. Unlike documentation, however, this document describes what an awkward-array library should do, not necessarily what it does. Users are encouraged to file issues here if a discrepancy is found in the Numpy-only implementation.
The major and minor version numbers of this specification, xx.yy
, correspond to the major and minor version numbers of the Numpy-only implementation: xx.yy
in xx.yy.zz
. The revision number, zz
, increases as the implementation strives toward compliance with the specification. This particular version of the specification is a pre-release; compliance is only required for released specifications.
A library may implement more features than those described in this specification without affecting its compliance. If it implements fewer classes or fewer methods, it can only be said to be “partially compliant,” and would have to enumerate the missing classes or methods.
Necessary features of the underlying array library are described in this section for two reasons: to establish a syntax used in the rest of this document and as a basis for translating idioms of non-Numpy libraries into the language of this document. For instance, another language may use a function to extract slices of an array, rather than Python’s array[start:stop:step]
syntax. If so, the awkward-array idiom in that language would use the same function name and arguments, rather than the Python ones.
The underlying array library must have the following features or some translation thereof.
It must be possible to represent ordered sequences of primitive values, known as arrays. An array is homogeneous: all values in an array have the same dtype, or primitive data type. Primitive types include integers, floating point numbers, and boolean (true/false) values. An awkward-array library inherits the primitives of the underlying library — if a base array library doesn’t support complex numbers (for instance), then neither does its awkward-array extension. Arrays need not be fixed-size, contiguous in memory, with only fixed-bytesize primitives, but these are the simplest cases.
It must be possible to construct N-dimensional arrays for any positive integer N. We will refer to an array’s size in all dimensions as its shape, a tuple of positive integers whose product is the total number of elements in the array. The length of an array is the array’s size in its first dimension.
t must be possible to extract an array element by integer index or a tuple of N integer indexes for an N-dimensional array. In this document, we will use 0-indexing: 0
extracts the first element, 1
extracts the second, etc. If an index is greater than or equal to N, it is out of bounds. An array library with 1-indexing (1
extracts the first element) would correspondingly have 1-indexing in its awkward-array extensions. We will use a number between square brackets after the array, as in myarray[5]
, or a comma-separated tuple like myarray[3, 1, 4]
, to represent extraction.
The array library may provide a mechanism for counting from the end of an array dimension. In Python, -1
refers to the last element, -2
to the second-to-last element, etc. We will use that convention. Given a negative index i
and a dimension size d
, the standard index is i + d
. If the standard index is still negative, the index is out of bounds.
It must be possible to slice an array dimension to get a subsequence of that array along any dimension. A slice is defined by a start index (lower bound element number), a stop index (upper bound element number), and an optional step (stride length, for skipping a regular number of input elements between each returned element). In this document, we will use inclusive start values (the lower bound is included in the resulting subsequence) and exclusive stop values (the upper bound is not included). A step of 1
is equivalent to no step, and the step must not be 0
, though negative values may be allowed (to reverse order). If either start or stop is not provided, they may be assumed to be 0
and the size of the dimension (respectively) if step is positive, or one less than the size of the dimension and one less than 0
(respectively) if step is negative. If the difference between step and start is not an integer multiple of step, we take that the subsequence to be truncated at the last element before stop. If either start or step are beyond the bounds of the array, we take them to be truncated to the nearest legal values, which may result in an empty subsequence, but not an error. If negative indexes are allowed for element extraction, they may be allowed for slicing as well. The Python syntax for this operation is myarray[start:stop:step]
(in which any start
, stop
, or step
may be missing or None
, and the second colon may be omitted if there is no step
). We will use this syntax in this document.
It must be possible to mask elements in an array dimension by a 1-dimensional boolean array of the same size as that array dimension. The result of such an operation is a sequence in the same order containing only the elements for which the matching boolean value is true. The Numpy syntax for this operation is to put the boolean mymask
array in square brackets after the array: myarray[mymask]
, but it may be a named function call.
It must be possible to gather elements in an array dimension by a 1-dimensional integer array, using the integer array as extraction indexes. The result of such an operation, denoted myarray[myindexes]
, is a sequence with the same length and order as the indexing array myindexes
, containing elements from myarray
. The same rules apply to the elements of the indexing array as for single-element extraction. (In Numpy, this is sometimes called “fancy indexing,” though sometimes that term encompasses masking as well, so we will use “gather” in this document, as this is what the operation is called in SIMD programming.) As with masking, this may be a named function call.
It must either be possible to apply selections to multiple dimensions in a single call or to apply a selection to a specified dimension, not necessarily the first. For instance, we could extract from the first dimension, slice the second, mask the third, and gather the fourth in Numpy by separating requests with commas: myarray[5, start:stop:step, mymask, myindexes]
. Selecting in multiple dimensions would allow selection in a specified dimension by passing all-inclusive slices to all dimensions before the dimension of interest: myarray[:, :, :, selection]
. Selecting a specified dimension would allow selecting multiple dimensions by composition, so either is sufficient.
It must be possible to map arithmetic operations across all elements of one or more arrays. Any kernel function of n primitive type arguments returning a primitive type result can be applied to n equal-shape arrays and return a single new array of results with the same shape. The kernel function must be pure (no side effects), and many would be expressible as special syntax, such as +
for addition, -
for subtraction, etc. In Numpy, these are called “universal functions” or “ufuncs,” but this is such a specific protocol that we use a more general word, mapped kernels.
If any arguments in a mapped kernel have a scalar primitive type, rather than an array, they should be replaced by a constant array of the correct shape before mapping the kernel. If an argument has the correct dimensionality but some of its dimensions have size 1
where the other arguments have a size greater than 1
, this dimension should be similarly expanded to a constant before mapping. These expansions do not need to be literal — the result is calculated as though the scalar or singleton dimension were a constant array. This conceptual expansion is known as broadcasting in Numpy and in this document.
It must be possible to reduce an array by a binary arithmetic operation along a given dimension. The array is reduced in dimension by one; 1-dimensional arrays are reduced to primitive scalars. Empty dimensions or arrays may be reduced to the operation’s identity if the operation has an identity — it must return an error otherwise. The identity for addition is 0
, multiplication is 1
, and we may take the identity for minimization and maximization to be the largest and smallest values available in the primitive data type, respectively. For instance, the minimum of an empty array of floating point numbers may be taken to be infinity.
Any array library supporting these basic features may be extended as specified in this document.
An awkward-array library provides the above features in the following new contexts.
-
Jaggedness: multidimensional arrays with a non-constant shape. A jagged array is an array of arbitrary-length lists.
-
Product types: extend a primitive type system with record structures — objects with named, typed fields.
-
Sum types: extend a primitive type system with tagged unions — values that may have one of a set of enumerated types. This permits arrays to be heterogeneous in a controlled way.
-
Option types: extend a primitive type system with a nullable type — values that may be “missing.”
-
Cross-references and cyclic references: extend a primitive type system with values that may be references to another array, including a parent of the array in which they reside. This adds “pointers” to the type system in a controlled way: references must point to elements of a specified array.
-
Opaque object types: allow array elements to virtually contain instances of any type expressible in a programming language, provided that it can be constructed strictly from elements of the array.
-
Object methods: adds user-defined methods to arrays, usually to emulate object methods as mapped kernels.
-
Indirection: allows arrays to be defined as a cherry-picked subset of other arrays.
-
Sparseness: allows arrays to be defined at a minority of their index positions.
-
Non-contiguousness: allows arrays to be non-contiguous in memory by mapping indexes. This virtually concatenates data from separate chunks into a single logical array without copying.
-
Laziness: allows arrays to be loaded or generated on demand, allowing arrays that have not yet been materialized to be treated on the same footing with arrays that have.
Jaggedness, product types, sum types, option types, and references extend the expressivity of basic arrays to a complete, hierarchical data model. General data containers like Protocol buffers, Thrift, Avro, and Parquet present this data model, with the exception of references.
Object types and methods generalize it further, allowing any type permitted by a programming language such as Python, with a loss of cross-language compatibility.
Indirection, sparseness, non-contiguousness, and laziness do not affect data type: they are purely low-level features.
Taken together, these features promote array programming to a wider set of applications.
To describe how awkward-array extends basic array types, we start by defining a notation that encompasses both. Basic arrays can be fully described by their dtype and shape. These parameters are not sufficient for awkward-array.
Any array, including those in awkward-array, can be thought of as a function that maps extraction indexes to values. The functional type of a basic, 1-dimensional array with value type T
and length n
could be written as
[0, n) -> T
That is, the array is a function that takes an integer greater than or equal to 0
and less than n
as its only argument, and returns a value of type T
. The possible value types are the primitive types of the basic array library. Knowing the array’s type signature as an extraction function is enough to deduce its behavior in slicing, masking, and gathering.
A 2-dimensional array with shape (n, m)
is a function that returns a function.
[0, n) -> [0, m) -> T
That is, if we pass an integer [0, n)
to the 2-dimensional array, we get a 1-dimensional array; if we pass an integer [0, m)
to that 1-dimensional array, we get a primitive value of type T
. This currying can be applied indefinitely to describe arrays of any dimensionality. The shape in a tuple syntax like Numpy’s is more concise, but we will need the longer form.
Jagged arrays are like multidimensional arrays in the number of integer arguments that must be passed before obtaining a scalar primitive type, but not all of the arguments have precise domains. A simple jagged array of length n
, only one level of jaggedness, and primitive type T
would be expressed as
[0, n) -> [0, inf) -> T
because the second argument may be any non-negative integer. Unlike a basic array, some values for this second parameter, allowed by the above expression, would be rejected in practice. For example, the array
[[1, 2, 3], [], [4, 5]]
would accept only [0, 3)
as a second argument if the first argument is 0
, would accept nothing (empty domain) if the first argument is 1
, and would accept only [0, 2)
if the first argument is 2
. We have a choice between expressing the type signature fully, such that any arguments satisfying that signature would never fail to return a primitive value, and underexpressing it with [0, inf)
, which has the advantage that the length of the type signature does not scale with the length of the array itself. Brevity is more useful for our purposes.
In this document, we refer to a full listing of the sizes of subarrays as a jagged array’s jagged structure. Some operations on two or more jagged arrays can only be performed if they have the same jagged structure.
Continuing this line of reasoning, a doubly jagged array, such as
[[[1, 2, 3], []], [[4, 5]], []]
would have type
[0, n) -> [0, inf) -> [0, inf) -> T
Product types are variously known as “records,” “structs,” “compounds,” “classes,” etc. They are values that contain a fixed set of named, typed attributes. They are “products” in the sense of Cartesian products: the set of records containing a floating point x
and an integer i
is the Cartesian product of the set of floating point values and the set of integer values.
To extract an attribute from a record, we give the record the attribute’s name. In a dynamically typed language, this amounts to passing a string argument; in a statically typed language, this string is usually a parsed, checked, compile-time literal. In either case, we can express it as an extraction function like
'x' -> float64
'i' -> int64
The names are enumerated because there is a fixed set of choices, and each choice returns a potentially different type. An array of length n
containing these records is
[0, n) -> 'x' -> float64
'i' -> int64
In general, the form is a sequence of name, type specification pairs called fields that can be embedded in a type specification. Product types with the same set of fields, in any order, are equivalent, though an awkward-array library should maintain the user-specified order for readability.
Jagged arrays, basic arrays, and record structure may be freely intermixed. An awkward-array type with these elements is, in general, a tree:
[0, n) -> [0, inf) -> 'one' -> bool
'two' -> [0, inf) -> int64
'three' -> 'x' -> [0, inf) -> float64
'y' -> complex128
Extraction indexes (integers) and field names (strings) can be commuted (swapped in order). Extraction indexes do not commute with other extraction indexes, as this would violate dimension order, and field names do not commute with other field names, as this would violate records of different nesting depths, but extraction indexes commute with field names. Reversing an integer index and a string name amounts to selecting column before row or row before column in a table.
Ignoring this distinction hides the distinction between an array of structs and a struct of arrays, so that array manipulation code does not depend on this difference (just as Numpy hides C vs. Fortran order as an internal array flag). For instance,
[0, n) -> 'x' -> [0, m) -> T1
'y' -> [0, m) -> T2
is equivalent to
[0, n) -> [0, m) -> 'x' -> T1
'y' -> T2
because all fields in the record have the same array dimension m
, this array dimension can be commuted toward the root. The second form is considered canonical: extraction indexes should be commuted toward the root to reduce redundancy.
The same commutation is possible for jagged dimensions (perhaps surprisingly). A jagged array of records is equivalent to a record of jagged arrays, all with the same jagged structure. If we had a detailed type schema that encoded this jagged structure, rather than hiding it under the symbol [0, inf)
, we could perform the same commutation on jagged arrays as on regular arrays. However, as a limitation of this notation,
[0, n) -> 'x' -> [0, inf) -> T1
'y' -> [0, inf) -> T2
can’t be commuted to
[0, n) -> [0, inf) -> 'x' -> T1
'y' -> T2
because we do not know if fields x
and y
have the same jagged structure. (A record with a single field can be commuted.)
Sum types are known as “tagged unions” in programming languages that support them. They are values that may be any one of an enumerated set of types. They are “sums” in the sense that they are dual to product types: if a record type with two fields of types T1 and T2 is denoted T1 * T2, a union type that can be T1 or T2 is T1 + T2, obeying a distributive law. Unions are useful for building heterogeneous arrays in a controlled way. Class inheritance in object-oriented programming is a limited case of sum typing.
We delimit the enumerated types in a sum type with a vertical bar. Consider, for instance, a value that may be a floating point number or a jagged list of booleans.
float64 |
[0, inf) -> bool
These may be embedded in any type specification.
[0, n) -> (float64 |
[0, inf) -> bool )
Parentheses are required because the fields of a product type (denoted by adjacency) have higher operator precedence than the enumerations of a sum type. For instance,
[0, n) -> ('x' -> float64
'y' -> float64
'z' -> float64 |
[0, inf) -> bool )
for unions of x, y, z records with jagged arrays of booleans, and
[0, n) -> ('x' -> float64
'y' -> float64
'z' -> float64 |
'x' -> int64
'y' -> int64 )
for unions of x, y, z floating point records with x, y integer records.
Sum types with the same set of enumerated types, in any order, are equivalent, though an awkward-array library should maintain the user-specified order for readability and an unambiguous type resolution order. (A value may be a member of more than one of the enumerated types.)
Sum types may be nested, though they are equivalent to their flattened form. For instance,
[0, n) -> (bool |
int64 |
(float64 |
complex128 ) )
is equivalent to
[0, n) -> (bool |
int64 |
float64 |
complex128 )
An important special case of sum types is to describe missing data with a null
, None
, na
, or NaN
token. This could be expressed as a union of the non-missing data type with a unit type for None
, but building these constructions manually (as in Avro) becomes unwieldy when most data can be missing (is “nullable”) and it forces us to consider an array that can only be filled with None
as a legitimate array type. Instead, we introduce another element: an option type.
Option types have one parameter, the non-missing type T
, and indicate that values may be missing. When extracting data from such an array, the values might have type T
or might be None
(in Python). We denote this with a question mark and parentheses; for example,
[0, n) -> ?(float64)
or
[0, n) -> [0, inf) -> ?([0, 3) -> int64)
or
[0, n) -> ?('x' -> float64
'y' -> float64)
Option types may be nested, though they are equivalent to their flattened form. For instance,
[0, n) -> ?(?(float64))
is equivalent to
[0, n) -> ?(float64)
Note that Numpy has a numpy.ma.MaskedArray
type, but awkward-array has its own awkward.MaskedArray
. The awkward-array masked type can contain non-basic arrays (jagged arrays, tables, etc.) and uses None
to represent missing values, rather than numpy.ma.masked
, an object with surprising properties.
Option types of sum types introduces another equivalence: a masked union is equivalent to a union in which any one (or more than one) of the union’s enumerated types is masked. For example, [1, "two", None]
is a union of integers and strings and is masked. This may be
[0, 3) -> ?(int64 |
<class 'str'> )
- or -
[0, 3) -> (?int64 |
<class 'str'> )
- or -
[0, 3) -> (int64 |
?<class 'str'> )
- or -
[0, 3) -> (?int64 |
?<class 'str'> )
The canonical form is the first, with the option type outside the union type, and no option types on any of the union’s enumerated types.
As described above, an awkward-array type can be a tree of primitive types, basic arrays, jagged arrays, product types, sum types, and option types. An arbitrary tree of types is powerful (see Protocol buffers, Thrift, Avro, and Parquet), but a given type tree sets an upper limit on the depth of values that are members of that type. There would be, for instance, no type specification for JSON, or even the subset of JSON corresponding to “lists that contain numbers or other lists.”
This constraint can be lifted by allowing the type specification to contain references to other parts of itself. We represent these references by assigning a subexpression with :=
and then referring to that subexpression by name, rather than rewriting the subexpression. If the reference is not contained within the subexpression it references, it is a cross-reference and this notation reduces verbosity.
In the following example of a cross-reference, a dataset contains x, y, z data points and moving windows representing contiguous ranges of these data points. The window has access to data beyond its own array element, but it does not have access to anything beyond the data
field.
[0, n) -> 'data' -> T0 := 'x' -> float64
'y' -> float64
'z' -> float64
'window' -> [0, inf) -> T0
If the reference is contained within the subexpression it references, it is a cyclic reference and the assignment notation avoids infinite recursion.
For example, consider lists that contain numbers or other lists, such as the value below.
[[1.1, [2.2, [3.3, 4.4, []]]]]
Lists with this depth and no greater could be expressed as belonging to a sum type of jagged arrays nested four levels deep. Lists like the above at any depth can be expressed as belonging to
[0, n) -> T0 := (float64 |
[0, inf) -> T0 )
The subexpression T0
represents floating point numbers or jagged arrays of T0
. Trees of any depth belong to this type, as do cyclic graphs. With these elements, we can represent a wide variety of data structures, as general as most programming languages.
Data constructed from primitives, basic arrays, jagged arrays, product types, sum types, option types, cross-references, and cyclic references are called pure constructions. They are entirely expressible in terms of basic arrays, which are themselves portable across environments. However, if a programming task requires special types, such as instances with particular methods or inheritance from a third-party interface, then pure constructions would have to be wrapped.
Wrapped data are represented in the type system as opaque types, types outside of awkward-array’s system. For instance, we may emulate an array of opaque objects by constructing these objects from an awkward-array upon extraction. This opaque constructor, which may be a helper function rather than a class’s built-in constructor, is included in the type specification to represent such a case.
For example, an array of Python strings would be
[0, n) -> <class 'str'>
with a jagged array of character primitives hidden behind the string constructor.
Opaque objects cannot be shared across platforms the way that pure constructions can. Gaining flexibility in one way diminishes it in another.
In awkward-array, the above types are represented by the following classes in awkward.type
:
-
Type
is an abstract base class for type objects. Types may beTypes
, Numpy dtypes, or Python callables. -
ArrayType(n, m, ..., to)
constructs a linear sequence like[0, n) -> [0, m) -> ... -> to
. At least two arguments are required, and all arguments but the last must be positive integers or infinity (which is floating point) or a string, to make a single-field table. The last argument must be a basic array dtype (for a primitive type) or a Python callable (for an opaque type).ArrayType
objects are recursively linked lists: each instance has two members,takes
andto
, wheretakes
is a single number andto
is the rest of the sequence. -
TableType(**fields)
constructs a product type from a mapping of field names to field types. The preferred way to construct aTableType
is through the&
operator, which glues single-field tables into a multi-field table. TheTableType
has get-item and set-item methods for manipulation as a Python dict. -
UnionType(*possibilities)
constructs a sum type from a sequence of types. The preferred way to construct aUnionType
is through the|
operator. TheUnionType
has get-item and set-item methods, as well asappend
, for manipulation as a list. -
OptionType(type)
constructs an option type from a type parameter. It has atype
attribute. -
There are no special classes for cross-references and cyclic references. Build an internally referenced type object to represent an internally referenced type. All
Type
methods except for those that attempt to yield equivalent Numpy types (dtype
andshape
) should be recursion-safe.
This module has two functions:
-
fromarray(array)
returns the type specification of any array, whether an awkward-array or a basic (Numpy) array. -
fromnumpy(shape, dtype, masked=False)
returns the type specification of a basic (Numpy) array directly from shape, dtype, and masked parameters.
Array classes are defined in the awkward.array
submodule but are all accessible directly from the top-level awkward
module. They all have the following properties.
-
They have primary constructors that define the array in terms of the most general components; for example,
starts
,stops
, andcontent
forJaggedArray
andtags
,offsets
, andcontents
forUnionArray
. -
They have class-method constructors with more convenient constructors, such as
JaggedArray.fromoffsets(offsets, content)
andUnionArray.fromtags(tags, contents)
. -
Primary constructor arguments are read-write properties of the array object.
-
Single-property validity tests, which verify conditions that are a function of only one property, are performed in the constructor and upon assignment. Errors raise exceptions.
-
Whole-array validity tests, which verify conditions that are relationships among properties in an array, are performed just before array evaluation. Errors raise exceptions.
-
All arrays have get-item methods that perform extraction, slicing, masking, and gathering. If the array contains a product type, a string selects a column and a sequence of strings selects multiple columns.
-
String indexes commute with extraction/slicing/masking/gathering, but they can’t be used in the same set of square brackets, the way extraction/slicing/masking/gathering can.
-
All arrays have a set-item method to add columns to a
Table
(product type), but only for this purpose. Elements of an awkward-array cannot be changed in place without digging down to the basic array (and that can have hard-to-predict consequences). -
All arrays have a length (for
len
) and Pythonic iteration (forfor
loops). -
String representations (
str
andrepr
in Python) show logical content in square brackets without commas (spaces only), limited to 6 elements at each level of depth (filling in missing with three dots:...
). In Python, therepr
representation includes the top-most class name butstr
does not. String representations trigger whole-array validity tests. -
All arrays can be pickled, which invokes the full serialization protocol described at the end of this document.
All arrays have the following read-only properties and methods.
-
type
: a representation of the array type. -
dtype
andshape
: attempt to express the array as a basic (Numpy) array, which can involve information loss. Raises an error if the array cannot be expressed in Numpy’s terms. -
columns
: a list of field names for the shallowestTable
within the array that are valid Python identifiers. Theallcolumns
read-only property is not restricted to valid Python identifiers. -
tolist()
: returns a Pythonic representation of the data (also valid JSON, if all floating point values are finite). Nested arrays become nested Python lists andTable.Rows
become dicts of name, field value pairs. -
valid()
: tests whole-array validity without raising an exception; returnsTrue
orFalse
. -
copy(...)
: copies an array object without recursively copying array content. Any explicitly passed arguments (none are required, but otherwise same arguments as the array’s constructor) are used instead of copying the array’s properties. Passing all arguments is equivalent to calling the class’s primary constructor. -
deepcopy(...)
: like the above except that all array content is recursively copied. -
empty_like(...)
,zeros_like(...)
, andones_like(...)
: create arrays with the same structure but uninitialized, zeros, or ones for content. -
astype(dtype)
: return a copy of the whole structure with basic arrays at the deepest level converted todtype
.
Reducers convert arrays into scalars, reducing the dimensionality by one. In awkward-array, this applies to the innermost dimension (like axis = len(array.shape) - 1
in Numpy) and it removes one level of jaggedness (the innermost) from jagged arrays. They are all mask-aware and treat NaN
values in floating point types as though they were masked.
-
any()
: returnsTrue
if any values are non-masked and non-zero. Arrays without non-masked values yieldFalse
. -
all()
: returnsTrue
if all non-masked values are non-zero. Arrays without non-masked values yieldTrue
. -
count()
: returns the number of non-masked values. -
count_nonzero()
: returns the number of non-masked, non-zero values. -
sum()
: returns the sum of all non-masked values or0
if there are no non-masked values. -
prod()
: returns the product of all non-masked values or1
if there are no non-masked values. -
min()
: returns the minimum of non-masked values. Arrays without non-masked values yieldinf
for floating point types and the maximum integer value for integer types. -
max()
: returns the maximum of non-masked values. Arrays without non-masked values yield-inf
for floating point types and the minimum integer value for integer types.
The methods argmin()
and argmax()
are not reducers: they return an array of zero or one elements: zero if there are no non-masked values, a single index position of the non-masked value that minimizes or maximizes the array otherwise. Data in this form is usable as a gather index. When applied to another array, it returns an empty array or a singleton, depending on the length of the argmin()
or argmax()
.
The awkward.array.base.AwkwardArray
abstract base class has switches to control global behavior. They may be set at any level: on the base class to affect all awkward-array types, all instances, or on a concrete class for all instances of that class, or on a single instance. In languages other than Python, an alternative mechanism may be substituted.
-
allow_tonumpy
default:True
, ifFalse
, any operation that would convert an awkward-array type into a basic (Numpy) array instead raises aRuntimeError
. -
allow_iter
default:True
, ifFalse
, any attempt to iterate over an awkward-array in Python (except viastr
orrepr
) would raise aRuntimeError
. -
check_prop_valid
default:True
, ifFalse
, skip single-property validity checks. -
check_whole_valid
default:True
, ifFalse
, skip whole-array validity checks.
The default primitive types are also defined on awkward.array.base.AwkwardArray
.
Awkward type | Numpy dtype | Purpose |
---|---|---|
|
|
default array content |
|
|
array content for byte-granularity arrays |
|
|
default type for indexes (signed so that subtraction does not change type) |
|
|
default type for tagged union tags |
|
|
type for byte masks |
|
|
type for bit masks |
|
|
type for boolean data |
Jagged arrays have a logical structure that is independent of how they are represented in memory, but since awkward-array defines this structure in terms of a basic array library (Numpy), the structure we choose is a visible part of the awkward-array specification. This section presents many ways to represent jagged arrays, their advantages and disadvantages, before specifying the JaggedArray
class itself. The JaggedArray
class uses the most general representation internally with conversions to and from the other forms.
One natural way to represent a jagged array is to introduce markers in the serialized content where each variable-length nested list begins or ends, or to insert nested list sizes before each nested list (as in the Avro protocol) to avoid having to distinguish content values from markers. However, this “row-wise” representation interrupts vectorized processing of the content. Another natural way is to create an array of pointers to nested lists, like Numpy’s object array, but this is even worse because it additionally increases memory latency.
Columnar representations keep the contents of the nested lists in a single, contiguous array (a “column”). The ROOT file format was probably the first columnar representation of jagged arrays (1995), though the intention was for efficient packing and compression on disk, rather than processing in memory. However, the columnar arrays of a ROOT file may be transplanted into memory for efficient computation as well. The Parquet file format (2013) has a different columnar representation of jagged arrays, though it modifies (“shreds”) the data in a way that is hard to use without fully restructuring it. The Arrow format (2016) uses one of the methods described below to perform efficient calculations on data in memory.
The simplest way to represent a jagged array with columnar arrays is to store flattened content in one array and counts of the number of elements in each interior list in another array. The starting and stopping index of one element — an interior list — can unambiguously be determined by summing counts up to the element of interest. This operation is O(N) in array length N, unfortunately. It is, however, composable, in that nested lists of nested lists (and so on) can be constructed by setting one jagged array as the content of another. For example, to represent the following nested structure:
[[], [[1.1, 2.2, 3.3], [], [4.4, 5.5]], [[6.6, 7.7], [8.8]]]
we note that the first level of depth contains lists of length 0
, length 3
, and length 2
. Inside that (and ignoring boundaries of the first level of depth), the second level of depth contains lists of length 3
, 0
, 2
, 2
, and 1
. Inside that, the content consists of floating point numbers. (The type for this doubly jagged array is [0, inf) -> [0, inf) -> float64
.) It can be represented by three arrays:
-
outer counts:
0, 3, 2
-
inner counts:
3, 0, 2, 2, 1
-
inner content:
1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8
The inner jagged array instance has inner counts and inner content as its counts and content, and the outer jagged array instance has outer counts as its counts and the whole inner jagged array as its content. Recursively, we can construct jaggedness of any depth from a single JaggedArray
class.
To address the random access problem, we can consider replacing counts with its integral, offsets. An offsets array is a cumulative sum of counts, which avoids the need to recompute the sum for each lookup. Given a counts array, we compute the offsets by allocating an array one larger than counts, filling its first element with 0
, and filling each subsequent element i
with offsets[i] = offsets[i - 1] + counts[i - 1]
. Inversely, counts is the derivative of offsets, and can be derived with a vectorized counts = offsets[1:] - offsets[:-1]
. (There is a vectorized algorithm for computing the cumulative sum as well.) The nested list at index i
is content[offsets[i]:offsets[i + 1]]
. The Arrow in-memory format uses offset arrays to define arbitrary length lists.
Like jagged arrays defined by counts, jagged arrays defined by offsets are composable, but unlike counts, any element may be accessed in O(1) time. There are only a few situations in which counts may be preferable:
-
counts are non-negative small integers, which can be packed more efficiently with variable width encoding and/or lightweight compression (both of which destroy O(1) lookup time anyway);
-
counts are position-independent, allowing a large dataset to be processed in parallel without knowing the absolute positions of each parallel worker’s chunks. This is particularly useful for generating large sequences when the total size of each chunk is not known until fully generated.
One shortcoming that counts and offsets share is that they can only describe dense content. The data for list i + 1
must appear directly after the data for list i
. If we wish to view the jagged array with any interior elements removed, we would have to make a new copy of the content with those lists removed, which could trigger a deep recursive copy. It would be more efficient to allow the content to contain unreachable elements, so that these selections can be zero-copy views.
A jagged array based on counts can have unreachable elements: any content at indexes greater than or equal to sum(counts)
are not in the logical view of the jagged array. A jagged array based on offsets can have uncreachable elements at indexes less than offsets[0]
and greater than or equal to offsets[-1]
, assuming that we allow offsets[0]
to be greater than 0
. To allow interior elements to be unreachable, we have to generalize offsets into two arrays, starts and stops. These two arrays (nominally) have the same shape as each other and define the shape of the jagged array. The nested list at index i
is content[starts[i]:stops[i]]
. Given an offsets array, we can compute starts and stops by starts = offsets[:-1]
and stops = offsets[1:]
.
A jagged array defined by starts and stops can skip any interior content, can repeat elements, can list elements in any order, and can even make nested lists partially overlap. Skipping elements is useful for masking, repeating elements is useful for gathering, and reordering elements is useful for optimizing data to minimize disk page-reads. (No use for partial overlaps is currently known.) A potential cost of separate starts and stops is that it can double memory use and time spent in validation tests. However, if the starts and stops happen to be dense and in order, they can be views of a single offsets array and if this case is detected, simplified calculations may be performed.
These three arrays — starts, stops, and content — overrepresent the logical structure of a jagged array. Two jagged arrays constructed from different starts/stops/content may be compatible for elementwise operations and may even be equal. An easy way to see this is to consider the fact that the starts/stops scheme allows content to be reordered without affecting the data it represents. Another consideration is that unreachable content may differ in values or length. Only an array defined by offsets (and their starts/stops equivalent) in which offsets[0] == 0
and offsets[-1] == len(content)
have a one-to-one relationship between the logical elements of the jagged array and their underlying representation in terms of starts, stops, and content.
The starts/stops scheme is a very general way to describe a jagged array from the outside in, for efficient extraction, slicing, masking, and gathering. It is a tree structure with pointers (indexes) from the root toward the leaves. For reduction operations, however, we need pointers from the leaves toward the root: an array with (nominally) the same length as the content, indicating where each nested list begins and ends. (This is similar to database normalization, and the scheme used by Parquet, though the latter is highly transformed and bit-packed.)
The simplest inside-out scheme is to associate an integer with each content element, and distinct values of these integers indicate different nested lists. (This is closest to database normalization: aggregation over nested lists could then be performed by an SQL group-by.) For efficient access, especially if the jagged array is distributed and acted upon in parallel, we can stipulate that identical values must be contiguous, since content belonging to the same nested list must be contiguous in the starts/stops scheme. Such an array is called a uniques array. It underrepresents a jagged array in two ways:
-
it doesn’t specify an ordering of elements (though we can assume the content is in increasing order), and
-
it can’t express any empty lists (though we can assume that there are none).
Because of this underrepresentation, a uniques array can be used to generate a jagged array but can’t be used to represent one that is already defined by starts and stops. We can modify the definition of uniques to more fully specify a jagged array by requiring the unique values associated with every nested list to be the index of the corresponding starts element. This specialized uniques array is called parents.
For example, with a jagged array logically defined as
[[], [1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7], [8.8], []]
the starts, stops, and content are
-
starts:
0, 0, 3, 3, 5, 7, 8
-
stops:
0, 3, 3, 5, 7, 8, 8
-
content:
1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8
and the parents array is
-
parents:
1, 1, 1, 3, 3, 4, 4, 5
The first three elements of parents (1, 1, 1
) associate the first three contents (1.1, 2.2, 3.3
) with element 1
of starts and stops. The next two elements of parents (3, 3
) associate the next two contents (4.4, 5.5
) with element 3
of starts and stops. The fact that parents lacks 0
and 2
indicate that these are empty lists. Only empty lists at the end of the jagged array are unrepresented unless the total length of the jagged array is also given. Out of order elements can easily be expressed because parents does not need to be an increasing array. Unreachable elements can also be expressed by setting these parents elements to a negative value, such as -1
. However, repeated elements cannot be expressed, so a parents array cannot represent the result of a gather operation. Likewise, partial overlaps cannot be expressed.
Given a starts array and its corresponding parents, the following invariant holds for all 0 <= i < len(starts)
:
parents[starts[i]] == i
and the following holds for all 0 <= j < len(content)
that are at the beginning of a nested list:
starts[parents[j]] == j
Although parents is a highly expressive inside-out representation, another that is sometimes useful, called index, consists of integers that are zero at the start of each nested list and increase by one for each content element. For instance, the above example has the following index:
-
index:
0, 1, 2, 0, 1, 0, 1, 0
These values are local indexes for elements within the nested lists. For all 0 <= j < len(content)
, the following invariant holds:
starts[parents[j]] + index[j] == j
It is also useful to wrap the index array as a jagged array with the same jagged structure as the original jagged array, because then it can be used in gather operations.
All of the above discussion has focused on jagged arrays and nested jagged arrays without any regular array dimensions — that is, without dimensions whose sizes are known to be constant. Jagged arrays are more general, so a regular array may be emulated by a jagged array with constant counts, but this clearly less efficient than storing the regular dimension sizes only once. Regular dimensions that appear after (or “inside”) a jagged dimension can be represented by simply including a multidimensional array as content in a jagged array. That is, to get an array of type
[0, inf) -> [0, m) -> T
construct a jagged array whose content is an array of type [0, m) -> T
. Regular dimensions that appear before (or “outside”) a jagged dimension are harder: the starts and stops of the jagged array must both have the shape of these regular dimensions. That is, to get an array of type
[0, n) -> [0, inf) -> T
the starts and stops must be arrays of type [0, n) -> INDEXTYPE
. In a counts representation, the counts must be an array of this type. This cannot be expressed in an offsets representation because offsets elements do not have a one-to-one relationship with logical jagged array elements (another argument for starts and stops over offsets).
Some applications of awkward-array may require data that is being filled while it is being accessed. This is possible if whole-array validity constraints on array shapes are not too strict. Assuming that basic arrays can be appended atomically, or at least their lengths can be increased atomically to reveal content filled before increasing their lengths, jagged arrays can atomically grow by
-
appending content first,
-
then appending stops,
-
then appending starts.
The length of the content is allowed to be greater than or equal to the maximum stop value, and the length of stops is allowed to be greater than or equal to the length of starts. The logical length of the jagged array is taken to be the length of starts. As described above, starts and stops must have the same shape, but only for dimensions other than the first dimension.
Likewise, the length of the content may be greater than or equal to the length of the parents array. The parents array must have the same shape as the content in all dimensions other than the first.
A JaggedArray
is defined by three arrays, starts, stops, and content, which are the arguments of its constructor. Below are their single-property validity conditions. They may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
starts
: basic array of integer dtype (default isINDEXTYPE
) with at least one dimension and all non-negative values. -
stops
: basic array of integer dtype (default isINDEXTYPE
) with at least one dimension and all non-negative values. -
content
: any array (default is a basic array ofDEFAULTTYPE
).
The whole-array validity conditions are:
-
starts
must have the same (or shorter) length thanstops
. -
starts
andstops
must have the same dimensionality (shape[1:]
). -
stops
must be greater than or equal tostarts
. -
The maximum of
starts
for non-empty elements must be less than the length ofcontent
. -
The maximum of
stops
for non-empty elements must be less than or equal to the length ofcontent
.
The starts
, stops
, and content
properties are read-write; setting them invokes the same single-property validity check as the constructor. In addition, a JaggedArray
has the following read-write properties:
-
offsets
: basic array of integer dtype (default isINDEXTYPE
) with exactly one dimension, at least one element, and all non-negative values. Getting it would raise an error if thestarts
andstops
are not compatible with a dense sequence of offsets. Setting it overwritesstarts
andstops
. -
counts
: basic array of integer dtype (default isINDEXTYPE
) with at least one dimension and all non-negative values. Setting it overwritesstarts
andstops
. -
parents
: basic array of integer dtype (default isINDEXTYPE
) with at least one dimension. Setting it overwritesstarts
andstops
.
JaggedArray
has the following read-only properties and methods:
-
index
: index array with jagged structure. -
regular()
: returns a basic N-dimensional array if this jagged array happens to have regular structure; raises an error if not. -
flatten()
: returns the content without nested list boundaries. Equivalent tocontent
in a special case: when the jagged structure is describable by an offsets array andoffsets[0] == 0
andoffsets[-1] == len(content)
. Use this method instead ofcontent
to ensure generality.
When a jagged array myarray
is passed a selection
in square brackets, it obeys the following rules.
If selection
is an integer, the element at that index is extracted (handling negative indexes, if applicable). If the provided index is beyond the array’s range, an error is raised. For example,
myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
myarray[0]
# returns array([1.1, 2.2, 3.3])
myarray[1]
# returns array([], dtype=float64)
myarray[-1]
# returns array([4.4, 5.5])
If selection
is a slice, elements selected by the slice are returned as a new jagged array (handling negative indexes, if applicable). For example,
myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
myarray[1:]
# returns <JaggedArray [[] [4.4 5.5]] at 7f02018afc18>
myarray[100:]
# returns <JaggedArray [] at 7f020c214438>
If selection
is a non-jagged list or array of booleans, elements corresponding to True
values in the mask are returned as a new jagged array. The mask must be 1-dimensional and the mask and jagged array must have the same length, or an error is raised. For example,
myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
mask = numpy.array([True, True, False])
myarray[mask]
# returns <JaggedArray [[1.1 2.2 3.3] []] at 7f020e8122b0>
If selection
is a jagged array of booleans, sub-elements corresponding to True
values in the jagged mask are returned as a new jagged array. If the jagged mask and the jagged array do not have the same jagged structure, an error is raised. For example,
myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
mask = awkward.JaggedArray.fromiter([[False, True, True], [], [True, False]])
myarray[mask]
# returns <JaggedArray [[2.2 3.3] [] [4.4]] at 7f02018af8d0>
If selection
is a non-jagged list or array of integers, elements identified by the integer indexes are gathered as a new jagged array (handling negative indexes, if applicable). For example,
myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
myarray[[2, 0, 1, -1]]
# returns <JaggedArray [[4.4 5.5] [1.1 2.2 3.3] [] [4.4 5.5]] at 7f020c214438>
If selection
is a jagged array of integers, sub-elements identified by the integer local indexes are gathered as a new jagged array (handling negative indexes, if applicable). If the length of the indexes is not equal to the length of the jagged array, an error is raised. For example,
myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
indexes = awkward.JaggedArray.fromiter([[2, 2, 0], [], [1]])
myarray[indexes]
# returns <JaggedArray [[3.3 3.3 1.1] [] [5.5]] at 7f02018afa58>
If selection
is a tuple, a multidimensional extract/slice/mask/gather operation (in any combination) is performed. Any errors encountered along the way are raised. For example,
myarray = awkward.JaggedArray.fromcounts([2, 0, 1], awkward.JaggedArray.fromiter(
[[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
myarray
# returns <JaggedArray [[[1.1 2.2 3.3] []] [] [[4.4 5.5]]] at 7f02018afba8>
myarray[2, 0, 1]
# returns 5.5
myarray[myarray.counts > 0, 0, -2:]
# returns <JaggedArray [[2.2 3.3] [4.4 5.5]] at 7f020c214438>
If selection
is a string or a list or array of strings, the jagged column of the nested table or jagged subtable, respectively, for that column or those columns is returned. If there are no Table
instances nested within content
, this raises an error. For example,
myarray = awkward.JaggedArray.fromcounts([3, 0, 2], awkward.Table(
x=[1, 2, 3, 4, 5],
y=[1.1, 2.2, 3.3, 4.4, 5.5],
z=[True, False, True, False, False]))
myarray["x"]
# returns <JaggedArray [[1 2 3] [] [4 5]] at 7f020e8122b0>
myarray[["x", "y"]]
# returns <JaggedArray [[<Row 0> <Row 1> <Row 2>] [] [<Row 3> <Row 4>]] at 7f02018af860>
myarray[["x", "y"]].columns
# returns ['x', 'y']
A string or a list or array of strings is also the only acceptable argument to set-item. Columns may be added to a jagged table, provided that the jagged structure of the new columns matches that of the table.
If jagged arrays are passed into a Numpy ufunc (or equivalent mapped kernel), they are computed elementwise at the deepest level of jaggedness, adjusting for different starts/stops/content representations of the same logical structure, and broadcasting scalars and non-jagged values to the jagged structure. If not all jagged arrays have the same logical jagged structure or non-jagged arrays are not broadcastable to this structure (because they have different lengths), an error is raised.
For example,
a = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
b = awkward.JaggedArray([0, 3, 4], [3, 3, 6], [10, 20, 30, -9999, 40, 50])
c = numpy.array([100, 200, 300])
d = 1000
defines a
as [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
and b
as [[10, 20, 30], [], [40, 50]]
(-9999
is unreachable). These have the same logical strucutre, but a different physical structure.
a.starts, a.stops
# returns (array([0, 3, 3]), array([3, 3, 5]))
b.starts, b.stops
# returns (array([0, 3, 4]), array([3, 3, 6]))
Nevertheless, they can be combined in the same ufunc because they have the same logical structure, matching sub-element to sub-element before computing. Basic array c
is (conceptually) promoted to a jagged array before operating as an instance of jagged broadcasting, and d
is promoted as usual for scalar broadcasting.
numpy.add(a, b)
# returns <JaggedArray [[11.1 22.2 33.3] [] [44.4 55.5]] at 7f02018afc50>
numpy.add(a, c)
# returns <JaggedArray [[101.1 102.2 103.3] [] [304.4 305.5]] at 7f02018afba8>
numpy.add(a, d)
# returns <JaggedArray [[1001.1 1002.2 1003.3] [] [1004.4 1005.5]] at 7f02018afd30>
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + b
, a + c
, and a + d
.
JaggedArray
reducers differ from generic reducers in that they only reduce the innermost level of jaggedness: inner nested lists are replaced with scalars, but the total structure is still an array. Hence, a reduced singly-jagged array is a non-jagged array, and a reduced doulby-jagged array is a singly-jagged array. The reduced array has the same length as the unreduced jagged array.
-
any()
: returns an array ofBOOLTYPE
; each isTrue
if the corresponding nested list has any non-masked, non-zero values andFalse
if not or if the nested list has no non-masked values at all. -
all()
: returns an array ofBOOLTYPE
; each isTrue
if the corresponding nested list’s only non-masked values are non-zero, including the case in which the nested list has no non-masked values at all;False
otherwise. -
count()
: returns an array ofINDEXTYPE
, the number of non-masked values in each nested list. -
count_nonzero()
: returns an array ofINDEXTYPE
, the number of non-masked, non-zero values in each nested list. -
sum()
: returns an array with the same dtype as thecontent
(ifcontent
has a well-defineddtype
), the sum of non-masked values in each nested list. Lists with no non-masked values yield0
. -
prod()
: returns an array with the same dtype as thecontent
(ifcontent
has a well-defineddtype
), the product of non-masked values in each nested list. Lists with no non-masked values yield1
. -
min()
: returns an array with the same dtype as thecontent
(ifcontent
has a well-defineddtype
), the minimum of non-masked values in each nested list. Lists with no non-masked values yieldinf
for floating point types and the maximum integer value for integer types. -
max()
: returns an array with the same dtype as thecontent
(ifcontent
has a well-defineddtype
), the maximum of non-masked values in each nested list. Lists with no non-masked values yield-inf
for floating point types and the minimum integer value for integer types.
The jagged argmin()
and argmax()
methods are not reducers: they return jagged arrays of the local index that minimizes or maximizes the non-masked values in each nested list. If a nested list has no non-masked values, the corresponding nested list in the output is empty. If an output nested list is not empty, it has exactly one value. Data in this form is usable in gather operations.
JaggedArray
has the following structure manipulation methods:
-
cross(other)
: creates a jagged table with columns"0"
,"1"
,"2"
, etc. that is the cross-join of nested list inself
andother
.self
andother
must have the same length, and the resulting jagged table has the same length. This meethod can be chained:a.cross(b).cross(c)
. -
argcross(other)
: likecross(other)
, except that the values in the table are not elements ofcontent
but their local indexes, usable in gather operations. Unlikecross(other)
, chains ofargcross(other)
produce nested tables with only"0"
and"1"
columns. -
pairs()
andargpairs()
: likecross(self)
andargcross(self)
except that if the pair corresponding to local indexesi
andj
are included, the pair corresponding to local indexesj
andi
are not. -
distincts()
andargdistincts()
: likepairs()
andargpairs()
except that pairs corresponding to local indexesi
andi
are not included. -
JaggedArray.concatenate(arrays)
andinstance.concatenate(arrays)
: concatenates the jagged arrays, includinginstance
if called as an instance method. Thearrays
is must be a list of jagged arrays, likenumpy.concatenate
. -
JaggedArray.zip(columns)
andinstance.zip(columns)
: builds a jagged table from a set ofcolumns
(same constructor specification as theTable
class, defined below). Includesinstance
if called as an instance method.
A JaggedArray
may be created from one of the following alternate constructors.
-
iterable
: a list of lists of a primitive type, corresponding to a jagged array of some fixed depth:[0, n) -> [0, inf) -> T
,[0, n) -> [0, inf) -> [0, inf) -> T
, etc.
-
offsets
: basic array of integer dtype (default isINDEXTYPE
) with exactly one dimension, at least one element, and all non-negative values. -
content
: any array (default is a basic array ofDEFAULTTYPE
).
-
offsets
: basic array of integer dtype (default isINDEXTYPE
) with at least one dimension and all non-negative values. -
content
: any array (default is a basic array ofDEFAULTTYPE
).
-
uniques
: basic array of integer dtype (default isINDEXTYPE
) with exactly one dimension and the same length ascontent
. -
content
: any array (default is a basic array ofDEFAULTTYPE
).
-
parents
: basic array of integer dtype (default isINDEXTYPE
) with exactly one dimension and the same length ascontent
. -
content
: any array (default is a basic array ofDEFAULTTYPE
). -
length
: if notNone
, a non-negative integer setting the length of the resulting jagged array; useful for adding empty lists at the end or truncating.
-
index
: basic array or jagged array of integer dtype (default isINDEXTYPE
). If a jagged array, only a flattened version of the jagged array is considered. The basic or flattenedindex
must have exactly one dimension and the same length ascontent
. -
content
: any array (default is a basic array ofDEFAULTTYPE
). -
validate
: ifTrue
, raise an error if non-zero values are not exactly one greater than the previous and raise an error ifindex
is jagged and the jagged structure ofindex
differs from the jagged structure derived from its values.
-
jagged
: jagged array to convert to the given class (without copying data, if possible).
-
regular
: basic array (default hasDEFAULTTYPE
) with more than one dimension. The array’s regular shape is replaced with the corresponding jagged structure.
The awkward.array.jagged
submodule may define helper functions, such as the following.
-
offsetsaliased(starts, stops)
: returnsTrue
if the starts and stops arrays overlap in memory and are consistent with a single offsets array atstarts.base
(or equivalently,stops.base
);False
otherwise. -
counts2offsets(counts)
: convert a counts array to an offsets array. -
offsets2parents(offsets)
: convert an offsets array to a parents array. -
startsstops2parents(starts, stops)
: convert a general starts/stops pair to a parents array. -
parents2startsstops(parents, length=None)
: convert a parents array to a starts/stops pair, optionally with a givenlength
. Thislength
may cause empty nested lists to be added at the end of thestarts
andstops
representing a jagged structure or it may truncate the jagged structure, depending on whether it is greater or less thanparents.max()
. -
uniques2offsetsparents(uniques)
: convert a uniques array to a 2-tuple of offsets and parents. -
aligned(*jaggedarrays)
: returnTrue
if alljaggedarrays
have the same jagged structure;False
otherwise.
Product types, or arrays of records with a fixed set of named, typed fields can be conceptually represented as tables. The “row-wise” vs. columnar representations discussed in the Jaggedness section were first developed in the context of tables. The “row” and “table” terminology came from a discussion of tables: named, typed attributes are conventionally associated with columns of a data table, while anonymous data points fill the rows. A row-wise data representation can be replaced with a columnar representation by simply transposing it in memory, or at least writing each column of data to a separate, equal-length array. Columnar layouts have been used in tabular databases since TAXIR in 1969.
Numpy has a product type called a structured array or record array. This is a row-wise data representation, which would be hard to mix with columnar jagged arrays. Instead of using structured arrays from the base library directly, awkward-array defines a Table
type with the same syntax.
Like Numpy’s structured arrays, Table
columns are selected by strings in a get-item, these string get-items commute with extract/slice/mask/gather get-items, and they can’t be used in the same multidimensional tuple with extract/slice/mask/gather get-items. (Despite the tabular metaphors, columns are not a dimension in the sense of N-dimensional arrays; they’re a qualitatively different kind of accessor.) Unlike Numpy’s structured arrays, Table
columns have no constraints on where they reside in memory: they may be strides across a Numpy structured array, they may be fully columnar arrays in an Arrow buffer, or they may be Numpy arrays, scattered in memory.
The Table
interface hides the distinction between an array of structs and a struct of arrays, an important transformation for preparing data for vectorization. It is used to create objects whose attributes may be widely dispersed in memory, or (through a VirtualArray
) not all loaded into memory. (To avoid materializing a VirtualArray
, the string representation of Table.Row
does not show internal data.)
Regularly divided tables, such as
[0, n) -> [0, m) -> "one" -> bool
"two" -> int64
"three" -> float64
can be expressed by giving all columns the same dimensionality (shape[1:]
). This is because the above is equivalent to
[0, n) -> "one" -> [0, m) -> bool
"two" -> [0, m) -> int64
"three" -> [0, m) -> float64
which is a Table
whose column arrays all have shape (n, m)
.
A Table
is defined by an arbitrary number of named arrays, which are columns of the table. A Table
need not represent purely tabular data; if it is nested within a JaggedArray
, it is a jagged table, and if it contains any JaggedArray
, it is a stringy table. Columns may be generated from any basic array, awkward-array, or Python iterable, with DEFAULTTYPE
as the default type of empty iterables.
The Table
constructor permits the following argument patterns:
-
Table(column1, column2, ...)
: initialize with unnamed column arrays. Column names are strings of integers starting with zero ("0"
,"1"
,"2"
, etc.). -
Table({"column1": column1, "column2": column2, ...})
: initialize with a single dict (may be an ordered dict). Column names are keys of the dict. -
Table(column1=column1, column2=column2)
: initialize with keywords. Column names are the keywords.
Pattern 1 and pattern 2 are incompatible; the first argument is either a subclass of dict or not. More than one positional argument in pattern 2 is not allowed. Both of the first two patterns are compatible with pattern 3: they may be freely mixed, as long as column names are never repeated (impossible with pattern 1).
After construction, columns can be added, overwritten, and removed using Table’s
set-item and del-item methods. The fact that Tables
may be nested is the only reason awkward-arrays have set-item and del-item methods: to pass a new column to a nested Table
or request that one of its columns be deleted. Columns maintain their order (following Python’s ordered dict semantics).
Table
has no whole-array validity conditions. The columns might have different lengths, but the total length of the Table
is given by the minimum length of all contained columns (zero if there are no columns).
A Table
applies slices, masks, and gather indexes lazily: rather than immediately applying these selections, they are stored as an internal view and applied when a single column is selected. Thus, if any columns are VirtualArrays
, they won’t be materialized unless that particular column is requested. Internal views must therefore be composed.
Table
has the following read-write properties:
-
rowname
: defaults to"Row"
, but may be any string. Can also be set by theTable.named
alternate constructor. <<`Table.named(rowname, ...)`,See below>> for an explanation. -
contents
: the columns as an ordered dict. (This is an assignable view, not a copy.)
Table
has the following read-only properties and methods:
-
base
: if thisTable
is a view,base
is the original table. If not,base
isNone
.
When a table myarray
is passed a selection
in square brackets, it obeys the following rules.
If selection
is a string, one column is pulled from the table. If the column lengths do not match, its length is truncated to the table length — the minimum of all column lengths. For example,
myarray = awkward.Table(x=[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
y=[100, 101, 102, 103, 104, 105, 106],
n=[0, 1, 2, 3, 4])
myarray
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 72afb63cba90>
myarray["x"]
# returns array([0. , 1.1, 2.2, 3.3, 4.4])
myarray["y"]
# returns array([100, 101, 102, 103, 104])
myarray["n"]
# returns array([0, 1, 2, 3, 4])
myarray[["x", "y"]]
# returns <Table [<Row 0> <Row 1> <Row 2> ... <Row 4> <Row 5> <Row 6>] at 7005965b6400>
myarray[["x", "y"]].columns
# returns ['x', 'y']
myarray[["x", "y"]].tolist()
# returns [{'x': 0.0, 'y': 100}, {'x': 1.1, 'y': 101}, {'x': 2.2, 'y': 102},
{'x': 3.3, 'y': 103}, {'x': 4.4, 'y': 104}, {'x': 5.5, 'y': 105},
{'x': 6.6, 'y': 106}]
If selection
is any integer, slice, list or array of booleans, or list or array of integers, the extraction/slicing/masking/gathering operation is applied to the rows, as though it were any other array. For example,
myarray = awkward.Table(x=[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
n=[0, 1, 2, 3, 4])
myarray
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 70e1687f9a58>
myarray[3]
# returns <Row 3>
>>> myarray[3:]
# returns <Table [<Row 3> <Row 4>] at 7e55fe51a278>
The subset of rows have persistent numbers (e.g. “Row 3” in the sliced output is the same object as “Row 3” in the base) because Table
views remember their internal viewing state.
Column-projection and extraction/slicing/masking/gathering is order-independent: get-item operations applied in either order return the same output (they commute). For example,
myarray["x"][-3:]
# returns array([2.2, 3.3, 4.4])
myarray[-3:]["x"]
# returns array([2.2, 3.3, 4.4])
This is because a single row of a table is represented by a Table.Row
, which has a get-item method for its place in a Table
. If a Table.Row
is iterated over, its length and iteration correspond to the fields named as consecutive integer strings, starting from zero: "0"
, "1"
, "2"
, etc.
Column-projection and extraction/slicing/masking/gathering cannot be performed in the same tuple, and column-projection of nested tables cannot be performed in the same tuple. Nor do column-projections of nested tables commute. Attempting to do so would raise an erorr. For example,
points = awkward.Table(x=[0.0, 1.1, 2.2, 3.3], y=[0, 100, 101, 102, 103])
myarray = awkward.Table(points=points, n=[0, 1, 2, 3])'
myarray["points"]["x"]
# returns array([0. , 1.1, 2.2, 3.3])
myarray["points"]["y"]
# returns array([ 0, 100, 101, 102])
myarray["n"]
# returnsarray([0, 1, 2, 3])
Tables inside of other awkward-array components may not be strictly rectangular. For example, a JaggedArray
of Table
is a jagged table:
myarray = awkward.JaggedArray.fromcounts([3, 0, 2], awkward.Table(
x=[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
n=[0, 1, 2, 3, 4]))
myarray
# returns <JaggedArray [[<Row 0> <Row 1> <Row 2>] [] [<Row 3> <Row 4>]] at 7e33f10569e8>
myarray["x"]
# returns <JaggedArray [[0. 1.1 2.2] [] [3.3 4.4]] at 7e33e188c438>
myarray["n"]
# returns <JaggedArray [[0 1 2] [] [3 4]] at 7e33e188c470>
Other awkward-array components inside of tables may not be strictly rectangular. For example, a Table
containing a JaggedArray
is a stringy table:
myarray = awkward.Table(
x=awkward.JaggedArray.fromcounts(
[4, 0, 2, 2, 1],
[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8]),
n=[0, 1, 2, 3, 4])
myarray
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 73ab6e406a20>
myarray["x"]
# returns <JaggedArray [[0. 1.1 2.2 3.3] [] [4.4 5.5] [6.6 7.7] [8.8]] at 73ab6a1a3e48>
myarray["n"]
# returns array([0, 1, 2, 3, 4])
TODO: multidimensional indexes through a Table
.
If tables are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is applied separately to each column. If multiple tables are passed into the same ufunc with different sets of columns, an error is raised, and if they have different lengths, an error is raised. For example,
a = awkward.Table(x=[0.0, 1.1, 2.2, 3.3, 4.4], n=[0, 1, 2, 3, 4])
b = awkward.Table(x=[0, 100, 200, 300, 400], n=[0, 100, 200, 300, 400])'
numpy.add(a, b)
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 74ce37c32320>
numpy.add(a, b).tolist()
# returns [{'x': 0.0, 'n': 0}, {'x': 101.1, 'n': 101}, {'x': 202.2, 'n': 202},
{'x': 303.3, 'n': 303}, {'x': 404.4, 'n': 404}]
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + b
.
-
rowname
: a string to labelTable.Row
objects.
The row name is used for display purposes (so that “rows” have a more meaningful name in a science domain) and may be used by methods to distinguish types that are structurally identical. For instance, “positions” and “directions” in a 3-dimensional space may both contain columns named "x"
, "y"
, and "z"
, but they should be transformed differently when a coordinate system is rotated.
The existence of a label allows what would usually be a structural type system (tables are identified by the fields they contain) to be treated as a nominative type system (tables are identified by their type name).
-
view
:None
or 3-tuple ofstart
,step
,length
(integers) or base array of gather indexes -
base
: anotherTable
Constructs a view into an existing Table
, using a representation of views. None
means no view (the new Table
is identical to the base
). The 3-tuple represents a slice in a basis that is independent of table length and is easier to compose: start
is the starting element, same as a slice but strictly non-negative, step
is a step size, same as a slice (cannot be zero), and length
is the number of steps to take, rather than truncating by a stop
. Gather indexes are the same as indexes that would be passed to get-item. A boolean mask can be converted into gather indexes with numpy.nonzero
.
Sum types, or tagged unions, allow us to build heterogeneous arrays. As a data type, tagged unions are needed to express a collection that mixes data of incompatible types, but our use of tagged unions is broader: we may want to mix data that reside in different columnar arrays, regardless of whether they’re different types. This allows us to express the result of a blend (in the SIMD sense) without copying data. For example, SparseArray
needs to blend data from a sparse lookup table with zeros from a different source when it is sliced; it uses a UnionArray
to represent that result.
The general structure of a UnionArray
is a collection of arrays with a tags array to specify which is active in each element. If tags[i]
is 3
, then the array value at i
is drawn from array 3
. In Arrow terminology, the tags array is the “types buffer.”
If we always draw element i
from the array at tags[i]
, then all other arrays would have to be padded with unreachable elements at i
, what Arrow calls a “sparse union.” Instead, we add another array, an index to identify the elements to draw from the selected arrays; we use what Arrow calls a “dense union.” (Arrow calls this index the “offsets,” but it is more similar to the index of our IndexedArray
than the offsets of our JaggedArray
.)
Given a set of arrays contents
, a tags array tags
, and an index array index
, the element at i
is:
contents[tags[i]][index[i]]
It is possible to emulate an Arrow sparse union by setting the index to a simple numeric range (numpy.arange(len(tags))
). It is possible to generate an index for a union whose contents are in order and have no padding:
index = numpy.full(tags.shape, -1)
for tag, content in enumerate(contents):
mask = (tags == tag)
index[mask] = numpy.arange(numpy.count_nonzero(mask))
In circumstances where the index can be derived, it does not need to be stored.
Regularly divided unions, such as
[0, n) -> [0, m) -> (int64 |
complex128)
can be expressed by giving the tags and index arrays a multidimensional shape. The length of the tags must be less than or equal to the length of the index, but all dimension sizes after the first must be identical.
A UnionArray
is defined by two arrays and an ordered sequence of arrays. Below are their single-property validity conditions. Arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
tags
: basic array of integer dtype (default isTAGTYPE
) with at least one dimension and all non-negative values. -
index
: basic array of integer dtype (default isINDEXTYPE
) with at least one dimension and all non-negative values. -
contents
(note plural): non-empty Python iterable of any arrays (default are basic arrays ofDEFAULTTYPE
).
The whole-array validity conditions are:
-
tags
length must be less than or equal toindex
length. -
tags
andindex
must have the same dimensionality (shape[1:]
). -
The maximum of
tags
must be less than the number of arrays incontents
. -
The maximum of
index
must be less than the minimum length ofcontents
arrays.
The tags
, index
and contents
properties are read-write; setting them invokes the same single-property validity check as the constructor. In addition, a UnionArray
has the following read-only properties:
-
issequential
: isTrue
if allcontents
are in order with no padding; in which case, theindex
is redundant and could be generated byUnionArray.fromtags
.
When a union array myarray
is passed a selection
in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray
performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections
are passed down to a nested Table
, if it exists.
For example,
myarray = awkward.UnionArray.fromtags([0, 1, 1, 0, 0, 1], [
numpy.array([1.1, 2.2, 3.3]),
awkward.JaggedArray.fromiter([[100, 200, 300], [], [400, 500]])])
myarray
# returns <UnionArray [1.1 [100 200 300] [] 2.2 3.3 [400 500]] at 7f5e1aceb7b8>
myarray[1:5]
# returns <UnionArray [[100 200 300] [] 2.2 3.3] at 7f5e1acf0f98>
myarray[1, 2]
# returns 300
Some of these selections
may not be valid for all contents
. Whether their application raises an error depends on which contents
are touched by the selection
. That is, a user can avoid an indexing error by applying an appropriate mask to avoid selecting rows or columns from nested content where those rows or columns do not exist. For example,
myarray = awkward.UnionArray.fromtags([0, 1, 0, 0, 1], [
numpy.array([1.1, 2.2, 3.3]),
awkward.JaggedArray.fromiter([[100, 200, 300], [400, 500]])])
myarray
# returns <UnionArray [1.1 [100 200 300] 2.2 3.3 [400 500]] at 7f5e1aceb630>
myarray[myarray.tags == 1, :2]
# returns <JaggedArray [[100 200] [400 500]] at 7f5e1aceb7b8>
A second dimensional index would be wrong for contents[0]
, a basic 1-dimensional array of floating point numbers. By masking with myarray.tags == 1
, we ensure that this index is not applied where it shouldn’t be.
If union arrays are passed into a Numpy ufunc (or equivalent mapped kernel), they are computed separately for each of the contents
(if possible) and those results are combined into a new union array as output. They do not need to have the same set of tags, but they need to have the same lengths.
For example,
a = awkward.UnionArray.fromtags([0, 1, 1, 0, 0, 1], [
numpy.array([1.1, 2.2, 3.3]),
awkward.JaggedArray.fromiter([[100, 200, 300], [], [400, 500]])])
a
# returns <UnionArray [1.1 [100 200 300] [] 2.2 3.3 [400 500]] at 7f5e1aceb710>
numpy.add(a, 10)
# returns <UnionArray [11.1 [110 210 310] [] 12.2 13.3 [410 510]] at 7f5e1aceb6d8>
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + 10
.
In type theory, option types may be considered a special case of sum types: ?T
is the sum of T
with a unit type; a unit type has only one possible value, null. As described above, we do not wish to introduce an array type whose only information content is the shape of the array.
Additionally, we implement option types in a different way from unions: as boolean masks. With the exception of IndexedMaskedArray
, Each missing value in a masked array has only one bit of information, the fact that it is missing. A single boolean mask array suffices. An awkward-array library has three masked array types:
-
MaskedArray
(superclass): the mask array has one boolean per byte. -
BitMaskedArray
: the mask array has one boolean per bit, with padding to fill a whole number of bytes. -
IndexedMaskedArray
: the mask array functions both as a mask, with a negative value like-1
indicating that an element is missing, and as an index, so that the content does not need to have unreachable elements. This can be important if content values are large, such as a wideTable
.
Numpy has a numpy.ma.MaskedArray
type that uses one boolean per byte to indicate missing values. Arrow defines all types as potentially masked with one boolean per bit to indicate missing values. Neither have an equivalent for IndexedMaskedArray
.
With MaskedArray
and BitMaskedArray
, there is a two-fold ambiguity: should True
mean that a value is missing or that a value is present? Both classes have a maskedwhen
argument indicating which boolean value is a masked value (default is True
, values of True
in the mask array mean data are missing). Numpy’s numpy.ma.MaskedArray
has maskedwhen = True
, and Arrow’s bitmasks have maskedwhen = False
.
With BitMaskedArray
, there is another two-fold ambiguity: should bits read from most significant to least significant or least significant to most significant in each byte? This is a bit-level equivalent of the endianness ambiguity, but it is not decided by hardware because most CPU instruction sets don’t operate on individual bits. BitMaskedArray
has an lsborder
that is True
for Least Significant Bit (LSB) ordering and False
for Most Significant Bit (MSB) ordering. Arrow’s bitmasks have lsborder = True
.
IndexedMaskedArray
has an integer-typed mask array, so it has no maskedwhen
. Any negative value corresponds to being masked.
Regularly divided optional types, such as
[0, n) -> [0, m) -> ?T
can be expressed by giving the mask arrays a multidimensional shape. This is not possible for BitMaskedArray
, since bits cannot be shaped, nor can an exact length be prescribed, since bits must pack into bytes and therefore pad up to seven values. Therefore, BitMaskedArray
additionally has a maskshape
to define the sizes of all dimensions, including the first (length).
The value returned for missing data is MaskedArray.mask
, which is by default None
. BitMaskedArray
and IndexedMaskedArray
inherit from MaskedArray
, so setting MaskedArray.mask
changes the return value for missing data globally.
A MaskedArray
is defined by two arrays and a boolean maskedwhen
. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
mask
: basic array of boolean dtype (default isMASKTYPE
) with at least one dimension. -
content
: any array (default is a basic array ofDEFAULTTYPE
). -
maskedwhen
: boolean; elementi
is considered missing ifmask[i] == maskedwhen
(default isTrue
).
The whole-array validity conditions are:
-
flattened
mask
length must be less than or equal to thecontent
length.
The length of the MaskedArray
is determined by the length of the mask
array.
Masked arrays (all types) have the following read-only properties:
-
masked
: boolean per byte array with the length of the array;True
where values are masked,False
where they are not (independent ofmaskedwhen
). -
unmasked
: negation ofmasked
.
When a masked array (any type) myarray
is passed a selection
in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray
performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections
are passed down to a nested Table
, if it exists.
For example,
myarray = awkward.MaskedArray([False, True, True, False], awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [999], [4.4, 5.5]])) myarray # returns <MaskedArray [[1.1 2.2 3.3] None None [4.4 5.5]] at 7f5e1aceb7b8> myarray[0] # returns array([1.1, 2.2, 3.3]) myarray[1] # returns None myarray[myarray.isunmasked, 1:] # returns <MaskedArray [[2.2 3.3] [5.5]] at 7f5e1acf0f60>
If masked arrays (any type) are passed into a Numpy ufunc (or equivalent mapped kernel), values that are not masked in all inputs (including any non-masked arrays) are converted into IndexedMaskedArrays
without padding before applying the ufunc. Unnecessary values do not enter the calculation.
For example,
a = awkward.MaskedArray([False, False, True, False, True], [1.1, 2.2, 3.3, 4.4, 5.5])
b = awkward.MaskedArray([False, True, True, False, False], [100, 200, 300, 400, 500])
a
# returns <MaskedArray [1.1 2.2 None 4.4 None] at 7f5e1aceb6d8>
b
# returns <MaskedArray [100 None None 400 500] at 7f5e1aceb710>
numpy.add(a, b)
# returns <IndexedMaskedArray [101.1 None None 404.4 None] at 7f5e1acf0f98>
numpy.add(a, b).content
# returns array([101.1, 404.4])
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + b
.
MaskedArray
and its subclasses (BitMaskedArray
and IndexedMaskedArray
) have the following methods:
-
boolmask(maskedwhen=None)
: return themask
as boolean bytes. Ifmaskedwhen
isNone
, use the instance’smaskedwhen
. Otherwise, override it. (IndexedMaskedArray.boolmask
has a defaultmaskedwhen
ofTrue
.) -
indexed()
: convert to anIndexedMaskedArray
.
A BitMaskedArray
is defined by two arrays, a boolean maskedwhen
, a boolean lsborder
, and a shape parameter maskshape
. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
mask
: basic array with exactly one dimension; will be viewed asBITMASKTYPE
. -
content
: any array (default is a basic array ofDEFAULTTYPE
). -
maskedwhen
: boolean; same meaning as inMaskedArray
. -
lsborder
: boolean; ifTrue
, bits inmask
are interpreted in LSB (least significant bit) order; ifFalse
, bits inmask
are interpreted in MSB (most significant bit) order. -
maskshape
:None
, a non-negative integer, or a tuple of positive integers (first may be zero); the sizes of the logical mask dimensions. If an integer,maskshape
will be converted to(maskshape,)
. IfNone
(the default), themaskshape
will be assumed to be(len(content),)
. A value ofNone
is persistent, so an unspecifiedmaskshape
scales with changes incontent
.
The whole-array validity conditions are:
-
The length of the
BitMaskedArray
must be less than or equal to thecontent
length. -
The length of the
mask
must be greater than or equal to8
times the length of theBitMaskArray
.
The length of the BitMaskedArray
depends on maskshape
: if None
, the length is the content
length. Otherwise, the length is maskshape[0]
.
In addition to methods defined in MaskedArray
, a BitMaskedArray
has the following static methods:
-
BitMaskedArray.bit2bool(bitmask, lsborder=False)
: converts one boolean per bit into one boolean per byte with a specifiedlsborder
. -
BitMaskedArray.bool2bit(boolmask, lsborder=False)
: converts one boolean per byte into one boolean per bit with a specifiedlsborder
.
A BitMaskedArray
may be created from one of the following alternate constructors.
-
mask
: one boolean per byte array; converted to one boolean per bit withBitMaskedArray.bool2bit(mask, lsborder=lsborder)
. -
content
: same as primary constructor. -
maskedwhen
: same as primary constructor. -
lsborder
: same as primary constructor. -
maskshape
: same as primary constructor.
An IndexedMaskedArray
is defined by two arrays. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
mask
: a basic array of integer dtype (default isINDEXTYPE
) with at least one dimension. -
content
: any array (default is a basic array ofDEFAULTTYPE
).
The whole-array validity conditions are:
-
maximum of
mask
(if non-negative) must be less than thecontent
length.
The length of the IndexedMaskedArray
is the length of the mask
.
Most programming environments have a concept of a “pointer” or “reference” that allows one object to be logically nested within another without being nested in the memory layout. The referenced object may be anywhere in memory and might not conform to the structure required of its type (depending on how strictly the language maintains type-safety). Completely general pointers cannot be emulated with arrays unless the entirety of a program’s memory were put into a single array. However, a limited form of indirection can be implemented through arrays of indexes.
As described in the types section, awkward-array allows the same data to appear in multiple parts of the data structure or even to contain themselves. In Python, awkward-arrays are Python instances whose members can be reassigned after construction, so nothing prevents an array from appearing in multiple parts of a structure or from containing itself.
To facilitate this kind of indirection, the IndexedArray
class represents a delayed gather operation: it contains an array of indexes and a content array: extraction, slicing, masking, and gathering are filtered through the indexes before selecting contents. Its content could be itself, allowing the creation of graphs, though a JaggedArray
or UnionArray
in between would be needed to keep the graph finite.
IndexedArray
acts as a bound for bounded pointers: part of a data structure with IndexedArray
type can point to any element of the IndexedArray’s
content. To bind pointers to more than one pool, combine them with UnionArray
.
In a sense, a SparseArray
is the opposite of an IndexedArray
. A SparseArray
contains logical indexes where the contents are not zero (or some other default) and content for each of those indexes, known as coordinate format (COO). Whereas logical element i
of an IndexedArray
is at content index index[i]
, content element j
of a SparseArray
is at logical index index[j]
. An IndexedArray
applies its index array as a function to obtain elements, a SparseArray
inverts its index array as a function to obtain elements.
Since SparseArray
must invert its index with every extraction, the index should be monatonically increasing (sorted). If a set of (index, content) pairs are known, they could be loaded into a SparseArray
like this:
index, content # coordinates as two equal-length arrays
order = numpy.argsort(index)
awkward.SparseArray(length, index[order], content[order])
IndexedArray
and SparseArray
both have the data type of their content — they are invisible at the type level, providing low-level features.
An IndexedArray
is defined by two arrays. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
index
: basic array of integer dtype (default isINDEXTYPE
) with at least one dimensions and all non-negative values. -
content
: any array (default is a basic array ofDEFAULTTYPE
). -
dictencoding
: boolean (default isFalse
). IfTrue
, equality tests (==
and!=
ornumpy.equal
andnumpy.not_equal
) do not propagate through to the content, but apply at theIndexedArray
level and check for equality of the indexes. This makesIndexedArray
usable as a dictionary encoding for categorical data.
The whole-array validity conditions are:
-
The maximum of
index
must be less than the length ofcontent
.
The length of an IndexedArray
is the length of the index
array.
When an indexed array myarray
is passed a selection
in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray
performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections
are passed down to a nested Table
, if it exists.
For example,
myarray = awkward.IndexedArray([2, 2, 1, 4], [0.0, 1.1, 2.2, 3.3, 4.4, 5.5])
myarray
# returns <IndexedArray [2.2 2.2 1.1 4.4] at 772e306077f0>
myarray[2]
# returns 1.1
myarray[2:]
# returns array([1.1, 4.4])
Here is another example, this one using a cyclic reference to build arbitrary depth trees.
myarray = awkward.IndexedArray([0],
awkward.UnionArray.fromtags([1, 0, 1, 0, 1, 0, 0, 1], [
numpy.array([1.1, 2.2, 3.3, 4.4]),
awkward.JaggedArray([1, 3, 5, 8], [3, 5, 8, 8], [])])) # the [] will be replaced
myarray.content.contents[1].content = myarray.content
myarray
# returns <IndexedArray [[1.1 [2.2 [3.3 4.4 []]]]] at 746bf6c422b0>
myarray[0, 1]
# returns <UnionArray [2.2 [3.3 4.4 []]] at 746bf6c422e8>
myarray[0, 1, 1]
# returns <UnionArray [3.3 4.4 []] at 746bf6c42390>
myarray[0, 1, 1, 2]
# returns array([], dtype=float64)
The depth of this tree is not a function of the depth of the IndexedArray
of UnionArray
of basic and JaggedArray
that built it. The depth of this tree is a function of the values of the index
array, the tags
array, and the starts
/stops
arrays. This construction is a purely columnar tree of numbers and sub-trees.
If dictencoding
is True
, the equality tests (==
and !=
or numpy.equal
and numpy.not_equal
) do not propagate through to the content, but apply at the IndexedArray
level and check for equality of the indexes.
If indexed arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the delayed gather is applied before computing the result. This even works in arbitrarily nested cases, like the last examples in the previous section.
numpy.sum(myarray, 10)
# returns <JaggedArray [[11.1 [12.2 [13.3 14.4 []]]]] at 746bf6c42400>
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been myarray + 10
.
A SparseArray
is defined by a shape, two arrays, and a default element. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
indexshape
: non-negative integer or a tuple of positive integers (first may be zero); the sizes of the logical dimensions. If an integer,indexshape
will be converted to(indexshape,)
. -
index
: basic array of integer dtype (default isINDEXTYPE
) with exactly one dimension and all non-negative values. This array must be monatonically increasing (sorted). -
content
: any array (default is a basic array ofDEFAULTTYPE
). -
default
:None
or any value. IfNone
, an appropriate zero will be generated:-
content.dtype.type(0)
ifcontent
is a 1-dimensional basic array; -
numpy.zeros(content.shape[1:], content.dtype)
ifcontent
is a multidimensional basic array; -
empty jagged array if
content
is a jagged array; -
the masked value if
content
is a masked array; -
None
ifcontent
is an object array; -
an empty string if
content
is a string array; -
the first basic array zero if
content
is a union array; the first other type if the union has no basic arrays; -
a
Table.Row
of defaults ifcontent
is a table; -
a decision based on the content of any other type.
-
The whole-array validity conditions are:
-
flattened
index
length must be less than or equal to thecontent
length.
The length of the SparseArray
is determined purely by the indexshape
.
When a sparse array myarray
is passed a selection
in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray
performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections
are passed down to a nested Table
, if it exists.
For example,
myarray = awkward.SparseArray(1000, [101, 102, 105, 800], [1.1, 2.2, 3.3, 4.4])
myarray
# returns <SparseArray [0.0 0.0 0.0 ... 0.0 0.0 0.0] at 7131e4b9a438>
myarray[100:106]
# returns <SparseArray [0.0 1.1 2.2 0.0 0.0 3.3] at 7131e4b9a518>
myarray[798:803]
# returns <SparseArray [0.0 0.0 4.4 0.0 0.0] at 7131e4b9a550>
If sparse arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is computed for all non-default values and separately for the default value, blending the results as a UnionArray
.
For example (reusing myarray
from the previous section),
numpy.add(myarray, 10)[100:106]
# returns <UnionArray [10.0 11.1 12.2, 10.0 10.0 13.3] at 746bf6c41800>
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been (myarray + 10)[100:106]
.
The awkward.array.indexed
submodule may define helper functions, such as the following.
-
invert(permutation)
: returnsinverse
such thatinverse[permutation] == numpy.arange(len(permutation))
is the identity. (Ifpermutation
contains all values from0
tolen(permutation) - 1
, it is also the case thatpermutation[inverse] == numpy.arange(len(permutation))
.) If not all values inpermutation
are distinct, this function raises an error.
The array types defined above are sufficient to create rich data types — most of the types expected in a general programming environment. With columnar layouts in memory, they take a minimum of space and regular operations can be applied on them very quickly. However, all of these are awkward-array types: only Numpy ufuncs and Python get-item know how to operate on them. Situations will arise in which types must satisfy third-party constraints.
Data structures built by combining awkward-arrays are constructive (built by construction), instances of other types are opaque (not known to the awkward-array library). To emulate an array of opaque objects, we wrap it in an ObjectArray
that applies a function to an element i
to generate the object at i
. The object must be a pure function of the data at element i
and not maintain long-lived state.
Get-item selections and mapped kernels perform vectorized operations across all or much of the array, and if the object type has methods, users may want to apply the methods as vectorized operations as well. Instantiating all elements in the array and invoking the method on all of them misses the point (one might as well use a Python list or a Numpy object array), so there is an alternate way to apply them: as vectorized operations on the data used to generate the objects.
Here is a motivating example: a Table
of floating point "x"
and "y"
columns is wrapped in an ObjectArray
with a Point
constructor to effectively make an array of user-defined Point
objects. Point
instances have an angle
method the computes math.atan2(self.y, self.x)
. Users want to compute the angle
of all values in the array without constructing Point
for each. We therefore add a method angle
to ObjectArray
that computes numpy.arctan2(self["x"], self["y"])
.
These methods are added with a mix-in facility that accepts any class containing pure-function methods (no persistent state) and has no init
method. This is where different languages will put the most constraint on what can be done. Mix-ins are equivalent to Java’s Interfaces, but in a statically compiled language, methods can’t be added at runtime. In Java in particular, classes can be created from mix-ins in a nested ClassContext
, but methods from these runtime types can’t be used in the main ClassContext
code because it has already been type-checked. Code that uses the new methods must be compiled after the mix-ins, which means that it must be compiled on the fly. In C++, a just-in-time compiler like Cling would be needed.
A library may be called compliant with awkward-array if it lacks the ability to add mix-in methods.
An important use of ObjectArray
and mix-in methods is StringArray
, which implements an array of strings as a JaggedArray
of CHARTYPE
, generating str
or bytes
objects upon extraction. It is important (for users) that the objects drawn from this array have the native string type of whichever language they’re using. It’s also important to have some vectorized methods, like dropping the last character of all strings (which can actually be a shift to the JaggedArray’s
stops array). StringArray
has its mix-in methods built-in, so it does not suffer the dynamic vs. static issue described above.
Although Numpy can store strings in arrays, its rectangular model requires strings to be padded to the length of the longest string in an array. StringArray
takes advantage of JaggedArray’s
efficient encoding of variable-length contents to store variable-length strings.
For a class to be eligible as a mix-in, it must not have an init
method and must not modify self
in any of its methods. Mix-ins can be added to a class by inheritance or to an instance (in Python) by changing an object’s class
attribute. Convenience functions are provided in Methods
, which is a container of static methods:
-
mixin(methods, awkwardtype)
: given amethods
class (the mix-ins) and anawkwardtype
(the awkward-array class object, likeJaggedArray
orObjectArray
), this returns an array class object with the methods added. This class object can be constructed like the corresponding awkward-array, or it may be assigned to an existing instance’sclass
attribute. -
maybemixin(samples, awkwardtype)
: given asamples
object (an array that might have mix-ins) or list (arrays that might have mix-ins) and anawkwardtype
(the awkward-array class object, likeJaggedArray
orObjectArray
), this returns an array class object with any mix-ins any of thesamples
might have (union of all mix-in methods, in Python subclassing order). It is used to transfer mix-in methods from one array to another.
Mix-in methods are automatically transferred in the following situations:
-
When processing a Numpy ufunc (or equivalent mapped kernel), which includes unary and binary operations like
+
and-
, all mix-in methods of the arguments are transferred to the output. -
When selecting a column from a
Table
, including selections through a nested contents (e.g.jaggedtable["x"]
), the mix-in methods of the table column apply to the output, but the mix-in methods of the original container (e.g.jaggedtable
) do not apply. -
When slicing, masking, or gathering through an array’s get-item (but not extracting!), the array’s mix-ins are retained in the output.
In all other operations, such as reductions and other methods, mix-ins are not carried through.
An ObjectArray
is defined by an array and a generator function with arguments. Below are their single-property validity conditions. The array may be generated from any Python iterable, with the default type chosen in the case of an empty iterable.
-
content
: any array (default is a basic array ofDEFAULTTYPE
). -
generator
: function that produces objecti
fromcontent[i]
. -
args
: a tuple of constant positional arguments to pass togenerator
. If not a tuple, it will be converted to(args,)
. -
kwargs
: a dict of constant keyword arguments to pass togenerator
. If not a dict, an error will be raised. The given dict is shallowly copied to avoid referencing issues. -
dims
: a positive integer (default is1
); the number of dimensions in theObjectArray
.
The whole-array validity conditions are:
-
dims
must be less than or equal tolen(content.shape)
.
The length of the ObjectArray
is the length of content
, and the shape of the ObjectArray
is content.shape[:dims]
.
When an object array myarray
is passed a selection
in square brackets, it obeys the usual rules for all operations except extraction: a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray
performs masking, and a 1-dimensional list or array of integers performs a gather operation. An integer, however, extracts from content
and calls
generator(content[i], *args, **kwargs)
on the result to return an output. If dims > 1
, the first dims - 1
elements of a given tuple are passed through content
(so that an ObjectArray
may be multidimensional) and then element dims - 1
of the tuple is run through the generator
function. Any remaining elements of a given tuple are applied to the output of that generator
.
For example,
class Point(object):
def __init__(self, row):
self.x, self.y = row["x"], row["y"]
def __repr__(self):
return "<Point {0} {1}>".format(self.x, self.y)
myarray = awkward.ObjectArray(awkward.Table(x=[1.1, 2.2, 3.3], y=[10, 20, 30]), Point)
myarray
# returns <ObjectArray [<Point 1.1 10> <Point 2.2 20> <Point 3.3 30>] at 7779705f4860>
myarray[1:]
# returns <ObjectArray [<Point 2.2 20> <Point 3.3 30>] at 7779705f49b0>
myarray[1]
# returns <Point 2.2 20>
myarray[1].y
# returns 20
If object arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is computed on the contents and the output is re-wrapped as object arrays. This might not be the intended semantics for the objects; if so, overload them with mix-in methods. (The mix-in should define array_ufunc
, described in the Numpy docs and as a NEP.)
Using the class from the previous example,
a = awkward.ObjectArray(awkward.Table(x=[1.1, 2.2, 3.3], y=[10, 20, 30]), Point)
b = awkward.ObjectArray(awkward.Table(x=[10, 20, 30], y=[100, 100, 100]), Point)
numpy.add(a, b)
# returns <ObjectArray [<Point 11.1 110> <Point 22.2 120> <Point 33.3 130>] at 7aea8ce5a358>
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been a + b
.
A StringArray
is an ObjectArray
with awkward.array.objects.StringMethods
mix-ins. Its content
is an internal JaggedArray
and it accepts JaggedArray
constructors. Its primary constructor parameters are:
-
starts
: same asJaggedArray.starts
except that it will apply to byte positions in `content. -
stops
: same asJaggedArray.stops
except that it will apply to byte positions in -
content
: same asJaggedArray.content
except that it will be viewed asCHARTYPE
. -
encoding
:None
(forbytes
) or an encoding name (forstr
). Default is"utf-8"
. This property must be assigned withNone
or an encoding name but its value isNone
or a decoder function fromcodecs.getdecoder
. (If the encoding name is not recognized, an error is raised.)
A StringArray
has the same whole-array validity conditions as JaggedArray
.
The length and shape of StringArray
are the length and shape of starts
.
StringArray
has the same alternate constructors as JaggedArray
: fromiter
, fromoffsets
, fromcounts
, fromparents
, fromuniques
, and fromjagged
, except that the content is always required to be or interpreted as CHARTYPE
. StringArray
additionally has the following constructors:
-
StringArray.fromstr(shape, string)
: duplicates a singlestr
orbytes
object to fill an array with a givenshape
(may be a non-negative integer). -
StringArray.fromnumpy(array)
: converts a Numpy string array into aStringArray
.
As an ObjectArray
with an implicit generator
of awkward.array.objects.tostring
an implicit args
of (encoding,)
, and an implicit dims
of len(starts.shape)
, a StringArray
returns a bytes
or str
for each item.
All Numpy ufuncs (or equivalent mapped functions) apply mathematical operations on the characters of the strings as though they were uint8
integers, except for equality tests (==
and !=
or numpy.equal
and numpy.not_equal
), which are overloaded in awkward.array.objects.StringMethods
to compute string equality.
Many array sources are non-contiguous, usually so that they can be read in releatively small, memory-friendly chunks (e.g. ROOT baskets or Parquet pages). However, a basic array library like Numpy expects its arrays to be fully contiguous in memory, and that can usually only be achieved by copying data.
However, just as we wrap arrays in classes to give them new logical structure, we can wrap a sequence of arrays as a ChunkedArray
to view it as though it were a concatenated version of those arrays. The arrays in the sequence all need to have the same high-level type, but they don’t all need to have the same low-level structure. Some may be basic arrays and others IndexedArrays
to correspond to pages that alternate between a simple encoding and a dictionary encoding. The high-level type of the ChunkedArray
is the same as the high-level type of its chunks.
To extract an element at index i
, it is necessary to know the length of all chunks up to and including the one in which index i
resides, but getting this information might be an expensive operation. Therefore, ChunkedArray
does not require this information up-front, but requests it and retains it as higher indexes are requested. Its string representations (str
and repr
in Python) only show the first few elements and not the last if not all of the counts are known.
A non-contiguous array interface makes it possible to efficiently append rows to an array. Instead of copying a whole array into a larger allocation with each append, we can allocate a chunk, fill it by writing to it and increasing its “end” pointer, then allocate a new chunk when it is full. Since we can address non-contiguous data as a single array, we never have to copy partial results to concatenate. AppendableArray
is an array with appendable rows, and is one of the only two mutable array types in awkward-arrays: AppendableArray
can add new rows and Table
can add, overwrite, and remove columns.
A ChunkedArray
is defined by a list of chunks
(arrays) and a list of counts
(non-negative integers). Below are their single-property validity conditions. The arrays in chunks
may be generated from any Python iterable, with default types chosen in the case of empty iterables.
-
chunks
: a Python list of any array (defaults are basic arrays ofDEFAULTTYPE
). -
counts
: a Python list of non-negative integers. Default is[]
.
The whole-array validity conditions are:
-
chunks
length must be greater than or equal tocounts
length. -
Each count (non-negative integer in
counts
) must be equal to the length of the corresponding chunk (item inchunks
). -
All non-empty
chunks
must have the same high-level type as the first non-empty chunk.
ChunkedArray
fills its counts
as they become known, strictly from first to last. As a public property, these are visible to the user. ChunkedArray
may also internally cache types as they become known (in any order), to avoid repeated queries.
A ChunkedArray
has the following read-only properties and methods:
-
countsknown
:True
ifcounts
has the same length aschunks
;False
otherwise. -
typesknown
:True
if all types are internally cached;False
otherwise. If aChunkedArray
does not cache types, this property may be omitted. -
knowcounts(until=None)
: request and cache the lengths ofchunks
up to and not includinguntil
, or up to the end ifuntil
isNone
. -
knowtype(at)
: request and cache the type of chunkat
. If aChunkedArray
does not cache types, this property may be omitted. -
global2chunkid(index, return_normalized=False)
: convert aChunkedArray
index to the chunk id in which it resides. (chunks[i]
is the chunk at idi
, etc.) Theindex
may be an integer or a 1-dimensional array of integers for a gather operation. Negative indexes are normalized to count from the end of theChunkedArray
. Ifreturn_normalized
isTrue
, the output is a 2-tuple: the chunk id and theindex
normalized to count from the end of theChunkedArray
. -
global2local(index)
: convert aChunkedArray
index to the corresponding chunk and its local index in the chunk. Theindex
may be an integer or a 1-dimensional array of integers for a gather operation. If so, then the chunk output is a Numpy object array of chunks. -
local2global(index, chunkid)
: convert a local chunk index and its chunk id to a globalChunkArray
index. Theindex
may be an integer or a 1-dimensional array of integers for a gather operation.
When a chunked array myarray
is passed a selection
in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray
performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections
are passed down to a nested Table
, if it exists.
Touching elements can affect which counts
are known and therefore the string representation of the array. For example,
myarray = awkward.ChunkedArray([[0, 1, 2], [], [3, 4], [5, 6, 7, 8], [9]])
myarray
# returns <ChunkedArray [0 1 2 3 4 5 6 ...] at 7f778daed7f0>
myarray[-1]
# returns 9
myarray
# returns <ChunkedArray [0 1 2 ... 7 8 9] at 7f778daed7f0>
If chunked arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is computed iteratively on chunk sizes determined by the first chunked array argument, and the return value is a ChunkedArray
with that structure.
For example (reusing myarray
from the previous section),
numpy.add(myarray, 0.1)
# returns <ChunkedArray [0.1 1.1 2.1 3.1 4.1 5.1 6.1 ...] at 7f778daeda20>
numpy.add(myarray, 0.1).chunks
# returns [array([0.1, 1.1, 2.1]), array([], dtype=float64), array([3.1, 4.1]),
# array([5.1, 6.1, 7.1, 8.1]), array([9.1])]
Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been myarray + 0.1
.
An AppendableArray
is a ChunkedArray
of primitive type that can be efficiently appended. Below are the single-property validity conditions. The arrays may be generated from any Python iterable, with dfault types chosen in the case of empty iterables.
-
chunkshape
: positive integer or a tuple of positive integers defining the allocated shape of each chunk. -
dtype
: Numpy dtype of the content. -
chunks
: a Python list of basic arrays (default typeDEFAULTTYPE
).
The counts
parameter is read-only and internally managed. In ChunkedArray
, the counts
must be exactly equal to the length of each chunk, but in AppendableArray
, the last count is less than or equal to the length of the last chunk because not all of the allocated chunk may be filled with valid data. Uninitialized data may be visible to the user through chunks[-1]
, but not through get-item and mapped kernels on the AppendableArray
itself.
The whole-array validity conditions are the same as for ChunkedArray
, except that counts
is not required to be equal to the length of each chunk.
AppendableArray
has the following special methods:
-
append(value)
: add one value at the end of the array. -
extend(values)
: add multiple values to the end of the array.
Often, datasets are too large to entirely load into memory or too large to load up-front. Many data-loading libraries offer the ability to load parts of a file or dataset as needed. However, the decisions about when to load data, how much to load, and what to cache are system-dependent, and we might instead want them to be encoded in the array structure itself, so awkward-array has a VirtualArray
class to represent an array that might or might not be in memory, but will be when asked.
Laziness and non-contiguousness are closely related. If a Table
is too big to load but its columns of interest are not, then we may want a Table
of VirtualArrays
, so that each entire column is loaded when touched. However, if a single column is too big to load, then delaying that operation with a VirtualArray
is not enough: we need a ChunkedArray
of VirtualArrays
to load chunks of rows at a time.
Laziness and caching are closely related. If all the data needed for a process is too large to hold in memory, then lazily loading each section and keeping it forever is not enough: we need the loaded data to be evicted when we’re done with it. If the VirtualArray
instance goes out of scope, then Python’s garbage collector does that automatically. If not, then the VirtualArray
must let its loaded data be managed by a cache with explicit eviction rules.
Most cache implementations in Python have a dict-like interface. If it is process-bound, then transient keys based on the Python id
of the VirtualArrays
. If it is not, then permanent identifiers must be assigned somehow.
If absolutely no caching is desired, then a Python MutableMapping with a do-nothing setitem
would act as an immediately forgetful cache (with transient keys).
A Dask delayed array is the equivalent of a ChunkedArray
of VirtualArrays
, for which all of the chunked array’s counts
are known.
A VirtualArray
is defined by a generating function, not any arrays. Below are the single-property validity conditions for all of its primary constructor arguments.
-
generator
: a callable that produces the array. It must accept arguments as given byargs
andkwargs
asgenerator(*args, **kwargs)
. -
args
(default()
): a tuple of arguments for thegenerator
. If not a tuple, it will be converted to(args,)
. -
kwargs
(default{}
): a dict of keyword arguments for thegenerator
. If not a dict, an error will be raised. The given dict is shallowly copied to avoid referencing issues. -
cache
(defaultNone
):None
for no cache or a dict-like object to use as a cache. -
persistentkey
(defaultNone
):None
to use transient keys in a cache or a string to use as a key in a persistent cache. -
type
(defaultNone
):None
or high-level type of the array to use before materializing it. IfNone
, any query that requires type knowledge, such as asking for the length of the array, would cause the array to be materialized. -
persistvirtual
(defaultTrue
): ifTrue
, persist this object as a virtual array, meaning that its data are not stored in the serialized form. If theVirtualArray
depends on the existence of a file at a given path, for instance, the serialized form can’t be deserialized on a system without that file at that path. IfFalse
, persist this object as a concrete array, so that everything needed to reconstruct the data is stored in the serialized form.
There are no whole-array validity conditions in the normal sense, but if the type
parameter is not None
and the materialized array has a different type, an error is raised at that time.
VirtualArray
has the following read-only properties and methods:
-
ismaterialized
:True
if the array has been loaded andFalse
if it has not. -
materialize()
: cause the array to be loaded.
If type
is None
, then attempts to get the VirtualArray
length, type, dtype, shape, etc. will cause the array to be materialized. In any case, an attempt to get-item or use the array in a Numpy ufunc (or equivalent mapped kernel) will cause the array to be materialized.
If cache
is None
, then the materialized array is internally cached in the VirtualArray
object itself. To delete the array, it would be necessary to delete the VirtualArray
.
If cache
is not None
and persistentkey
is None
, then the array is placed in the cache
and a VirtualArray.TransientKey
is used as the key. The transient key is guaranteed to be globally unique in the Python process as long as the VirtualArray
exists. If the VirtualArray
is deleted, its del
method attempts to delete its transient key from the cache
because its global uniqueness can no longer be guaranteed. However, this is fragile because the cache
might have been changed for another cache
, the del
method might not be called before another Python object uses the VirtualArray’s
Python id
, etc. Generally, transient keys should be used when the VirtualArray
objects are known to be long-lived. (If they are short-lived, setting cache
to None
and letting the Python garbage collector manage eviction would be a better policy.) If the cache
only accepts strings as keys, the VirtualArray.TransientKey
has a unique str
representation.
If cache
is not None
and persistentkey
is not None
, then persistentkey
will be used as the key for the cache. The burden of ensuring uniqueness is on the user, and the user will have to decide whether the key needs to be process-unique, machine-unique, or unique in some distributed sense.
VirtualArray
maintains an internal list of columns added, overwritten, or deleted to or from any internal Tables
. If the generated array is ever lost due to cache eviction and needs to be regenerated, these modifications will be replayed so that the apparent content maintains its state. Also, if persistvirtual
is True
and the generated array is not written to a serialized form, the modifications are written to the serialized form, and will be replayed when reconstructed from that serialized form.
An awkward-array library should provide a general facility for creating columnar arrays from row-wise data. This would often be the first step before processing. In a statically typed language, the array structure can be fully determined before filling, but in a dynamically typed language, the array types and nesting structure would have to be discovered while iterating over the data.
The interface has two entry points:
-
fromiter(iterable, **options)
: iterate over data and return an array representing all of the data in a columnar form. In a statically typed language, the type structure would have to be provided somehow, and this function would raise an error if the data do not conform to that type (if possible). -
fromiterchunks(iterable, chunksize, **options)
: iterate over data and yield arrays of lengthchunksize
or less for the last array. This function never yields an empty array, but may yield no arrays. If types are dynamically discovered, this function raises an error if a chunk’s type differs from the first chunk. (Thechunksize
must be chosen large enough to fully discover the type in the first chunk.) In a statically typed language, the same type considerations apply as above.
The following options
are recognized:
-
dictencoding
: boolean or a function returning boolean. IfTrue
, all arrays of bytes/strings are dictionary encoded as anIndexedArray
ofStringArray
. IfFalse
, all arrays of bytes/strings are simplyStringArray
. If a function, this function is called on the list of bytes/strings to determine if it should be dictionary encoded.
Types derived from the data resolve ambiguities by satisfying the following rules.
-
Boolean types are distinct from numbers, but mixed numeric types are resolved in favor of the most general number at a level of the hierarchy. (That is, a single floating point value will make all integers at the same nesting depth floating point.) Booleans and numbers at the same level of hierarchy would be represented by a
UnionArray
. -
In Python (which makes a distinction between raw bytes and strings with an encoding), bytes and strings are different types. Bytes and strings at the same level of hierarchy would be represented by a
UnionArray
. -
All lists are presumed to be variable-length (
JaggedArray
). -
A mapping type (Python dict) is represented as a
Table
, and mappings with different sets of field names are considered distinct. Mappings with the same field names but different field types are not distinct: they areTables
with someUnionArray
columns. Missing fields are different from fields withNone
values. Mappings must have string-typed keys: other types are not supported. -
An empty mapping,
{}
, is considered identical toNone
. -
Table
column names are sorted alphabetically. -
An object type may be represented as an
ObjectArray
or it may not be supported. The Numpy-only implementation generatesObjectArrays
from Python class instances and from namedtuples. -
Optional and sum types are in canonical form:
MaskedArrays
are not nested directly withinMaskedArrays
andUnionArrays
are not nested directly withinUnionArrays
. If a given level of hierarchy is both masked and heterogeneous, theMaskArray
is outside theUnionArray
and none of the union array’scontents
are masked. -
Tables
,ObjectArrays
, andUnionArrays
are masked withIndexedMaskedArray
, while all others are masked withMaskedArray
. -
If a type at some level of hierarchy cannot be determined (all lists at that level are empty), then the type is taken to be
DEFAULTTYPE
.
The purpose of awkward-array is to perform computations on complex data with an array programming idiom, not as a storage or interchange format to rival ROOT, Parquet, or Arrow. However, since the bulk data consist entirely of basic (Numpy) arrays, any storage format that can store a set of named arrays and the awkward metadata can store one or more awkward-arrays.
We define a protocol to store any awkward-array structure in any named array container so that users have a way to stash and share intermediate results and, hopefully, to communicate between awkward-array implementations. The definition consists of a JSON format with as little reference to Python specifics as possible, so that implementations of awkward-array in other languages would be able to share data. This sharing could be through pointer locations in a process, shared memory on a single operating system, files on disk, or a distributed network.
This persistence protocol does not define a method for storing named arrays. We assume that an existing mechanism, such as ROOT, Parquet, or Arrow, or ZIP, HDF5, named files, or an object store manage that. For complete generality, the “named arrays” do not need to keep array metadata (the way HDF5 files do), but can store only named binary blobs (the way ZIP files do). We only fill this container with array data, assign names to them, and additionally provide a JSON object that can be used to reverse the process.
The awkward-array library in Python uses this serialization mechanism to pickle array objects, save
them to and load
them from ZIP files, and to use HDF5 as a read-write database of awkward arrays. ZIP files of awkward-array objects are identified with file extension .awkd
.
The JSON object is a nested expression to be interpreted as a command, like a LISP S-expression. This provides the most flexibility for backward and forward compatibility. For security, deserialization takes a whitelist of allowed functions: maliciously constructed data cannot execute arbitrary functions, only the ones in the whitelist. If a new feature is added that requires the execution of previously unforeseen functions, users may need to manually expand the whitelist for forward compatibility (or they may completely open the whitelist if they trust the data source).
A single JSON object represents a single awkward-array. Multiple awkward-arrays may be saved in the same namespace if they each have a different prefix. In general, it would be the user’s responsibility to ensure that prefixes don’t overlap, but the Numpy-only implementation of awkward-array checks for name collisions when writing to ZIP files and separates namespaces with Groups in HDF5 files. Prefixes can also be filesystem paths up to a directory delimiter — arrays would not overlap if they are saved to different directories.
The top level of the JSON format defines the following fields. If a field has a default, it is optional. If it isn’t labeled with a default or as optional, it is required.
-
awkward
(required, but for information purposes only): the version of the awkward-array library that wrote this object. Only the existence of this field is checked as a format signature. -
doc
(optional, for information purposes only): should be a string -
metadata
(optional, for information purposes only): any JSON structure -
prefix
(default is""
): the prefix at the beginning of the name of each binary blob. -
schema
: the deserialization expression, which recursively defines the awkward-array to build.
If additional fields are filled, the object is not considered invalid, though it risks conflicts with future versions of this specification.
A deserialization expression is a JSON object whose type is determined by the presence of an identifying key. If additional fields beyond the ones described below are filled, the object is not considered invalid, though it risks conflicts with future versions of this specification. However, a deserialization expression must not include more than one identifying key in the same JSON object.
Every deserialization expression has an "id"
field, which is a non-negative integer. These identifiers are used by "ref"
to build cross-references and cyclic references into the arrays within the object. These identifiers are not guaranteed to be in order; a reference to an unrecognized identifier can be deserialized as a VirtualArray
to break a cyclic dependency.
Each of the sections below is labeled with the identifying key first.
The expression is a function to be called if its function specification matches a pattern in the whitelist.
-
call
: function specification, described below. -
args
(default is[]
): list of [deserialization expressions] to use as positional arguments. -
kwargs
(default is{}
): JSON object of [deserialization expressions] to use as keyword, argument pairs. -
cacheable
(default isfalse
): iftrue
, pass the cache as acache
keyword argument to the function. -
whitelistable
(default isfalse
): iftrue
, pass the whitelist as awhitelist
keyword argument to the function.
Functions are specified by a list of at least two strings. The first is a fully qualified module path and rest are objects or objects within objects in that module that terminate in a callable.
For example, ["awkward", "JaggedArray", "fromcounts"]
is the fromcounts
constructor of class JaggedArray
in module awkward
.
The expression is a binary blob to read with a given name.
-
read
: string; the name of the binary blob to read. -
absolute
(defaultfalse
): iftrue
, the name is used as-is, iffalse
, the global prefix must be prepended before the name.
The expression is a list of [deserialization expressions].
The expression is a tuple of [deserialization expressions].
The expression is a dict of name (string), deserialization expression pairs.
The expression is a list of name (string), deserialization expression pairs. It is not a JSON object so that the order is maintained.
The expression is a Numpy dtype, expressed as JSON. Numpy distinguishes between Python tuples and lists, so the following patterns must all be converted to tuples before passing the result to the numpy.dtype
constructor:
-
[..., integer]
-
[..., [all integers]]
-
[string, ...]
The expression is a function object to pass as an argument, not to call. The function specification is the same as in the "call"
type, though.
The function whitelist is globally defined in awkward.persist.whitelist
but it can also be passed into deserialization functions manually. The format is a list of function specifiers with glob-style wildcards. Function specifiers, as described in the "call"
type, are a fully qualified module name followed by a path of objects within objects leading to a callable, like
["awkward", "JaggedArray", "fromcounts"]
for the fromcounts
constructor of the JaggedArray
class in the awkward
module. Whitelist specifiers allow fnmatch wildcards, like
["awkward", "*Array", "from*"]
to allow any array type’s non-primary constructor. A single string is promoted to a specifier and a single specifier is promoted to a list of specifiers, so "*"
by itself is a valid whitelist for allowing any function to run (for trusted data).
A function name that satisfies any wildcard expression is allowed.
awkward.persist.topython
should not be in a default whitelist because unpickling untrusted data can call arbitrary Python functions.
An awkward-array library’s default whitelist is not defined in this specification.
When serializing, users have the option to compress basic (Numpy) arrays. (When deserializing, whatever decompression functions are found are executed, if they are in the whitelist.) Compression has more value for some kinds of arrays than others, so the decision to compress or not compress is parameterized as in a policy.
The default policy is globally defined in awkward.persist.compression
as a list of rules. Each rule has the following format:
-
minsize
: minimum size in bytes; if the basic array is smaller than this size, it is not compressed by this rule. -
types
: list of item types; if the basic array’s item type is not a subclass of one of these types, it is not compressed by this rule. -
contexts
: fnmatch wildcard string or list of such strings; if the basic array’s context (what parameter it belongs to in an awkward-array) does not match any of these patterns, it is not compressed by this rule. -
pair
: 2-tuple of compression function, decompression function specifier. The compression function is a Python callable, which turns a buffer into compressed bytes. The decompression function is a tuple of strings naming the module and object where the function may be found (as in the"call"
type). Whereas the compression function is needed right away, the decompression function need only be specified so that it can be called during deserialization. The compression and decompression functions should be strict inverses of one another, with no parameters needed except the buffer to compress or decompress.
A single pair
is promoted to a rule and a single rule is promoted to a list of rules, so it would be sufficient to pass compression=(zlib.compress, ("zlib", "decompress"))
to a serialization function. If the compression pair is in the awkward.persist.partner
dict, only the compression function is needed: compression=zlib.compress
.
serialize(obj, storage, name=None, delimiter="-", suffix=None, schemasuffix=None, compression=compression, **kwargs)
serializes obj
(an awkard-array) and puts all binary blobs into storage
using names derived from a name
prefix, separated by delimiter
. Binary blobs optionally may have a suffix
(such as ".raw"
) and the schema itself (also inserted into storage
as a binary blob) may have a schemasuffix
(such as ".json"
). The compression option is described above. No return value.
deserialize(storage, name="", whitelist=whitelist, cache=None)
returns an awkward-array from storage
using name
as a prefix and an exact name for finding the schema. The whitelist option is described above. If a cache
is passed, that cache is passed as an argument to every VirtualArray
.
save(file, array, name=None, mode="a", **options)
Save an array
(an awkward-array) into a file
specified as a name, a path, or as a file-like object. If it is a name that does not end in .awkd
, this suffix is appended. The mode
is passed to the zipfile.ZipFile
object; "a"
means append to an existing file or create a file. If there are name conflicts, an erorr is raised before writing anything to the file. No return value.
load(file, **options)
Open file
as a read-only dict-like object containing awkward-arrays. The arrays may be found by asking for the dict-like object’s keys
and extracted with get-item.
hdf5(group, **options)
Interpret an HDF5 file or group (from the h5py library) as containing awkward-arrays, rather than arrays. Low-level binary blobs are hidden in favor of logical awkward arrays. This object can be written to or read from as a dict-like object with get-item, set-item, and del-item.