dask.array
¶nbodykit uses the dask package to store the columns
in CatalogSource
objects. The dask
package implements
a dask.array.Array
object that mimics that interface of the more
familiar numpy array. In this section, we describe what exactly a dask array
is and how it is used in nbodykit.
In nbodykit, the dask array object is a data container that behaves nearly
identical to a numpy array, except for one key difference. When performing
manipulations on a numpy array, the operations are performed immediately.
This is not the case for dask arrays. Instead, dask arrays store these
operations in a task graph and only evaluate the operations when the user
specifies to via a call to a compute()
function. When using
nbodykit, often the first task in this graph is loading data from disk.
Thus, dask provides nbodykit with on-demand IO functionality, allowing the user
to control when data is read from disk.
It is useful to describe a bit more about the nuts and bolts of the dask array to illustrate its full power. The dask array object cuts up the full array into many smaller arrays and performs calculations on these smaller “chunks”. This allows array computations to be performed on large data that does not fit into memory (but can be stored on disk). Particularly useful on laptops and other systems with limited memory, it extends the maximum size of useable datasets from the size of memory to the size of the disk storage. For further details, please see the introduction to the dask array in the dask documentation.
The dask array functionality is best illustrated by example. Here, we
initialize a UniformCatalog
that generates objects with uniformly distributed position and velocity columns.
In [1]: from nbodykit.lab import UniformCatalog
In [2]: cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
In [3]: print(cat)
UniformCatalog(size=96, seed=42)
In [4]: print(cat['Position'])