nbodykit.base.catalog

Functions

column([name, is_default]) Decorator that defines the decorated function as a column in a CatalogSource.

Classes

CatalogSource(comm) An abstract base class representing a catalog of discrete particles.
CatalogSourceBase(comm) An abstract base class that implements most of the functionality in CatalogSource.
ColumnAccessor Provides access to a Column from a Catalog.
ColumnFinder(clsname, bases, attrs) A meta-class that will register all columns of a class that have been marked with the column() decorator.
class nbodykit.base.catalog.CatalogSource(comm)[source]

An abstract base class representing a catalog of discrete particles.

This objects behaves like a structured numpy array – it must have a well-defined size when initialized. The size here represents the number of particles in the source on the local rank.

The information about each particle is stored as a series of columns in the format of dask arrays. These columns can be accessed in a dict-like fashion.

All subclasses of this class contain the following default columns:

  1. Weight
  2. Value
  3. Selection

For a full description of these default columns, see the documentation.

Important

Subclasses of this class must set the _size attribute.

Parameters:

comm – the MPI communicator to use for this object

Attributes:
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Construct and return a hard-coded column.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, dataset, datasets, …]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
to_subvolumes([domain, position, columns]) Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
create_instance  
Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()[source]

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()[source]

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()[source]

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters:other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourcBase for attributes to be copied
Returns:return self, with the added attributes
Return type:CatalogSource
__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data
  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the revelant slice
  3. slice object specifying which particles to select
  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation
  • If the base attribute is set, columns will be returned from base instead of from self.
__len__()[source]

The local size of the CatalogSource on a given rank.

__setitem__(col, value)[source]

Add columns to the CatalogSource, overriding any existing columns with the name col.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to converts any dask arrays to numpy arrays.

. note::
If the base attribute is set, compute() will called using base instead of self.
Parameters:args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.
copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy no longer related to self.

Returns:a new CatalogSource that holds all of the data columns of self
Return type:CatalogSource
csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

Note

If the base attribute is set, get_hardcolumn() will called using base instead of self.

gslice(start, stop, end=1, redistribute=True)[source]

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters:
  • start (int) – the start index of the global slice
  • stop (int) – the stop index of the global slice
  • step (int, optional) – the default step size of the global size
  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice
hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by @column decorator. Subclasses may override this method and use get_hardcolumn() to bypass the decorator logic.

Note

If the base attribute is set, the value of base.hardcolumns will be returned.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters:array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object
Returns:a dask array initialized from array
Return type:dask.array.Array
read(columns)

Return the requested columns as dask arrays.

Parameters:columns (list of str) – the names of the requested columns
Returns:the list of column data, in the form of dask arrays
Return type:list of dask.array.Array
save(output, columns, dataset=None, datasets=None, header='Header')

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters:
  • output (str) – the name of the file to write to
  • columns (list of str) – the names of the columns to save in the file
  • dataset (str, optional) – dataset to store the columns under.
  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)
  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored if header is None, do not save the header.
size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)[source]

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be floating or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters:
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided
  • reverse (bool, optional) – if True, perform descending sort operations
  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource
to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters:
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs
  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs
  • dtype (string, optional) – the data type of the mesh array
  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh
  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme
  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods
  • weight (str, optional) – the name of the column specifying the weight for each particle
  • value (str, optional) – the name of the column specifying the field value for each particle
  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take
  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog
  • window (str, deprecated) – use resampler instead.
Returns:

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type:

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object. Using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters:
  • domain (:pyclass:`pmesh.domain.GridND` object, or None) – The domain to distribute the catalog. If None, try to evenly divide spatially. An easiest way to find a domain object is to use pm.domain, where pm is a :pyclass:`pmesh.pm.ParticleMesh` object.
  • position (string_like) – column to use to compute the position.
  • columns (list of string_like) – columns to include in the new catalog, if not supplied, all catalogs will be exchanged.
Returns:

A decomposed catalog source, where each rank only contains objects belongs to the rank as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type:

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters:type (Python type) – the desired class type of the returned object.
class nbodykit.base.catalog.CatalogSourceBase(comm)[source]

An abstract base class that implements most of the functionality in CatalogSource.

The main difference between this class and CatalogSource is that this base class does not assume the object has a size attribute.

Note

See the docstring for CatalogSource. Most often, users should implement custom sources as subclasses of CatalogSource.

The names of hard-coded columns, i.e., those defined through member functions of the class, are stored in the _defaults and _hardcolumns attributes. These attributes are computed by the ColumnFinder meta-class.

Parameters:

comm – the MPI communicator to use for this object

Attributes:
attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

Methods

compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Construct and return a hard-coded column.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, dataset, datasets, …]) Save the CatalogSource to a bigfile.BigFile.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
to_subvolumes([domain, position, columns]) Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
create_instance  
__delitem__(col)[source]

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)[source]

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters:other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourcBase for attributes to be copied
Returns:return self, with the added attributes
Return type:CatalogSource
__getitem__(sel)[source]

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data
  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the revelant slice
  3. slice object specifying which particles to select
  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation
  • If the base attribute is set, columns will be returned from base instead of from self.
__setitem__(col, value)[source]

Add new columns to the CatalogSource, overriding any existing columns with the name col.

Note

If the base attribute is set, columns will be added to base instead of to self.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)[source]

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to converts any dask arrays to numpy arrays.

. note::
If the base attribute is set, compute() will called using base instead of self.
Parameters:args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.
copy()[source]

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy no longer related to self.

Returns:a new CatalogSource that holds all of the data columns of self
Return type:CatalogSource
get_hardcolumn(col)[source]

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

Note

If the base attribute is set, get_hardcolumn() will called using base instead of self.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by @column decorator. Subclasses may override this method and use get_hardcolumn() to bypass the decorator logic.

Note

If the base attribute is set, the value of base.hardcolumns will be returned.

static make_column(array)[source]

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters:array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object
Returns:a dask array initialized from array
Return type:dask.array.Array
read(columns)[source]

Return the requested columns as dask arrays.

Parameters:columns (list of str) – the names of the requested columns
Returns:the list of column data, in the form of dask arrays
Return type:list of dask.array.Array
save(output, columns, dataset=None, datasets=None, header='Header')[source]

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters:
  • output (str) – the name of the file to write to
  • columns (list of str) – the names of the columns to save in the file
  • dataset (str, optional) – dataset to store the columns under.
  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)
  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored if header is None, do not save the header.
to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)[source]

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters:
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs
  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs
  • dtype (string, optional) – the data type of the mesh array
  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh
  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme
  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods
  • weight (str, optional) – the name of the column specifying the weight for each particle
  • value (str, optional) – the name of the column specifying the field value for each particle
  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take
  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog
  • window (str, deprecated) – use resampler instead.
Returns:

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type:

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)[source]

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object. Using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters:
  • domain (:pyclass:`pmesh.domain.GridND` object, or None) – The domain to distribute the catalog. If None, try to evenly divide spatially. An easiest way to find a domain object is to use pm.domain, where pm is a :pyclass:`pmesh.pm.ParticleMesh` object.
  • position (string_like) – column to use to compute the position.
  • columns (list of string_like) – columns to include in the new catalog, if not supplied, all catalogs will be exchanged.
Returns:

A decomposed catalog source, where each rank only contains objects belongs to the rank as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type:

CatalogSource

view(type=None)[source]

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters:type (Python type) – the desired class type of the returned object.
class nbodykit.base.catalog.ColumnAccessor[source]

Provides access to a Column from a Catalog.

This is a thin subclass of dask.array.Array to provide a reference to the catalog object, an additional attrs attribute (for recording the reproducible meta-data), and some pretty print support.

Due to particularity of dask, any transformation that is not explicitly in-place will return a dask.array.Array, and losing the pointer to the original catalog and the meta data attrs.

Parameters:
  • catalog (CatalogSource) – the catalog from which the column was accessed
  • daskarray (dask.array.Array) – the column in dask array form
  • is_default (bool, optional) – whether this column is a default column; default columns are not serialized to disk, as they are automatically available as columns
Attributes:
A
T
blocks

Slice an array by blocks

chunks
chunksize
dask
dtype
imag
itemsize

Length of one array element in bytes

name
nbytes

Number of bytes in array

ndim
npartitions
numblocks
real
shape
size

Number of elements in array

vindex

Vectorized indexing with broadcasting.

Methods

all([axis, out, keepdims]) Returns True if all elements evaluate to True.
any([axis, out, keepdims]) Returns True if any of the elements of a evaluate to True.
argmax([axis, out]) Return indices of the maximum values along the given axis.
argmin([axis, out]) Return indices of the minimum values along the given axis of a.
argtopk(k[, axis, split_every]) The indices of the top k elements of an array.
astype(dtype, **kwargs) Copy of the array, cast to a specified type.
choose(choices[, out, mode]) Use an index array to construct a new array from a set of choices.
clip([min, max, out]) Return an array whose values are limited to [min, max].
compute() Compute this dask collection
copy() Copy array.
cumprod(axis[, dtype, out]) See da.cumprod for docstring
cumsum(axis[, dtype, out]) See da.cumsum for docstring
dot(b[, out]) Dot product of two arrays.
flatten([order]) Return a flattened array.
map_blocks(*args, **kwargs) Map a function across all blocks of a dask array.
map_overlap(func, depth[, boundary, trim]) Map a function over blocks of the array with some overlap
max([axis, out, keepdims]) Return the maximum along a given axis.
mean([axis, dtype, out, keepdims]) Returns the average of the array elements along given axis.
min([axis, out, keepdims]) Return the minimum along a given axis.
moment(order[, axis, dtype, keepdims, ddof, …]) Calculate the nth centralized moment.
nonzero() Return the indices of the elements that are non-zero.
persist(**kwargs) Persist this dask collection into memory
prod([axis, dtype, out, keepdims]) Return the product of the array elements over the given axis
ravel([order]) Return a flattened array.
rechunk(chunks[, threshold, block_size_limit]) See da.rechunk for docstring
repeat(repeats[, axis]) Repeat elements of an array.
reshape(shape[, order]) Returns an array containing the same data with a new shape.
round([decimals, out]) Return a with each element rounded to the given number of decimals.
squeeze([axis]) Remove single-dimensional entries from the shape of a.
std([axis, dtype, out, ddof, keepdims]) Returns the standard deviation of the array elements along given axis.
store(targets[, lock, regions, compute, …]) Store dask arrays in array-like objects, overwrite data in target
sum([axis, dtype, out, keepdims]) Return the sum of the array elements over the given axis.
swapaxes(axis1, axis2) Return a view of the array with axis1 and axis2 interchanged.
to_dask_dataframe([columns, index]) Convert dask Array to dask Dataframe
to_delayed([optimize_graph]) Convert into an array of dask.delayed objects, one per chunk.
to_hdf5(filename, datapath, **kwargs) Store array in HDF5 file
to_zarr(*args, **kwargs) Save array to the zarr storage format
topk(k[, axis, split_every]) The top k elements of an array.
transpose(*axes) Returns a view of the array with axes transposed.
var([axis, dtype, out, ddof, keepdims]) Returns the variance of the array elements, along given axis.
view(dtype[, order]) Get a view of the array as a new data type
visualize([filename, format, optimize_graph]) Render the computation of this object’s task graph using graphviz.
as_daskarray  
conj  
static __dask_optimize__(dsk, keys, **kwargs)[source]

Notes

The dask default optimizer induces too many (unnecesarry) IO calls – we turn this off feature off by default, and only apply a culling.

__repr__()
>>> import dask.array as da
>>> da.ones((10, 10), chunks=(5, 5), dtype='i4')
dask.array<..., shape=(10, 10), dtype=int32, chunksize=(5, 5)>
all(axis=None, out=None, keepdims=False)

Returns True if all elements evaluate to True.

Refer to numpy.all for full documentation.

See also

numpy.all()
equivalent function
any(axis=None, out=None, keepdims=False)

Returns True if any of the elements of a evaluate to True.

Refer to numpy.any for full documentation.

See also

numpy.any()
equivalent function
argmax(axis=None, out=None)

Return indices of the maximum values along the given axis.

Refer to numpy.argmax for full documentation.

See also

numpy.argmax()
equivalent function
argmin(axis=None, out=None)

Return indices of the minimum values along the given axis of a.

Refer to numpy.argmin for detailed documentation.

See also

numpy.argmin()
equivalent function
argtopk(k, axis=-1, split_every=None)

The indices of the top k elements of an array.

See da.argtopk for docstring

astype(dtype, **kwargs)

Copy of the array, cast to a specified type.

Parameters:
  • dtype (str or dtype) – Typecode or data-type to which the array is cast.
  • casting ({'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional) –

    Controls what kind of data casting may occur. Defaults to ‘unsafe’ for backwards compatibility.

    • ’no’ means the data types should not be cast at all.
    • ’equiv’ means only byte-order changes are allowed.
    • ’safe’ means only casts which can preserve values are allowed.
    • ’same_kind’ means only safe casts or casts within a kind,
      like float64 to float32, are allowed.
    • ’unsafe’ means any data conversions may be done.
  • copy (bool, optional) – By default, astype always returns a newly allocated array. If this is set to False and the dtype requirement is satisfied, the input array is returned instead of a copy.
blocks

Slice an array by blocks

This allows blockwise slicing of a Dask array. You can perform normal Numpy-style slicing but now rather than slice elements of the array you slice along blocks so, for example, x.blocks[0, ::2] produces a new dask array with every other block in the first row of blocks.

You can index blocks in any way that could index a numpy array of shape equal to the number of blocks in each dimension, (available as array.numblocks). The dimension of the output array will be the same as the dimension of this array, even if integer indices are passed. This does not support slicing with np.newaxis or multiple lists.

Examples

>>> import dask.array as da
>>> x = da.arange(10, chunks=2)
>>> x.blocks[0].compute()
array([0, 1])
>>> x.blocks[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.blocks[::2].compute()
array([0, 1, 4, 5, 8, 9])
>>> x.blocks[[-1, 0]].compute()
array([8, 9, 0, 1])
Returns:
Return type:A Dask array
choose(choices, out=None, mode='raise')

Use an index array to construct a new array from a set of choices.

Refer to numpy.choose for full documentation.

See also

numpy.choose()
equivalent function
clip(min=None, max=None, out=None)

Return an array whose values are limited to [min, max]. One of max or min must be given.

Refer to numpy.clip for full documentation.

See also

numpy.clip()
equivalent function
compute()[source]

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a numpy.array() and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters:
  • scheduler (string, optional) – Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
  • optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
  • kwargs – Extra keywords to forward to the scheduler function.

See also

dask.base.compute()

copy()

Copy array. This is a no-op for dask.arrays, which are immutable

cumprod(axis, dtype=None, out=None)

See da.cumprod for docstring

cumsum(axis, dtype=None, out=None)

See da.cumsum for docstring

dot(b, out=None)

Dot product of two arrays.

Refer to numpy.dot for full documentation.

See also

numpy.dot()
equivalent function

Examples

>>> a = np.eye(2)  # doctest: +SKIP
>>> b = np.ones((2, 2)) * 2  # doctest: +SKIP
>>> a.dot(b)  # doctest: +SKIP
array([[ 2.,  2.],
       [ 2.,  2.]])

This array method can be conveniently chained:

>>> a.dot(b).dot(b)  # doctest: +SKIP
array([[ 8.,  8.],
       [ 8.,  8.]])
flatten([order])

Return a flattened array.

Refer to numpy.ravel for full documentation.

See also

numpy.ravel()
equivalent function
ndarray.flat()
a flat iterator on the array.
itemsize

Length of one array element in bytes

map_blocks(*args, **kwargs)

Map a function across all blocks of a dask array.

Parameters:
  • func (callable) – Function to apply to every block in the array.
  • args (dask arrays or other objects) –
  • dtype (np.dtype, optional) – The dtype of the output array. It is recommended to provide this. If not provided, will be inferred by applying the function to a small set of fake data.
  • chunks (tuple, optional) – Chunk shape of resulting blocks if the function does not preserve shape. If not provided, the resulting array is assumed to have the same block structure as the first input array.
  • drop_axis (number or iterable, optional) – Dimensions lost by the function.
  • new_axis (number or iterable, optional) – New dimensions created by the function. Note that these are applied after drop_axis (if present).
  • token (string, optional) – The key prefix to use for the output array. If not provided, will be determined from the function name.
  • name (string, optional) – The key name to use for the output array. Note that this fully specifies the output key name, and must be unique. If not provided, will be determined by a hash of the arguments.
  • **kwargs – Other keyword arguments to pass to function. Values must be constants (not dask.arrays)

Examples

>>> import dask.array as da
>>> x = da.arange(6, chunks=3)
>>> x.map_blocks(lambda x: x * 2).compute()
array([ 0,  2,  4,  6,  8, 10])

The da.map_blocks function can also accept multiple arrays.

>>> d = da.arange(5, chunks=2)
>>> e = da.arange(5, chunks=2)
>>> f = map_blocks(lambda a, b: a + b**2, d, e)
>>> f.compute()
array([ 0,  2,  6, 12, 20])

If the function changes shape of the blocks then you must provide chunks explicitly.

>>> y = x.map_blocks(lambda x: x[::2], chunks=((2, 2),))

You have a bit of freedom in specifying chunks. If all of the output chunk sizes are the same, you can provide just that chunk size as a single tuple.

>>> a = da.arange(18, chunks=(6,))
>>> b = a.map_blocks(lambda x: x[:3], chunks=(3,))

If the function changes the dimension of the blocks you must specify the created or destroyed dimensions.

>>> b = a.map_blocks(lambda x: x[None, :, None], chunks=(1, 6, 1),
...                  new_axis=[0, 2])

Map_blocks aligns blocks by block positions without regard to shape. In the following example we have two arrays with the same number of blocks but with different shape and chunk sizes.

>>> x = da.arange(1000, chunks=(100,))
>>> y = da.arange(100, chunks=(10,))

The relevant attribute to match is numblocks.

>>> x.numblocks
(10,)
>>> y.numblocks
(10,)

If these match (up to broadcasting rules) then we can map arbitrary functions across blocks

>>> def func(a, b):
...     return np.array([a.max(), b.max()])
>>> da.map_blocks(func, x, y, chunks=(2,), dtype='i8')
dask.array<func, shape=(20,), dtype=int64, chunksize=(2,)>
>>> _.compute()
array([ 99,   9, 199,  19, 299,  29, 399,  39, 499,  49, 599,  59, 699,
        69, 799,  79, 899,  89, 999,  99])

Your block function get information about where it is in the array by accepting a special block_info keyword argument.

>>> def func(block, block_info=None):
...     pass

This will receive the following information:

>>> block_info  # doctest: +SKIP
{0: {'shape': (1000,),
     'num-chunks': (10,),
     'chunk-location': (4,),
     'array-location': [(400, 500)]}}

For each argument and keyword arguments that are dask arrays (the positions of which are the first index), you will receive the shape of the full array, the number of chunks of the full array in each dimension, the chunk location (for example the fourth chunk over in the first dimension), and the array location (for example the slice corresponding to 40:50).

You may specify the key name prefix of the resulting task in the graph with the optional token keyword argument.

>>> x.map_blocks(lambda x: x + 1, name='increment')  # doctest: +SKIP
dask.array<increment, shape=(100,), dtype=int64, chunksize=(10,)>
map_overlap(func, depth, boundary=None, trim=True, **kwargs)

Map a function over blocks of the array with some overlap

We share neighboring zones between blocks of the array, then map a function, then trim away the neighboring strips.

Parameters:
  • func (function) – The function to apply to each extended block
  • depth (int, tuple, or dict) – The number of elements that each block should share with its neighbors If a tuple or dict then this can be different per axis
  • boundary (str, tuple, dict) – How to handle the boundaries. Values include ‘reflect’, ‘periodic’, ‘nearest’, ‘none’, or any constant value like 0 or np.nan
  • trim (bool) – Whether or not to trim depth elements from each block after calling the map function. Set this to False if your mapping function already does this for you
  • **kwargs – Other keyword arguments valid in map_blocks

Examples

>>> x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
>>> x = from_array(x, chunks=5)
>>> def derivative(x):
...     return x - np.roll(x, 1)
>>> y = x.map_overlap(derivative, depth=1, boundary=0)
>>> y.compute()
array([ 1,  0,  1,  1,  0,  0, -1, -1,  0])
>>> import dask.array as da
>>> x = np.arange(16).reshape((4, 4))
>>> d = da.from_array(x, chunks=(2, 2))
>>> d.map_overlap(lambda x: x + x.size, depth=1).compute()
array([[16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
>>> func = lambda x: x + x.size
>>> depth = {0: 1, 1: 1}
>>> boundary = {0: 'reflect', 1: 'none'}
>>> d.map_overlap(func, depth, boundary).compute()  # doctest: +NORMALIZE_WHITESPACE
array([[12,  13,  14,  15],
       [16,  17,  18,  19],
       [20,  21,  22,  23],
       [24,  25,  26,  27]])
max(axis=None, out=None, keepdims=False)

Return the maximum along a given axis.

Refer to numpy.amax for full documentation.

See also

numpy.amax()
equivalent function
mean(axis=None, dtype=None, out=None, keepdims=False)

Returns the average of the array elements along given axis.

Refer to numpy.mean for full documentation.

See also

numpy.mean()
equivalent function
min(axis=None, out=None, keepdims=False)

Return the minimum along a given axis.

Refer to numpy.amin for full documentation.

See also

numpy.amin()
equivalent function
moment(order, axis=None, dtype=None, keepdims=False, ddof=0, split_every=None, out=None)

Calculate the nth centralized moment.

Parameters:
  • order (int) – Order of the moment that is returned, must be >= 2.
  • axis (int, optional) – Axis along which the central moment is computed. The default is to compute the moment of the flattened array.
  • dtype (data-type, optional) – Type to use in computing the moment. For arrays of integer type the default is float64; for arrays of float types it is the same as the array type.
  • keepdims (bool, optional) – If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original array.
  • ddof (int, optional) – “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is zero.
Returns:

moment

Return type:

ndarray

References

[1]Pebay, Philippe (2008), “Formulas for Robust, One-Pass Parallel

Computation of Covariances and Arbitrary-Order Statistical Moments” (PDF), Technical Report SAND2008-6212, Sandia National Laboratories

nbytes

Number of bytes in array

nonzero()

Return the indices of the elements that are non-zero.

Refer to numpy.nonzero for full documentation.

See also

numpy.nonzero()
equivalent function
persist(**kwargs)

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The action of function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, such as is the case of the dask.distributed scheduler, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However if the task scheduler only supports blocking computation then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters:
  • scheduler (string, optional) – Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
  • optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
  • **kwargs – Extra keywords to forward to the scheduler function.
Returns:

Return type:

New dask collections backed by in-memory data

See also

dask.base.persist()

prod(axis=None, dtype=None, out=None, keepdims=False)

Return the product of the array elements over the given axis

Refer to numpy.prod for full documentation.

See also

numpy.prod()
equivalent function
ravel([order])

Return a flattened array.

Refer to numpy.ravel for full documentation.

See also

numpy.ravel()
equivalent function
ndarray.flat()
a flat iterator on the array.
rechunk(chunks, threshold=None, block_size_limit=None)

See da.rechunk for docstring

repeat(repeats, axis=None)

Repeat elements of an array.

Refer to numpy.repeat for full documentation.

See also

numpy.repeat()
equivalent function
reshape(shape, order='C')

Returns an array containing the same data with a new shape.

Refer to numpy.reshape for full documentation.

See also

numpy.reshape()
equivalent function

Notes

Unlike the free function numpy.reshape, this method on ndarray allows the elements of the shape parameter to be passed in as separate arguments. For example, a.reshape(10, 11) is equivalent to a.reshape((10, 11)).

round(decimals=0, out=None)

Return a with each element rounded to the given number of decimals.

Refer to numpy.around for full documentation.

See also

numpy.around()
equivalent function
size

Number of elements in array

squeeze(axis=None)

Remove single-dimensional entries from the shape of a.

Refer to numpy.squeeze for full documentation.

See also

numpy.squeeze()
equivalent function
std(axis=None, dtype=None, out=None, ddof=0, keepdims=False)

Returns the standard deviation of the array elements along given axis.

Refer to numpy.std for full documentation.

See also

numpy.std()
equivalent function
store(targets, lock=True, regions=None, compute=True, return_stored=False, **kwargs)

Store dask arrays in array-like objects, overwrite data in target

This stores dask arrays into object that supports numpy-style setitem indexing. It stores values chunk by chunk so that it does not have to fill up memory. For best performance you can align the block size of the storage target with the block size of your array.

If your data fits in memory then you may prefer calling np.array(myarray) instead.

Parameters:
  • sources (Array or iterable of Arrays) –
  • targets (array-like or Delayed or iterable of array-likes and/or Delayeds) – These should support setitem syntax target[10:20] = ...
  • lock (boolean or threading.Lock, optional) – Whether or not to lock the data stores while storing. Pass True (lock each file individually), False (don’t lock) or a particular threading.Lock object to be shared among all writes.
  • regions (tuple of slices or iterable of tuple of slices) – Each region tuple in regions should be such that target[region].shape = source.shape for the corresponding source and target in sources and targets, respectively.
  • compute (boolean, optional) – If true compute immediately, return dask.delayed.Delayed otherwise
  • return_stored (boolean, optional) – Optionally return the stored result (default False).

Examples

>>> x = ...  # doctest: +SKIP
>>> import h5py  # doctest: +SKIP
>>> f = h5py.File('myfile.hdf5')  # doctest: +SKIP
>>> dset = f.create_dataset('/data', shape=x.shape,
...                                  chunks=x.chunks,
...                                  dtype='f8')  # doctest: +SKIP
>>> store(x, dset)  # doctest: +SKIP

Alternatively store many arrays at the same time

>>> store([x, y, z], [dset1, dset2, dset3])  # doctest: +SKIP
sum(axis=None, dtype=None, out=None, keepdims=False)

Return the sum of the array elements over the given axis.

Refer to numpy.sum for full documentation.

See also

numpy.sum()
equivalent function
swapaxes(axis1, axis2)

Return a view of the array with axis1 and axis2 interchanged.

Refer to numpy.swapaxes for full documentation.

See also

numpy.swapaxes()
equivalent function
to_dask_dataframe(columns=None, index=None)

Convert dask Array to dask Dataframe

Parameters:
  • columns (list or string) – list of column names if DataFrame, single string if Series
  • index (dask.dataframe.Index, optional) –

    An optional dask Index to use for the output Series or DataFrame.

    The default output index depends on whether the array has any unknown chunks. If there are any unknown chunks, the output has None for all the divisions (one per chunk). If all the chunks are known, a default index with known divsions is created.

    Specifying index can be useful if you’re conforming a Dask Array to an existing dask Series or DataFrame, and you would like the indices to match.

See also

dask.dataframe.from_dask_array()

to_delayed(optimize_graph=True)

Convert into an array of dask.delayed objects, one per chunk.

Parameters:optimize_graph (bool, optional) – If True [default], the graph is optimized before converting into dask.delayed objects.

See also

dask.array.from_delayed()

to_hdf5(filename, datapath, **kwargs)

Store array in HDF5 file

>>> x.to_hdf5('myfile.hdf5', '/x')  # doctest: +SKIP

Optionally provide arguments as though to h5py.File.create_dataset

>>> x.to_hdf5('myfile.hdf5', '/x', compression='lzf', shuffle=True)  # doctest: +SKIP

See also

da.store(), h5py.File.create_dataset()

to_zarr(*args, **kwargs)

Save array to the zarr storage format

See https://zarr.readthedocs.io for details about the format.

See function to_zarr() for parameters.

topk(k, axis=-1, split_every=None)

The top k elements of an array.

See da.topk for docstring

transpose(*axes)

Returns a view of the array with axes transposed.

For a 1-D array, this has no effect. (To change between column and row vectors, first cast the 1-D array into a matrix object.) For a 2-D array, this is the usual matrix transpose. For an n-D array, if axes are given, their order indicates how the axes are permuted (see Examples). If axes are not provided and a.shape = (i[0], i[1], ... i[n-2], i[n-1]), then a.transpose().shape = (i[n-1], i[n-2], ... i[1], i[0]).

Parameters:axes (None, tuple of ints, or n ints) –
  • None or no argument: reverses the order of the axes.
  • tuple of ints: i in the j-th place in the tuple means a’s i-th axis becomes a.transpose()’s j-th axis.
  • n ints: same as an n-tuple of the same ints (this form is intended simply as a “convenience” alternative to the tuple form)
Returns:out – View of a, with axes suitably permuted.
Return type:ndarray

See also

ndarray.T()
Array property returning the array transposed.

Examples

>>> a = np.array([[1, 2], [3, 4]])  # doctest: +SKIP
>>> a  # doctest: +SKIP
array([[1, 2],
       [3, 4]])
>>> a.transpose()  # doctest: +SKIP
array([[1, 3],
       [2, 4]])
>>> a.transpose((1, 0))  # doctest: +SKIP
array([[1, 3],
       [2, 4]])
>>> a.transpose(1, 0)  # doctest: +SKIP
array([[1, 3],
       [2, 4]])
var(axis=None, dtype=None, out=None, ddof=0, keepdims=False)

Returns the variance of the array elements, along given axis.

Refer to numpy.var for full documentation.

See also

numpy.var()
equivalent function
view(dtype, order='C')

Get a view of the array as a new data type

Parameters:
  • dtype – The dtype by which to view the array
  • order (string) – ‘C’ or ‘F’ (Fortran) ordering
  • reinterprets the bytes of the array under a new dtype. If that (This) –
  • does not have the same size as the original array then the shape (dtype) –
  • change. (will) –
  • that both numpy and dask.array can behave oddly when taking (Beware) –
  • views of arrays under Fortran ordering. Under some (shape-changing) –
  • of NumPy this function will fail when taking shape-changing (versions) –
  • of Fortran ordered arrays if the first dimension has chunks of (views) –
  • one. (size) –
vindex

Vectorized indexing with broadcasting.

This is equivalent to numpy’s advanced indexing, using arrays that are broadcast against each other. This allows for pointwise indexing:

>>> x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> x = from_array(x, chunks=2)
>>> x.vindex[[0, 1, 2], [0, 1, 2]].compute()
array([1, 5, 9])

Mixed basic/advanced indexing with slices/arrays is also supported. The order of dimensions in the result follows those proposed for ndarray.vindex [1]_: the subspace spanned by arrays is followed by all slices.

Note: vindex provides more general functionality than standard indexing, but it also has fewer optimizations and can be significantly slower.

_[1]: https://github.com/numpy/numpy/pull/6256

visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:
  • filename (str or None, optional) – The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
  • format ({'png', 'pdf', 'dot', 'svg', 'jpeg', 'jpg'}, optional) – Format in which to write output file. Default is ‘png’.
  • optimize_graph (bool, optional) – If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
  • color ({None, 'order'}, optional) – Options to color nodes. Provide cmap= keyword for additional colormap
  • **kwargs – Additional keyword arguments to forward to to_graphviz.

Examples

>>> x.visualize(filename='dask.pdf')  # doctest: +SKIP
>>> x.visualize(filename='dask.pdf', color='order')  # doctest: +SKIP
Returns:result – See dask.dot.dot_graph for more information.
Return type:IPython.diplay.Image, IPython.display.SVG, or None

See also

dask.base.visualize(), dask.dot.dot_graph()

Notes

For more information on optimization see here:

https://docs.dask.org/en/latest/optimize.html

class nbodykit.base.catalog.ColumnFinder(clsname, bases, attrs)[source]

A meta-class that will register all columns of a class that have been marked with the column() decorator.

This adds the following attributes to the class definition:

1. _defaults : default columns, specified by passing default=True to the column() decorator. 2. _hardcolumns : non-default, hard-coded columns

Note

This is a subclass of abc.ABCMeta so subclasses can define abstract properties, if they need to.

Methods

__call__($self, /, *args, **kwargs) Call self as a function.
mro() return a type’s method resolution order
register(subclass) Register a virtual subclass of an ABC.
__instancecheck__(instance)

Override for isinstance(instance, cls).

__subclasscheck__(subclass)

Override for issubclass(subclass, cls).

mro() → list

return a type’s method resolution order

register(subclass)

Register a virtual subclass of an ABC.

Returns the subclass, to allow usage as a class decorator.

nbodykit.base.catalog.column(name=None, is_default=False)[source]

Decorator that defines the decorated function as a column in a CatalogSource.

This can be used as a decorator with or without arguments. If no name is specified, the function name is used.

Parameters:
  • name (str, optional) – the name of the column; if not provided, the name of the function being decorated is used
  • is_default (bool, optional) – whether the column is a default column; default columns are not serialized to disk