nbodykit.base.catalog

Functions

column([name, is_default])

Decorator that defines the decorated function as a column in a CatalogSource.

Classes

CatalogSource(comm)

An abstract base class representing a catalog of discrete particles.

CatalogSourceBase(comm)

An abstract base class that implements most of the functionality in CatalogSource.

ColumnAccessor(catalog, daskarray[, is_default])

Provides access to a Column from a Catalog.

ColumnFinder(name, bases, namespace, **kwargs)

A meta-class that will register all columns of a class that have been marked with the column() decorator.

class nbodykit.base.catalog.CatalogSource(comm)[source]

An abstract base class representing a catalog of discrete particles.

This objects behaves like a structured numpy array – it must have a well-defined size when initialized. The size here represents the number of particles in the source on the local rank.

The information about each particle is stored as a series of columns in the format of dask arrays. These columns can be accessed in a dict-like fashion.

All subclasses of this class contain the following default columns:

  1. Weight

  2. Value

  3. Selection

For a full description of these default columns, see the documentation.

Important

Subclasses of this class must set the _size attribute.

Parameters

comm – the MPI communicator to use for this object

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Construct and return a hard-coded column.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, …])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, …])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a “view” of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()[source]

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()[source]

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()[source]

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourcBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the revelant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__len__()[source]

The local size of the CatalogSource on a given rank.

__setitem__(col, value)[source]

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to converts any dask arrays to numpy arrays.

. note::

If the base attribute is set, compute() will called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

Note

If the base attribute is set, get_hardcolumn() will called using base instead of self.

gslice(start, stop, end=1, redistribute=True)[source]

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the default step size of the global size

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice

property hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by @column decorator. Subclasses may override this method and use get_hardcolumn() to bypass the decorator logic.

Note

If the base attribute is set, the value of base.hardcolumns will be returned.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

persist(columns=None)[source]

Return a CatalogSource, where the selected columns are computed and persist in memory.

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored if header is None, do not save the header.

  • compute (boolean, default True) – if True, wait till the store operations finish if False, return a dictionary with column name and a future object for the store. use dask.compute() to wait for the store operations on the result.

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)[source]

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be floating or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object. Using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – The domain to distribute the catalog. If None, try to evenly divide spatially. An easiest way to find a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – column to use to compute the position.

  • columns (list of string_like) – columns to include in the new catalog, if not supplied, all catalogs will be exchanged.

Returns

A decomposed catalog source, where each rank only contains objects belongs to the rank as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.base.catalog.CatalogSourceBase(comm)[source]

An abstract base class that implements most of the functionality in CatalogSource.

The main difference between this class and CatalogSource is that this base class does not assume the object has a size attribute.

Note

See the docstring for CatalogSource. Most often, users should implement custom sources as subclasses of CatalogSource.

The names of hard-coded columns, i.e., those defined through member functions of the class, are stored in the _defaults and _hardcolumns attributes. These attributes are computed by the ColumnFinder meta-class.

Parameters

comm – the MPI communicator to use for this object

Attributes
attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

Methods

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Construct and return a hard-coded column.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, …])

Save the CatalogSource to a bigfile.BigFile.

to_mesh([Nmesh, BoxSize, dtype, interlaced, …])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a “view” of the CatalogSource object, with the returned type set by type.

create_instance

__delitem__(col)[source]

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)[source]

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourcBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)[source]

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the revelant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__setitem__(col, value)[source]

Add new columns to the CatalogSource, overriding any existing columns with the name col.

Note

If the base attribute is set, columns will be added to base instead of to self.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s defintion and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)[source]

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to converts any dask arrays to numpy arrays.

. note::

If the base attribute is set, compute() will called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()[source]

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

get_hardcolumn(col)[source]

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

Note

If the base attribute is set, get_hardcolumn() will called using base instead of self.

property hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by @column decorator. Subclasses may override this method and use get_hardcolumn() to bypass the decorator logic.

Note

If the base attribute is set, the value of base.hardcolumns will be returned.

static make_column(array)[source]

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

read(columns)[source]

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)[source]

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored if header is None, do not save the header.

  • compute (boolean, default True) – if True, wait till the store operations finish if False, return a dictionary with column name and a future object for the store. use dask.compute() to wait for the store operations on the result.

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)[source]

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)[source]

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object. Using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – The domain to distribute the catalog. If None, try to evenly divide spatially. An easiest way to find a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – column to use to compute the position.

  • columns (list of string_like) – columns to include in the new catalog, if not supplied, all catalogs will be exchanged.

Returns

A decomposed catalog source, where each rank only contains objects belongs to the rank as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)[source]

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.base.catalog.ColumnAccessor(catalog, daskarray, is_default=False)[source]

Provides access to a Column from a Catalog.

This is a thin subclass of dask.array.Array to provide a reference to the catalog object, an additional attrs attribute (for recording the reproducible meta-data), and some pretty print support.

Due to particularity of dask, any transformation that is not explicitly in-place will return a dask.array.Array, and losing the pointer to the original catalog and the meta data attrs.

Parameters
  • catalog (CatalogSource) – the catalog from which the column was accessed

  • daskarray (dask.array.Array) – the column in dask array form

  • is_default (bool, optional) – whether this column is a default column; default columns are not serialized to disk, as they are automatically available as columns

Attributes
A
T
blocks

Slice an array by blocks

chunks

Chunks property.

chunksize
dask
dtype
imag
itemsize

Length of one array element in bytes

name
nbytes

Number of bytes in array

ndim
npartitions
numblocks
partitions

Slice an array by partitions.

real
shape
size

Number of elements in array

vindex

Vectorized indexing with broadcasting.

Methods

all([axis, out, keepdims])

This docstring was copied from numpy.ndarray.all.

any([axis, out, keepdims])

This docstring was copied from numpy.ndarray.any.

argmax([axis, out])

This docstring was copied from numpy.ndarray.argmax.

argmin([axis, out])

This docstring was copied from numpy.ndarray.argmin.

argtopk(k[, axis, split_every])

The indices of the top k elements of an array.

astype(dtype, **kwargs)

Copy of the array, cast to a specified type.

choose(choices[, out, mode])

This docstring was copied from numpy.ndarray.choose.

clip([min, max, out])

This docstring was copied from numpy.ndarray.clip.

compute()

Compute this dask collection

compute_chunk_sizes()

Compute the chunk sizes for a Dask array.

copy()

Copy array.

cumprod([axis, dtype, out])

This docstring was copied from numpy.ndarray.cumprod.

cumsum([axis, dtype, out])

This docstring was copied from numpy.ndarray.cumsum.

dot(b[, out])

This docstring was copied from numpy.ndarray.dot.

flatten([order])

This docstring was copied from numpy.ndarray.ravel.

map_blocks(*args[, name, token, dtype, …])

Map a function across all blocks of a dask array.

map_overlap(func, depth[, boundary, trim])

Map a function over blocks of the array with some overlap

max([axis, out, keepdims, initial, where])

This docstring was copied from numpy.ndarray.max.

mean([axis, dtype, out, keepdims])

This docstring was copied from numpy.ndarray.mean.

min([axis, out, keepdims, initial, where])

This docstring was copied from numpy.ndarray.min.

moment(order[, axis, dtype, keepdims, ddof, …])

Calculate the nth centralized moment.

nonzero()

This docstring was copied from numpy.ndarray.nonzero.

persist(**kwargs)

Persist this dask collection into memory

prod([axis, dtype, out, keepdims, initial, …])

This docstring was copied from numpy.ndarray.prod.

ravel([order])

This docstring was copied from numpy.ndarray.ravel.

rechunk([chunks, threshold, …])

See da.rechunk for docstring

repeat(repeats[, axis])

This docstring was copied from numpy.ndarray.repeat.

reshape(shape[, order])

This docstring was copied from numpy.ndarray.reshape.

round([decimals, out])

This docstring was copied from numpy.ndarray.round.

squeeze([axis])

This docstring was copied from numpy.ndarray.squeeze.

std([axis, dtype, out, ddof, keepdims])

This docstring was copied from numpy.ndarray.std.

store(targets[, lock, regions, compute, …])

Store dask arrays in array-like objects, overwrite data in target

sum([axis, dtype, out, keepdims, initial, where])

This docstring was copied from numpy.ndarray.sum.

swapaxes(axis1, axis2)

This docstring was copied from numpy.ndarray.swapaxes.

to_dask_dataframe([columns, index, meta])

Convert dask Array to dask Dataframe

to_delayed([optimize_graph])

Convert into an array of dask.delayed objects, one per chunk.

to_hdf5(filename, datapath, **kwargs)

Store array in HDF5 file

to_svg([size])

Convert chunks from Dask Array into an SVG Image

to_tiledb(uri, *args, **kwargs)

Save array to the TileDB storage manager

to_zarr(*args, **kwargs)

Save array to the zarr storage format

topk(k[, axis, split_every])

The top k elements of an array.

trace([offset, axis1, axis2, dtype, out])

This docstring was copied from numpy.ndarray.trace.

transpose(*axes)

This docstring was copied from numpy.ndarray.transpose.

var([axis, dtype, out, ddof, keepdims])

This docstring was copied from numpy.ndarray.var.

view([dtype, order])

Get a view of the array as a new data type

visualize([filename, format, optimize_graph])

Render the computation of this object’s task graph using graphviz.

as_daskarray

conj

static __dask_optimize__(dsk, keys, **kwargs)[source]

Optimize the dask object.

Note

The dask default optimizer induces too many (unnecesarry) IO calls. We turn this feature off by default, and only apply a culling.

__repr__()
>>> import dask.array as da
>>> da.ones((10, 10), chunks=(5, 5), dtype='i4')
dask.array<..., shape=(10, 10), dtype=int32, chunksize=(5, 5), chunktype=numpy.ndarray>
all(axis=None, out=None, keepdims=False)

This docstring was copied from numpy.ndarray.all.

Some inconsistencies with the Dask version may exist.

Returns True if all elements evaluate to True.

Refer to numpy.all for full documentation.

See also

numpy.all

equivalent function

any(axis=None, out=None, keepdims=False)

This docstring was copied from numpy.ndarray.any.

Some inconsistencies with the Dask version may exist.

Returns True if any of the elements of a evaluate to True.

Refer to numpy.any for full documentation.

See also

numpy.any

equivalent function

argmax(axis=None, out=None)

This docstring was copied from numpy.ndarray.argmax.

Some inconsistencies with the Dask version may exist.

Return indices of the maximum values along the given axis.

Refer to numpy.argmax for full documentation.

See also

numpy.argmax

equivalent function

argmin(axis=None, out=None)

This docstring was copied from numpy.ndarray.argmin.

Some inconsistencies with the Dask version may exist.

Return indices of the minimum values along the given axis of a.

Refer to numpy.argmin for detailed documentation.

See also

numpy.argmin

equivalent function

argtopk(k, axis=- 1, split_every=None)

The indices of the top k elements of an array.

See da.argtopk for docstring

astype(dtype, **kwargs)

Copy of the array, cast to a specified type.

Parameters
  • dtype (str or dtype) – Typecode or data-type to which the array is cast.

  • casting ({'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional) –

    Controls what kind of data casting may occur. Defaults to ‘unsafe’ for backwards compatibility.

    • ’no’ means the data types should not be cast at all.

    • ’equiv’ means only byte-order changes are allowed.

    • ’safe’ means only casts which can preserve values are allowed.

    • ’same_kind’ means only safe casts or casts within a kind,

      like float64 to float32, are allowed.

    • ’unsafe’ means any data conversions may be done.

  • copy (bool, optional) – By default, astype always returns a newly allocated array. If this is set to False and the dtype requirement is satisfied, the input array is returned instead of a copy.

property blocks

Slice an array by blocks

This allows blockwise slicing of a Dask array. You can perform normal Numpy-style slicing but now rather than slice elements of the array you slice along blocks so, for example, x.blocks[0, ::2] produces a new dask array with every other block in the first row of blocks.

You can index blocks in any way that could index a numpy array of shape equal to the number of blocks in each dimension, (available as array.numblocks). The dimension of the output array will be the same as the dimension of this array, even if integer indices are passed. This does not support slicing with np.newaxis or multiple lists.

Examples

>>> import dask.array as da
>>> x = da.arange(10, chunks=2)
>>> x.blocks[0].compute()
array([0, 1])
>>> x.blocks[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.blocks[::2].compute()
array([0, 1, 4, 5, 8, 9])
>>> x.blocks[[-1, 0]].compute()
array([8, 9, 0, 1])
Returns

Return type

A Dask array

choose(choices, out=None, mode='raise')

This docstring was copied from numpy.ndarray.choose.

Some inconsistencies with the Dask version may exist.

Use an index array to construct a new array from a set of choices.

Refer to numpy.choose for full documentation.

See also

numpy.choose

equivalent function

property chunks

Chunks property.

clip(min=None, max=None, out=None, **kwargs)

This docstring was copied from numpy.ndarray.clip.

Some inconsistencies with the Dask version may exist.

Return an array whose values are limited to [min, max]. One of max or min must be given.

Refer to numpy.clip for full documentation.

See also

numpy.clip

equivalent function

compute()[source]

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters
  • scheduler (string, optional) – Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

  • optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

  • kwargs – Extra keywords to forward to the scheduler function.

See also

dask.base.compute

compute_chunk_sizes()

Compute the chunk sizes for a Dask array. This is especially useful when the chunk sizes are unknown (e.g., when indexing one Dask array with another).

Notes

This function modifies the Dask array in-place.

Examples

>>> import dask.array as da
>>> import numpy as np
>>> x = da.from_array([-2, -1, 0, 1, 2], chunks=2)
>>> x.chunks
((2, 2, 1),)
>>> y = x[x <= 0]
>>> y.chunks
((nan, nan, nan),)
>>> y.compute_chunk_sizes()  # in-place computation
dask.array<getitem, shape=(3,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
>>> y.chunks
((2, 1, 0),)
copy()

Copy array. This is a no-op for dask.arrays, which are immutable

cumprod(axis=None, dtype=None, out=None)

This docstring was copied from numpy.ndarray.cumprod.

Some inconsistencies with the Dask version may exist.

Dask added an additional keyword-only argument method.

method{‘sequential’, ‘blelloch’}, optional

Choose which method to use to perform the cumprod. Default is ‘sequential’.

  • ‘sequential’ performs the cumprod of each prior block before the current block.

  • ‘blelloch’ is a work-efficient parallel cumprod. It exposes parallelism by first taking the product of each block and combines the products via a binary tree. This method may be faster or more memory efficient depending on workload, scheduler, and hardware. More benchmarking is necessary.

Return the cumulative product of the elements along the given axis.

Refer to numpy.cumprod for full documentation.

See also

numpy.cumprod

equivalent function

cumsum(axis=None, dtype=None, out=None)

This docstring was copied from numpy.ndarray.cumsum.

Some inconsistencies with the Dask version may exist.

Dask added an additional keyword-only argument method.

method{‘sequential’, ‘blelloch’}, optional

Choose which method to use to perform the cumsum. Default is ‘sequential’.

  • ‘sequential’ performs the cumsum of each prior block before the current block.

  • ‘blelloch’ is a work-efficient parallel cumsum. It exposes parallelism by first taking the sum of each block and combines the sums via a binary tree. This method may be faster or more memory efficient depending on workload, scheduler, and hardware. More benchmarking is necessary.

Return the cumulative sum of the elements along the given axis.

Refer to numpy.cumsum for full documentation.

See also

numpy.cumsum

equivalent function

dot(b, out=None)

This docstring was copied from numpy.ndarray.dot.

Some inconsistencies with the Dask version may exist.

Dot product of two arrays.

Refer to numpy.dot for full documentation.

See also

numpy.dot

equivalent function

Examples

>>> a = np.eye(2)  
>>> b = np.ones((2, 2)) * 2  
>>> a.dot(b)  
array([[2.,  2.],
       [2.,  2.]])

This array method can be conveniently chained:

>>> a.dot(b).dot(b)  
array([[8.,  8.],
       [8.,  8.]])
flatten([order])

This docstring was copied from numpy.ndarray.ravel.

Some inconsistencies with the Dask version may exist.

Return a flattened array.

Refer to numpy.ravel for full documentation.

See also

numpy.ravel

equivalent function

ndarray.flat

a flat iterator on the array.

property itemsize

Length of one array element in bytes

map_blocks(*args, name=None, token=None, dtype=None, chunks=None, drop_axis=[], new_axis=None, meta=None, **kwargs)

Map a function across all blocks of a dask array.

Note that map_blocks will attempt to automatically determine the output array type by calling func on 0-d versions of the inputs. Please refer to the meta keyword argument below if you expect that the function will not succeed when operating on 0-d arrays.

Parameters
  • func (callable) – Function to apply to every block in the array.

  • args (dask arrays or other objects) –

  • dtype (np.dtype, optional) – The dtype of the output array. It is recommended to provide this. If not provided, will be inferred by applying the function to a small set of fake data.

  • chunks (tuple, optional) – Chunk shape of resulting blocks if the function does not preserve shape. If not provided, the resulting array is assumed to have the same block structure as the first input array.

  • drop_axis (number or iterable, optional) – Dimensions lost by the function.

  • new_axis (number or iterable, optional) – New dimensions created by the function. Note that these are applied after drop_axis (if present).

  • token (string, optional) – The key prefix to use for the output array. If not provided, will be determined from the function name.

  • name (string, optional) – The key name to use for the output array. Note that this fully specifies the output key name, and must be unique. If not provided, will be determined by a hash of the arguments.

  • meta (array-like, optional) – The meta of the output array, when specified is expected to be an array of the same type and dtype of that returned when calling .compute() on the array returned by this function. When not provided, meta will be inferred by applying the function to a small set of fake data, usually a 0-d array. It’s important to ensure that func can successfully complete computation without raising exceptions when 0-d is passed to it, providing meta will be required otherwise. If the output type is known beforehand (e.g., np.ndarray, cupy.ndarray), an empty array of such type dtype can be passed, for example: meta=np.array((), dtype=np.int32).

  • **kwargs – Other keyword arguments to pass to function. Values must be constants (not dask.arrays)

See also

dask.array.blockwise

Generalized operation with control over block alignment.

Examples

>>> import dask.array as da
>>> x = da.arange(6, chunks=3)
>>> x.map_blocks(lambda x: x * 2).compute()
array([ 0,  2,  4,  6,  8, 10])

The da.map_blocks function can also accept multiple arrays.

>>> d = da.arange(5, chunks=2)
>>> e = da.arange(5, chunks=2)
>>> f = da.map_blocks(lambda a, b: a + b**2, d, e)
>>> f.compute()
array([ 0,  2,  6, 12, 20])

If the function changes shape of the blocks then you must provide chunks explicitly.

>>> y = x.map_blocks(lambda x: x[::2], chunks=((2, 2),))

You have a bit of freedom in specifying chunks. If all of the output chunk sizes are the same, you can provide just that chunk size as a single tuple.

>>> a = da.arange(18, chunks=(6,))
>>> b = a.map_blocks(lambda x: x[:3], chunks=(3,))

If the function changes the dimension of the blocks you must specify the created or destroyed dimensions.

>>> b = a.map_blocks(lambda x: x[None, :, None], chunks=(1, 6, 1),
...                  new_axis=[0, 2])

If chunks is specified but new_axis is not, then it is inferred to add the necessary number of axes on the left.

Map_blocks aligns blocks by block positions without regard to shape. In the following example we have two arrays with the same number of blocks but with different shape and chunk sizes.

>>> x = da.arange(1000, chunks=(100,))
>>> y = da.arange(100, chunks=(10,))

The relevant attribute to match is numblocks.

>>> x.numblocks
(10,)
>>> y.numblocks
(10,)

If these match (up to broadcasting rules) then we can map arbitrary functions across blocks

>>> def func(a, b):
...     return np.array([a.max(), b.max()])
>>> da.map_blocks(func, x, y, chunks=(2,), dtype='i8')
dask.array<func, shape=(20,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
>>> _.compute()
array([ 99,   9, 199,  19, 299,  29, 399,  39, 499,  49, 599,  59, 699,
        69, 799,  79, 899,  89, 999,  99])

Your block function get information about where it is in the array by accepting a special block_info or block_id keyword argument.

>>> def func(block_info=None):
...     pass

This will receive the following information:

>>> block_info  
{0: {'shape': (1000,),
     'num-chunks': (10,),
     'chunk-location': (4,),
     'array-location': [(400, 500)]},
 None: {'shape': (1000,),
        'num-chunks': (10,),
        'chunk-location': (4,),
        'array-location': [(400, 500)],
        'chunk-shape': (100,),
        'dtype': dtype('float64')}}

For each argument and keyword arguments that are dask arrays (the positions of which are the first index), you will receive the shape of the full array, the number of chunks of the full array in each dimension, the chunk location (for example the fourth chunk over in the first dimension), and the array location (for example the slice corresponding to 40:50). The same information is provided for the output, with the key None, plus the shape and dtype that should be returned.

These features can be combined to synthesize an array from scratch, for example:

>>> def func(block_info=None):
...     loc = block_info[None]['array-location'][0]
...     return np.arange(loc[0], loc[1])
>>> da.map_blocks(func, chunks=((4, 4),), dtype=np.float_)
dask.array<func, shape=(8,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray>
>>> _.compute()
array([0, 1, 2, 3, 4, 5, 6, 7])

block_id is similar to block_info but contains only the chunk_location:

>>> def func(block_id=None):
...     pass

This will receive the following information:

>>> block_id  
(4, 3)

You may specify the key name prefix of the resulting task in the graph with the optional token keyword argument.

>>> x.map_blocks(lambda x: x + 1, name='increment')  
dask.array<increment, shape=(100,), dtype=int64, chunksize=(10,), chunktype=numpy.ndarray>

For functions that may not handle 0-d arrays, it’s also possible to specify meta with an empty array matching the type of the expected result. In the example below, func will result in an IndexError when computing meta:

>>> da.map_blocks(lambda x: x[2], da.random.random(5), meta=np.array(()))
dask.array<lambda, shape=(5,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>

Similarly, it’s possible to specify a non-NumPy array to meta, and provide a dtype:

>>> import cupy  
>>> rs = da.random.RandomState(RandomState=cupy.random.RandomState)  
>>> dt = np.float32
>>> da.map_blocks(lambda x: x[2], rs.random(5, dtype=dt), meta=cupy.array((), dtype=dt))  
dask.array<lambda, shape=(5,), dtype=float32, chunksize=(5,), chunktype=cupy.ndarray>
map_overlap(func, depth, boundary=None, trim=True, **kwargs)

Map a function over blocks of the array with some overlap

We share neighboring zones between blocks of the array, then map a function, then trim away the neighboring strips.

Note that this function will attempt to automatically determine the output array type before computing it, please refer to the meta keyword argument in map_blocks if you expect that the function will not succeed when operating on 0-d arrays.

Parameters
  • func (function) – The function to apply to each extended block

  • depth (int, tuple, or dict) – The number of elements that each block should share with its neighbors If a tuple or dict then this can be different per axis

  • boundary (str, tuple, dict) – How to handle the boundaries. Values include ‘reflect’, ‘periodic’, ‘nearest’, ‘none’, or any constant value like 0 or np.nan

  • trim (bool) – Whether or not to trim depth elements from each block after calling the map function. Set this to False if your mapping function already does this for you

  • **kwargs – Other keyword arguments valid in map_blocks

Examples

>>> import dask.array as da
>>> x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
>>> x = da.from_array(x, chunks=5)
>>> def derivative(x):
...     return x - np.roll(x, 1)
>>> y = x.map_overlap(derivative, depth=1, boundary=0)
>>> y.compute()
array([ 1,  0,  1,  1,  0,  0, -1, -1,  0])
>>> import dask.array as da
>>> x = np.arange(16).reshape((4, 4))
>>> d = da.from_array(x, chunks=(2, 2))
>>> d.map_overlap(lambda x: x + x.size, depth=1).compute()
array([[16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
>>> func = lambda x: x + x.size
>>> depth = {0: 1, 1: 1}
>>> boundary = {0: 'reflect', 1: 'none'}
>>> d.map_overlap(func, depth, boundary).compute()  
array([[12,  13,  14,  15],
       [16,  17,  18,  19],
       [20,  21,  22,  23],
       [24,  25,  26,  27]])
>>> x = np.arange(16).reshape((4, 4))
>>> d = da.from_array(x, chunks=(2, 2))
>>> y = d.map_overlap(lambda x: x + x[2], depth=1, meta=np.array(()))
>>> y
dask.array<_trim, shape=(4, 4), dtype=float64, chunksize=(2, 2), chunktype=numpy.ndarray>
>>> y.compute()
array([[ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [20, 22, 24, 26],
       [24, 26, 28, 30]])
>>> import cupy  
>>> x = cupy.arange(16).reshape((5, 4))  
>>> d = da.from_array(x, chunks=(2, 2))  
>>> y = d.map_overlap(lambda x: x + x[2], depth=1, meta=cupy.array(()))  
>>> y  
dask.array<_trim, shape=(4, 4), dtype=float64, chunksize=(2, 2), chunktype=cupy.ndarray>
>>> y.compute()  
array([[ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [20, 22, 24, 26],
       [24, 26, 28, 30]])
max(axis=None, out=None, keepdims=False, initial=<no value>, where=True)

This docstring was copied from numpy.ndarray.max.

Some inconsistencies with the Dask version may exist.

Return the maximum along a given axis.

Refer to numpy.amax for full documentation.

See also

numpy.amax

equivalent function

mean(axis=None, dtype=None, out=None, keepdims=False)

This docstring was copied from numpy.ndarray.mean.

Some inconsistencies with the Dask version may exist.

Returns the average of the array elements along given axis.

Refer to numpy.mean for full documentation.

See also

numpy.mean

equivalent function

min(axis=None, out=None, keepdims=False, initial=<no value>, where=True)

This docstring was copied from numpy.ndarray.min.

Some inconsistencies with the Dask version may exist.

Return the minimum along a given axis.

Refer to numpy.amin for full documentation.

See also

numpy.amin

equivalent function

moment(order, axis=None, dtype=None, keepdims=False, ddof=0, split_every=None, out=None)

Calculate the nth centralized moment.

Parameters
  • order (int) – Order of the moment that is returned, must be >= 2.

  • axis (int, optional) – Axis along which the central moment is computed. The default is to compute the moment of the flattened array.

  • dtype (data-type, optional) – Type to use in computing the moment. For arrays of integer type the default is float64; for arrays of float types it is the same as the array type.

  • keepdims (bool, optional) – If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original array.

  • ddof (int, optional) – “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is zero.

Returns

moment

Return type

ndarray

References

1

Pebay, Philippe (2008), “Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments”, Technical Report SAND2008-6212, Sandia National Laboratories.

property nbytes

Number of bytes in array

nonzero()

This docstring was copied from numpy.ndarray.nonzero.

Some inconsistencies with the Dask version may exist.

Return the indices of the elements that are non-zero.

Refer to numpy.nonzero for full documentation.

See also

numpy.nonzero

equivalent function

property partitions

Slice an array by partitions. Alias of dask array .blocks attribute.

This alias allows you to write agnostic code that works with both dask arrays and dask dataframes.

This allows blockwise slicing of a Dask array. You can perform normal Numpy-style slicing but now rather than slice elements of the array you slice along blocks so, for example, x.blocks[0, ::2] produces a new dask array with every other block in the first row of blocks.

You can index blocks in any way that could index a numpy array of shape equal to the number of blocks in each dimension, (available as array.numblocks). The dimension of the output array will be the same as the dimension of this array, even if integer indices are passed. This does not support slicing with np.newaxis or multiple lists.

Examples

>>> import dask.array as da
>>> x = da.arange(10, chunks=2)
>>> x.partitions[0].compute()
array([0, 1])
>>> x.partitions[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.partitions[::2].compute()
array([0, 1, 4, 5, 8, 9])
>>> x.partitions[[-1, 0]].compute()
array([8, 9, 0, 1])
>>> all(x.partitions[:].compute() == x.blocks[:].compute())
True
Returns

Return type

A Dask array

persist(**kwargs)

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The action of function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, such as is the case of the dask.distributed scheduler, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However if the task scheduler only supports blocking computation then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters
  • scheduler (string, optional) – Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

  • optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

  • **kwargs – Extra keywords to forward to the scheduler function.

Returns

Return type

New dask collections backed by in-memory data

See also

dask.base.persist

prod(axis=None, dtype=None, out=None, keepdims=False, initial=1, where=True)

This docstring was copied from numpy.ndarray.prod.

Some inconsistencies with the Dask version may exist.

Return the product of the array elements over the given axis

Refer to numpy.prod for full documentation.

See also

numpy.prod

equivalent function

ravel([order])

This docstring was copied from numpy.ndarray.ravel.

Some inconsistencies with the Dask version may exist.

Return a flattened array.

Refer to numpy.ravel for full documentation.

See also

numpy.ravel

equivalent function

ndarray.flat

a flat iterator on the array.

rechunk(chunks='auto', threshold=None, block_size_limit=None, balance=False)

See da.rechunk for docstring

repeat(repeats, axis=None)

This docstring was copied from numpy.ndarray.repeat.

Some inconsistencies with the Dask version may exist.

Repeat elements of an array.

Refer to numpy.repeat for full documentation.

See also

numpy.repeat

equivalent function

reshape(shape, order='C')

This docstring was copied from numpy.ndarray.reshape.

Some inconsistencies with the Dask version may exist.

Note

See dask.array.reshape() for an explanation of the merge_chunks keyword.

Returns an array containing the same data with a new shape.

Refer to numpy.reshape for full documentation.

See also

numpy.reshape

equivalent function

Notes

Unlike the free function numpy.reshape, this method on ndarray allows the elements of the shape parameter to be passed in as separate arguments. For example, a.reshape(10, 11) is equivalent to a.reshape((10, 11)).

round(decimals=0, out=None)

This docstring was copied from numpy.ndarray.round.

Some inconsistencies with the Dask version may exist.

Return a with each element rounded to the given number of decimals.

Refer to numpy.around for full documentation.

See also

numpy.around

equivalent function

size

Number of elements in array

squeeze(axis=None)

This docstring was copied from numpy.ndarray.squeeze.

Some inconsistencies with the Dask version may exist.

Remove single-dimensional entries from the shape of a.

Refer to numpy.squeeze for full documentation.

See also

numpy.squeeze

equivalent function

std(axis=None, dtype=None, out=None, ddof=0, keepdims=False)

This docstring was copied from numpy.ndarray.std.

Some inconsistencies with the Dask version may exist.

Returns the standard deviation of the array elements along given axis.

Refer to numpy.std for full documentation.

See also

numpy.std

equivalent function

store(targets, lock=True, regions=None, compute=True, return_stored=False, **kwargs)

Store dask arrays in array-like objects, overwrite data in target

This stores dask arrays into object that supports numpy-style setitem indexing. It stores values chunk by chunk so that it does not have to fill up memory. For best performance you can align the block size of the storage target with the block size of your array.

If your data fits in memory then you may prefer calling np.array(myarray) instead.

Parameters
  • sources (Array or iterable of Arrays) –

  • targets (array-like or Delayed or iterable of array-likes and/or Delayeds) – These should support setitem syntax target[10:20] = ...

  • lock (boolean or threading.Lock, optional) – Whether or not to lock the data stores while storing. Pass True (lock each file individually), False (don’t lock) or a particular threading.Lock object to be shared among all writes.

  • regions (tuple of slices or list of tuples of slices) – Each region tuple in regions should be such that target[region].shape = source.shape for the corresponding source and target in sources and targets, respectively. If this is a tuple, the contents will be assumed to be slices, so do not provide a tuple of tuples.

  • compute (boolean, optional) – If true compute immediately, return dask.delayed.Delayed otherwise

  • return_stored (boolean, optional) – Optionally return the stored result (default False).

Examples

>>> x = ...  
>>> import h5py  
>>> f = h5py.File('myfile.hdf5', mode='a')  
>>> dset = f.create_dataset('/data', shape=x.shape,
...                                  chunks=x.chunks,
...                                  dtype='f8')  
>>> store(x, dset)  

Alternatively store many arrays at the same time

>>> store([x, y, z], [dset1, dset2, dset3])  
sum(axis=None, dtype=None, out=None, keepdims=False, initial=0, where=True)

This docstring was copied from numpy.ndarray.sum.

Some inconsistencies with the Dask version may exist.

Return the sum of the array elements over the given axis.

Refer to numpy.sum for full documentation.

See also

numpy.sum

equivalent function

swapaxes(axis1, axis2)

This docstring was copied from numpy.ndarray.swapaxes.

Some inconsistencies with the Dask version may exist.

Return a view of the array with axis1 and axis2 interchanged.

Refer to numpy.swapaxes for full documentation.

See also

numpy.swapaxes

equivalent function

to_dask_dataframe(columns=None, index=None, meta=None)

Convert dask Array to dask Dataframe

Parameters
  • columns (list or string) – list of column names if DataFrame, single string if Series

  • index (dask.dataframe.Index, optional) –

    An optional dask Index to use for the output Series or DataFrame.

    The default output index depends on whether the array has any unknown chunks. If there are any unknown chunks, the output has None for all the divisions (one per chunk). If all the chunks are known, a default index with known divsions is created.

    Specifying index can be useful if you’re conforming a Dask Array to an existing dask Series or DataFrame, and you would like the indices to match.

  • meta (object, optional) – An optional meta parameter can be passed for dask to specify the concrete dataframe type to use for partitions of the Dask dataframe. By default, pandas DataFrame is used.

to_delayed(optimize_graph=True)

Convert into an array of dask.delayed objects, one per chunk.

Parameters

optimize_graph (bool, optional) – If True [default], the graph is optimized before converting into dask.delayed objects.

to_hdf5(filename, datapath, **kwargs)

Store array in HDF5 file

>>> x.to_hdf5('myfile.hdf5', '/x')  

Optionally provide arguments as though to h5py.File.create_dataset

>>> x.to_hdf5('myfile.hdf5', '/x', compression='lzf', shuffle=True)  

See also

da.store, h5py.File.create_dataset

to_svg(size=500)

Convert chunks from Dask Array into an SVG Image

Parameters
  • chunks (tuple) –

  • size (int) – Rough size of the image

Examples

>>> x.to_svg(size=500)  
Returns

text

Return type

An svg string depicting the array as a grid of chunks

to_tiledb(uri, *args, **kwargs)

Save array to the TileDB storage manager

See function to_tiledb() for argument documentation.

See https://docs.tiledb.io for details about the format and engine.

to_zarr(*args, **kwargs)

Save array to the zarr storage format

See https://zarr.readthedocs.io for details about the format.

See function to_zarr() for parameters.

topk(k, axis=- 1, split_every=None)

The top k elements of an array.

See da.topk for docstring

trace(offset=0, axis1=0, axis2=1, dtype=None, out=None)

This docstring was copied from numpy.ndarray.trace.

Some inconsistencies with the Dask version may exist.

Return the sum along diagonals of the array.

Refer to numpy.trace for full documentation.

See also

numpy.trace

equivalent function

transpose(*axes)

This docstring was copied from numpy.ndarray.transpose.

Some inconsistencies with the Dask version may exist.

Returns a view of the array with axes transposed.

For a 1-D array this has no effect, as a transposed vector is simply the same vector. To convert a 1-D array into a 2D column vector, an additional dimension must be added. np.atleast2d(a).T achieves this, as does a[:, np.newaxis]. For a 2-D array, this is a standard matrix transpose. For an n-D array, if axes are given, their order indicates how the axes are permuted (see Examples). If axes are not provided and a.shape = (i[0], i[1], ... i[n-2], i[n-1]), then a.transpose().shape = (i[n-1], i[n-2], ... i[1], i[0]).

Parameters

axes (None, tuple of ints, or n ints) –

  • None or no argument: reverses the order of the axes.

  • tuple of ints: i in the j-th place in the tuple means a’s i-th axis becomes a.transpose()’s j-th axis.

  • n ints: same as an n-tuple of the same ints (this form is intended simply as a “convenience” alternative to the tuple form)

Returns

out – View of a, with axes suitably permuted.

Return type

ndarray

See also

ndarray.T

Array property returning the array transposed.

ndarray.reshape

Give a new shape to an array without changing its data.

Examples

>>> a = np.array([[1, 2], [3, 4]])  
>>> a  
array([[1, 2],
       [3, 4]])
>>> a.transpose()  
array([[1, 3],
       [2, 4]])
>>> a.transpose((1, 0))  
array([[1, 3],
       [2, 4]])
>>> a.transpose(1, 0)  
array([[1, 3],
       [2, 4]])
var(axis=None, dtype=None, out=None, ddof=0, keepdims=False)

This docstring was copied from numpy.ndarray.var.

Some inconsistencies with the Dask version may exist.

Returns the variance of the array elements, along given axis.

Refer to numpy.var for full documentation.

See also

numpy.var

equivalent function

view(dtype=None, order='C')

Get a view of the array as a new data type

Parameters
  • dtype – The dtype by which to view the array. The default, None, results in the view having the same data-type as the original array.

  • order (string) – ‘C’ or ‘F’ (Fortran) ordering

  • reinterprets the bytes of the array under a new dtype. If that (This) –

  • does not have the same size as the original array then the shape (dtype) –

  • change. (will) –

  • that both numpy and dask.array can behave oddly when taking (Beware) –

  • views of arrays under Fortran ordering. Under some (shape-changing) –

  • of NumPy this function will fail when taking shape-changing (versions) –

  • of Fortran ordered arrays if the first dimension has chunks of (views) –

  • one. (size) –

property vindex

Vectorized indexing with broadcasting.

This is equivalent to numpy’s advanced indexing, using arrays that are broadcast against each other. This allows for pointwise indexing:

>>> import dask.array as da
>>> x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> x = da.from_array(x, chunks=2)
>>> x.vindex[[0, 1, 2], [0, 1, 2]].compute()
array([1, 5, 9])

Mixed basic/advanced indexing with slices/arrays is also supported. The order of dimensions in the result follows those proposed for ndarray.vindex: the subspace spanned by arrays is followed by all slices.

Note: vindex provides more general functionality than standard indexing, but it also has fewer optimizations and can be significantly slower.

visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters
  • filename (str or None, optional) – The name of the file to write to disk. If the provided filename doesn’t include an extension, ‘.png’ will be used by default. If filename is None, no file will be written, and we communicate with dot using only pipes.

  • format ({'png', 'pdf', 'dot', 'svg', 'jpeg', 'jpg'}, optional) – Format in which to write output file. Default is ‘png’.

  • optimize_graph (bool, optional) – If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

  • color ({None, 'order'}, optional) – Options to color nodes. Provide cmap= keyword for additional colormap

  • **kwargs – Additional keyword arguments to forward to to_graphviz.

Examples

>>> x.visualize(filename='dask.pdf')  
>>> x.visualize(filename='dask.pdf', color='order')  
Returns

result – See dask.dot.dot_graph for more information.

Return type

IPython.diplay.Image, IPython.display.SVG, or None

See also

dask.base.visualize, dask.dot.dot_graph

Notes

For more information on optimization see here:

https://docs.dask.org/en/latest/optimize.html

class nbodykit.base.catalog.ColumnFinder(name, bases, namespace, **kwargs)[source]

A meta-class that will register all columns of a class that have been marked with the column() decorator.

This adds the following attributes to the class definition:

1. _defaults : default columns, specified by passing default=True to the column() decorator.

  1. _hardcolumns : non-default, hard-coded columns

Note

This is a subclass of abc.ABCMeta so subclasses can define abstract properties, if they need to.

Methods

__call__(*args, **kwargs)

Call self as a function.

mro()

return a type’s method resolution order

register(subclass)

Register a virtual subclass of an ABC.

__instancecheck__(instance)

Override for isinstance(instance, cls).

__subclasscheck__(subclass)

Override for issubclass(subclass, cls).

mro()list

return a type’s method resolution order

register(subclass)

Register a virtual subclass of an ABC.

Returns the subclass, to allow usage as a class decorator.

nbodykit.base.catalog.column(name=None, is_default=False)[source]

Decorator that defines the decorated function as a column in a CatalogSource.

This can be used as a decorator with or without arguments. If no name is specified, the function name is used.

Parameters
  • name (str, optional) – the name of the column; if not provided, the name of the function being decorated is used

  • is_default (bool, optional) – whether the column is a default column; default columns are not serialized to disk