nbodykit.base.catalog module

class nbodykit.base.catalog.CatalogCopy(size, comm, use_cache=False, **columns)[source]

Bases: nbodykit.base.catalog.CatalogSource

A CatalogSource object that holds column data copied from an original source

Parameters:
  • size (int) – the size of the new source; this was likely determined by the number of particles passing the selection criterion
  • comm (MPI communicator) – the MPI communicator; this should be the same as the comm of the object that we are selecting from
  • use_cache (bool, optional) – whether to cache results
  • **columns – the data arrays that will be added to this source; keys represent the column names

Attributes

attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns A list of the hard-coded columns in the CatalogSource.
size
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a copy of the CatalogSource object
get_hardcolumn(col) Construct and return a hard-coded column.
make_column(array) Utility function to convert a numpy array to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
update_csize() Set the collective size, csize.
Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data
  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogCopy holding only the relevant slice
  3. slice object specifying which particles to select
  4. list of strings specifying column names; returns a CatalogCopy holding only the selected columns
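
For example, a minimal sketch of these access patterns (assuming an existing catalog cat with a Position column; the names are illustrative):

>>> pos = cat['Position']            # 1. column name -> dask array of column data
>>> subset = cat[cat['Selection']]   # 2. boolean column -> CatalogCopy of the selected rows
>>> first = cat[:10]                 # 3. slice -> CatalogCopy of the first 10 particles
>>> reduced = cat[['Position']]      # 4. list of names -> CatalogCopy with only those columns
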
__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

If use_cache is True, this internally caches data, using dask’s built-in cache features.

Parameters:args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

Notes

The default dask optimizer induces too many (unnecessary) IO calls, so we turn this feature off by default. Eventually we will probably want our own optimizer.
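
A typical read-then-compute pattern looks like this (assuming an existing catalog cat with Position and Velocity columns; the column names are illustrative):

>>> pos, vel = cat.read(['Position', 'Velocity'])   # lazy dask arrays
>>> pos, vel = cat.compute(pos, vel)                # evaluated numpy arrays
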

copy()

Return a copy of the CatalogSource object

Returns:the new CatalogSource object holding the copied data columns
Return type:CatalogCopy
csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

get_hardcolumn(col)

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by the @column decorator. Subclasses may override this attribute and use get_hardcolumn() to bypass the decorator logic.

logger = <logging.Logger object>
make_column(array)

Utility function to convert a numpy array to a dask.array.Array.
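
For example (a short sketch; the array shape is arbitrary):

>>> import numpy
>>> arr = numpy.random.random(size=(100, 3))
>>> col = CatalogSource.make_column(arr)   # a dask.array.Array wrapping arr
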

read(columns)

Return the requested columns as dask arrays.

Parameters:columns (list of str) – the names of the requested columns
Returns:the list of column data, in the form of dask arrays
Return type:list of dask.array.Array
save(output, columns, datasets=None, header='Header')

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved; the catalog attrs are saved to the header data set, and the attrs of each column are stored in the data set holding that column.

Parameters:
  • output (str) – the name of the file to write to
  • columns (list of str) – the names of the columns to save in the file
  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column
  • header (str, optional) – the name of the data set holding the header information, where attrs is stored
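
A short usage sketch (the file name and column names here are illustrative):

>>> cat.save('catalog.bigfile', ['Position', 'Velocity'])
>>> # or store a column under an explicitly named data set
>>> cat.save('catalog.bigfile', ['Position'], datasets=['1/Position'], header='Header')
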
size
to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, window='cic', weight='Weight', value='Value', selection='Selection', position='Position')

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters:
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs
  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs
  • dtype (string, optional) – the data type of the mesh array
  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh
  • compensated (bool, optional) – whether to correct for the window introduced by the grid interpolation scheme
  • window (str, optional) – the string specifying which window interpolation scheme to use; see pmesh.window.methods
  • weight (str, optional) – the name of the column specifying the weight for each particle
  • value (str, optional) – the name of the column specifying the field value for each particle
  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take
  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog
Returns:

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type:

CatalogMesh
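
A minimal sketch of gridding a catalog onto a mesh (this assumes BoxSize is already stored in attrs, otherwise pass it explicitly; the parameter values are illustrative, and paint() is assumed from the MeshSource interface):

>>> mesh = cat.to_mesh(Nmesh=256, window='cic', compensated=True, interlaced=True)
>>> field = mesh.paint(mode='real')   # interpolate the particles onto the mesh
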

update_csize()

Set the collective size, csize.

This function should be called in __init__() of a subclass, after size has been set to a valid value (not NotImplemented)

use_cache

If set to True, use the built-in caching features of dask to cache data in memory.

class nbodykit.base.catalog.CatalogSource(comm, use_cache=False)[source]

Bases: nbodykit.base.catalog.CatalogSourceBase

An abstract base class representing a catalog of discrete particles.

This object behaves like a structured numpy array – it must have a well-defined size when initialized. The size here represents the number of particles in the source on the local rank.

Subclasses of this class must define a size attribute.

The information about each particle is stored as a series of columns in the format of dask arrays. These columns can be accessed in a dict-like fashion.

All subclasses of this class contain the following default columns:

  1. Weight
  2. Value
  3. Selection

For a full description of these default columns, see the documentation.
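
For instance, any concrete subclass exposes these columns immediately (UniformCatalog is used here only as an example subclass, with an assumed signature):

>>> from nbodykit.lab import UniformCatalog
>>> cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
>>> cat['Weight']      # a dask array of ones
>>> cat['Selection']   # a dask array of True values
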

Attributes

attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns A list of the hard-coded columns in the CatalogSource.
size The number of particles in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a copy of the CatalogSource object
get_hardcolumn(col) Construct and return a hard-coded column.
make_column(array) Utility function to convert a numpy array to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
update_csize() Set the collective size, csize.
Selection()[source]

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles.

Value()[source]

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles.

Weight()[source]

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data
  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogCopy holding only the relevant slice
  3. slice object specifying which particles to select
  4. list of strings specifying column names; returns a CatalogCopy holding only the selected columns
__len__()[source]

The local size of the CatalogSource on a given rank.

__setitem__(col, value)[source]

Add columns to the CatalogSource, overriding any existing columns with the name col.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

If use_cache is True, this internally caches data, using dask’s built-in cache features.

Parameters:args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

Notes

The default dask optimizer induces too many (unnecessary) IO calls, so we turn this feature off by default. Eventually we will probably want our own optimizer.

copy()[source]

Return a copy of the CatalogSource object

Returns:the new CatalogSource object holding the copied data columns
Return type:CatalogCopy
csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

get_hardcolumn(col)

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by the @column decorator. Subclasses may override this attribute and use get_hardcolumn() to bypass the decorator logic.

logger = <logging.Logger object>
make_column(array)

Utility function to convert a numpy array to a dask.array.Array.

read(columns)

Return the requested columns as dask arrays.

Parameters:columns (list of str) – the names of the requested columns
Returns:the list of column data, in the form of dask arrays
Return type:list of dask.array.Array
save(output, columns, datasets=None, header='Header')

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved; the catalog attrs are saved to the header data set, and the attrs of each column are stored in the data set holding that column.

Parameters:
  • output (str) – the name of the file to write to
  • columns (list of str) – the names of the columns to save in the file
  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column
  • header (str, optional) – the name of the data set holding the header information, where attrs is stored
size

The number of particles in the CatalogSource on the local rank.

This property must be defined for all subclasses.

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, window='cic', weight='Weight', value='Value', selection='Selection', position='Position')

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters:
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs
  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs
  • dtype (string, optional) – the data type of the mesh array
  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh
  • compensated (bool, optional) – whether to correct for the window introduced by the grid interpolation scheme
  • window (str, optional) – the string specifying which window interpolation scheme to use; see pmesh.window.methods
  • weight (str, optional) – the name of the column specifying the weight for each particle
  • value (str, optional) – the name of the column specifying the field value for each particle
  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take
  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog
Returns:

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type:

CatalogMesh

update_csize()[source]

Set the collective size, csize.

This function should be called in __init__() of a subclass, after size has been set to a valid value (not NotImplemented)

use_cache

If set to True, use the built-in caching features of dask to cache data in memory.

class nbodykit.base.catalog.CatalogSourceBase(comm, use_cache=False)[source]

Bases: object

An abstract base class representing a catalog of discrete particles.

This object behaves like a structured numpy array – it must have a well-defined size when initialized. The size here represents the number of particles in the source on the local rank.

Subclasses of this class must define a size attribute.

The information about each particle is stored as a series of columns in the format of dask arrays. These columns can be accessed in a dict-like fashion.

Attributes

attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.
hardcolumns A list of the hard-coded columns in the CatalogSource.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
get_hardcolumn(col) Construct and return a hard-coded column.
make_column(array) Utility function to convert a numpy array to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
__delitem__(col)[source]

Delete a column; cannot delete a “hard-coded” column

__getitem__(sel)[source]

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data
  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogCopy holding only the relevant slice
  3. slice object specifying which particles to select
  4. list of strings specifying column names; returns a CatalogCopy holding only the selected columns
__setitem__(col, value)[source]

Add new columns to the CatalogSource, overriding any existing columns with the name col.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

compute(*args, **kwargs)[source]

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

If use_cache is True, this internally caches data, using dask’s built-in cache features.

Parameters:args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

Notes

The default dask optimizer induces too many (unnecessary) IO calls, so we turn this feature off by default. Eventually we will probably want our own optimizer.

get_hardcolumn(col)[source]

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by the @column decorator. Subclasses may override this attribute and use get_hardcolumn() to bypass the decorator logic.

logger = <logging.Logger object>
static make_column(array)[source]

Utility function to convert a numpy array to a dask.array.Array.

read(columns)[source]

Return the requested columns as dask arrays.

Parameters:columns (list of str) – the names of the requested columns
Returns:the list of column data, in the form of dask arrays
Return type:list of dask.array.Array
save(output, columns, datasets=None, header='Header')[source]

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved; the catalog attrs are saved to the header data set, and the attrs of each column are stored in the data set holding that column.

Parameters:
  • output (str) – the name of the file to write to
  • columns (list of str) – the names of the columns to save in the file
  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column
  • header (str, optional) – the name of the data set holding the header information, where attrs is stored
to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, window='cic', weight='Weight', value='Value', selection='Selection', position='Position')[source]

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters:
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs
  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs
  • dtype (string, optional) – the data type of the mesh array
  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh
  • compensated (bool, optional) – whether to correct for the window introduced by the grid interpolation scheme
  • window (str, optional) – the string specifying which window interpolation scheme to use; see pmesh.window.methods
  • weight (str, optional) – the name of the column specifying the weight for each particle
  • value (str, optional) – the name of the column specifying the field value for each particle
  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take
  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog
Returns:

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type:

CatalogMesh

use_cache

If set to True, use the built-in caching features of dask to cache data in memory.

class nbodykit.base.catalog.ColumnAccessor[source]

Bases: dask.array.core.Array

Provides access to a Column from a Catalog

This is a thin subclass of dask.array.Array to provide a reference to the catalog object, an additional attrs attribute (for recording the reproducible meta-data), and some pretty print support.

Due to the particulars of dask, any transformation that is not explicitly in-place will return a plain dask.array.Array, losing the pointer to the original catalog and the meta-data attrs.
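
A short sketch of this behavior (assuming an existing catalog cat; the column name is illustrative):

>>> col = cat['Position']       # a ColumnAccessor: carries .attrs and a reference to cat
>>> shifted = col + 10.0        # a plain dask.array.Array: the catalog reference and .attrs are lost
>>> cat['Position'] = shifted   # assign back to store the result as a column again
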

Attributes

A
T
chunks
imag
itemsize Length of one array element in bytes
name
nbytes Number of bytes in array
ndim
npartitions
numblocks
real
shape
size Number of elements in array
vindex Vectorized indexing with broadcasting.

Methods

all([axis, out, keepdims]) Returns True if all elements evaluate to True.
any([axis, out, keepdims]) Returns True if any of the elements of a evaluate to True.
argmax([axis, out]) Return indices of the maximum values along the given axis.
argmin([axis, out]) Return indices of the minimum values along the given axis of a.
as_daskarray()
astype(dtype, **kwargs) Copy of the array, cast to a specified type.
choose(choices[, out, mode]) Use an index array to construct a new array from a set of choices.
clip([min, max, out]) Return an array whose values are limited to [min, max].
compute()
conj()
copy() Copy array.
cumprod(axis[, dtype, out]) See da.cumprod for docstring
cumsum(axis[, dtype, out]) See da.cumsum for docstring
dot(b[, out]) Dot product of two arrays.
flatten([order]) Return a flattened array.
map_blocks(func, *args, **kwargs) Map a function across all blocks of a dask array.
map_overlap(func, depth[, boundary, trim]) Map a function over blocks of the array with some overlap
max([axis, out]) Return the maximum along a given axis.
mean([axis, dtype, out, keepdims]) Returns the average of the array elements along given axis.
min([axis, out, keepdims]) Return the minimum along a given axis.
moment(order[, axis, dtype, keepdims, ddof, …]) Calculate the nth centralized moment.
nonzero() Return the indices of the elements that are non-zero.
persist(**kwargs) Persist multiple Dask collections into memory
prod([axis, dtype, out, keepdims]) Return the product of the array elements over the given axis
ravel([order]) Return a flattened array.
rechunk(chunks[, threshold, block_size_limit]) See da.rechunk for docstring
repeat(repeats[, axis]) Repeat elements of an array.
reshape(shape[, order]) Returns an array containing the same data with a new shape.
round([decimals, out]) Return a with each element rounded to the given number of decimals.
squeeze([axis]) Remove single-dimensional entries from the shape of a.
std([axis, dtype, out, ddof, keepdims]) Returns the standard deviation of the array elements along given axis.
store(sources, targets[, lock, regions, compute]) Store dask arrays in array-like objects, overwrite data in target
sum([axis, dtype, out, keepdims]) Return the sum of the array elements over the given axis.
swapaxes(axis1, axis2) Return a view of the array with axis1 and axis2 interchanged.
to_dask_dataframe([columns]) Convert dask Array to dask Dataframe
to_delayed() Convert Array into dask Delayed objects
to_hdf5(filename, datapath, **kwargs) Store array in HDF5 file
topk(k) The top k elements of an array.
transpose(*axes) Returns a view of the array with axes transposed.
var([axis, dtype, out, ddof, keepdims]) Returns the variance of the array elements, along given axis.
view(dtype[, order]) Get a view of the array as a new data type
visualize([filename, format, optimize_graph]) Render the computation of this object’s task graph using graphviz.
vnorm([ord, axis, keepdims, split_every, out]) Vector norm
A
T
__repr__()
>>> import dask.array as da
>>> da.ones((10, 10), chunks=(5, 5), dtype='i4')
dask.array<..., shape=(10, 10), dtype=int32, chunksize=(5, 5)>
all(axis=None, out=None, keepdims=False)

Returns True if all elements evaluate to True.

Refer to numpy.all for full documentation.

See also

numpy.all()
equivalent function
any(axis=None, out=None, keepdims=False)

Returns True if any of the elements of a evaluate to True.

Refer to numpy.any for full documentation.

See also

numpy.any()
equivalent function
argmax(axis=None, out=None)

Return indices of the maximum values along the given axis.

Refer to numpy.argmax for full documentation.

See also

numpy.argmax()
equivalent function
argmin(axis=None, out=None)

Return indices of the minimum values along the given axis of a.

Refer to numpy.argmin for detailed documentation.

See also

numpy.argmin()
equivalent function
as_daskarray()[source]
astype(dtype, **kwargs)

Copy of the array, cast to a specified type.

Parameters:
  • dtype (str or dtype) – Typecode or data-type to which the array is cast.
  • casting ({'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional) –

    Controls what kind of data casting may occur. Defaults to ‘unsafe’ for backwards compatibility.

    • ’no’ means the data types should not be cast at all.
    • ’equiv’ means only byte-order changes are allowed.
    • ’safe’ means only casts which can preserve values are allowed.
    • ’same_kind’ means only safe casts or casts within a kind,
      like float64 to float32, are allowed.
    • ’unsafe’ means any data conversions may be done.
  • copy (bool, optional) – By default, astype always returns a newly allocated array. If this is set to False and the dtype requirement is satisfied, the input array is returned instead of a copy.
choose(choices, out=None, mode='raise')

Use an index array to construct a new array from a set of choices.

Refer to numpy.choose for full documentation.

See also

numpy.choose()
equivalent function
chunks
clip(min=None, max=None, out=None)

Return an array whose values are limited to [min, max]. One of max or min must be given.

Refer to numpy.clip for full documentation.

See also

numpy.clip()
equivalent function
compute()[source]
conj()
copy()

Copy array. This is a no-op for dask.arrays, which are immutable

cumprod(axis, dtype=None, out=None)

See da.cumprod for docstring

cumsum(axis, dtype=None, out=None)

See da.cumsum for docstring

dask
dot(b, out=None)

Dot product of two arrays.

Refer to numpy.dot for full documentation.

See also

numpy.dot()
equivalent function

Examples

>>> a = np.eye(2)  
>>> b = np.ones((2, 2)) * 2  
>>> a.dot(b)  
array([[ 2.,  2.],
       [ 2.,  2.]])

This array method can be conveniently chained:

>>> a.dot(b).dot(b)  
array([[ 8.,  8.],
       [ 8.,  8.]])
dtype
flatten([order])

Return a flattened array.

Refer to numpy.ravel for full documentation.

See also

numpy.ravel()
equivalent function
ndarray.flat()
a flat iterator on the array.
imag
itemsize

Length of one array element in bytes

map_blocks(func, *args, **kwargs)

Map a function across all blocks of a dask array.

Parameters:
  • func (callable) – Function to apply to every block in the array.
  • args (dask arrays or constants) –
  • dtype (np.dtype, optional) – The dtype of the output array. It is recommended to provide this. If not provided, will be inferred by applying the function to a small set of fake data.
  • chunks (tuple, optional) – Chunk shape of resulting blocks if the function does not preserve shape. If not provided, the resulting array is assumed to have the same block structure as the first input array.
  • drop_axis (number or iterable, optional) – Dimensions lost by the function.
  • new_axis (number or iterable, optional) – New dimensions created by the function. Note that these are applied after drop_axis (if present).
  • token (string, optional) – The key prefix to use for the output array. If not provided, will be determined from the function name.
  • name (string, optional) – The key name to use for the output array. Note that this fully specifies the output key name, and must be unique. If not provided, will be determined by a hash of the arguments.
  • **kwargs – Other keyword arguments to pass to function. Values must be constants (not dask.arrays)

Examples

>>> import dask.array as da
>>> x = da.arange(6, chunks=3)
>>> x.map_blocks(lambda x: x * 2).compute()
array([ 0,  2,  4,  6,  8, 10])

The da.map_blocks function can also accept multiple arrays.

>>> d = da.arange(5, chunks=2)
>>> e = da.arange(5, chunks=2)
>>> f = da.map_blocks(lambda a, b: a + b**2, d, e)
>>> f.compute()
array([ 0,  2,  6, 12, 20])

If the function changes shape of the blocks then you must provide chunks explicitly.

>>> y = x.map_blocks(lambda x: x[::2], chunks=((2, 2),))

You have a bit of freedom in specifying chunks. If all of the output chunk sizes are the same, you can provide just that chunk size as a single tuple.

>>> a = da.arange(18, chunks=(6,))
>>> b = a.map_blocks(lambda x: x[:3], chunks=(3,))

If the function changes the dimension of the blocks you must specify the created or destroyed dimensions.

>>> b = a.map_blocks(lambda x: x[None, :, None], chunks=(1, 6, 1),
...                  new_axis=[0, 2])

Map_blocks aligns blocks by block positions without regard to shape. In the following example we have two arrays with the same number of blocks but with different shape and chunk sizes.

>>> x = da.arange(1000, chunks=(100,))
>>> y = da.arange(100, chunks=(10,))

The relevant attribute to match is numblocks.

>>> x.numblocks
(10,)
>>> y.numblocks
(10,)

If these match (up to broadcasting rules) then we can map arbitrary functions across blocks

>>> def func(a, b):
...     return np.array([a.max(), b.max()])
>>> da.map_blocks(func, x, y, chunks=(2,), dtype='i8')
dask.array<func, shape=(20,), dtype=int64, chunksize=(2,)>
>>> _.compute()
array([ 99,   9, 199,  19, 299,  29, 399,  39, 499,  49, 599,  59, 699,
        69, 799,  79, 899,  89, 999,  99])

Your block function can learn where in the array it is if it supports a block_id keyword argument. This will receive entries like (2, 0, 1), the position of the block in the dask array.

>>> def func(block, block_id=None):
...     pass

You may specify the key name prefix of the resulting task in the graph with the optional token keyword argument.

>>> x.map_blocks(lambda x: x + 1, token='increment')  
dask.array<increment, shape=(100,), dtype=int64, chunksize=(10,)>
map_overlap(func, depth, boundary=None, trim=True, **kwargs)

Map a function over blocks of the array with some overlap

We share neighboring zones between blocks of the array, then map a function, then trim away the neighboring strips.

Parameters:
  • func (function) – The function to apply to each extended block
  • depth (int, tuple, or dict) – The number of cells that each block should share with its neighbors. If a tuple or dict, this can be different per axis
  • boundary (str, tuple, dict) – how to handle the boundaries. Values include ‘reflect’, ‘periodic’, ‘nearest’, ‘none’, or any constant value like 0 or np.nan
  • trim (bool) – Whether or not to trim the excess after the map function. Set this to false if your mapping function does this for you.
  • **kwargs – Other keyword arguments valid in map_blocks

Examples

>>> import numpy as np
>>> import dask.array as da
>>> x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
>>> x = da.from_array(x, chunks=5)
>>> def derivative(x):
...     return x - np.roll(x, 1)
>>> y = x.map_overlap(derivative, depth=1, boundary=0)
>>> y.compute()
array([ 1,  0,  1,  1,  0,  0, -1, -1,  0])
>>> import dask.array as da
>>> x = np.arange(16).reshape((4, 4))
>>> d = da.from_array(x, chunks=(2, 2))
>>> d.map_overlap(lambda x: x + x.size, depth=1).compute()
array([[16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
>>> func = lambda x: x + x.size
>>> depth = {0: 1, 1: 1}
>>> boundary = {0: 'reflect', 1: 'none'}
>>> d.map_overlap(func, depth, boundary).compute()  
array([[12,  13,  14,  15],
       [16,  17,  18,  19],
       [20,  21,  22,  23],
       [24,  25,  26,  27]])
max(axis=None, out=None)

Return the maximum along a given axis.

Refer to numpy.amax for full documentation.

See also

numpy.amax()
equivalent function
mean(axis=None, dtype=None, out=None, keepdims=False)

Returns the average of the array elements along given axis.

Refer to numpy.mean for full documentation.

See also

numpy.mean()
equivalent function
min(axis=None, out=None, keepdims=False)

Return the minimum along a given axis.

Refer to numpy.amin for full documentation.

See also

numpy.amin()
equivalent function
moment(order, axis=None, dtype=None, keepdims=False, ddof=0, split_every=None, out=None)

Calculate the nth centralized moment.

Parameters:
  • order (int) – Order of the moment that is returned, must be >= 2.
  • axis (int, optional) – Axis along which the central moment is computed. The default is to compute the moment of the flattened array.
  • dtype (data-type, optional) – Type to use in computing the moment. For arrays of integer type the default is float64; for arrays of float types it is the same as the array type.
  • keepdims (bool, optional) – If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original array.
  • ddof (int, optional) – “Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is zero.
Returns:

moment

Return type:

ndarray

References

[R114] Pebay, Philippe (2008), “Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments” (PDF), Technical Report SAND2008-6212, Sandia National Laboratories.

name
nbytes

Number of bytes in array

ndim
nonzero()

Return the indices of the elements that are non-zero.

Refer to numpy.nonzero for full documentation.

See also

numpy.nonzero()
equivalent function
npartitions
numblocks
persist(**kwargs)

Persist multiple Dask collections into memory

This turns lazy Dask collections into Dask collections with the same metadata, but now with their results fully computed or actively computing in the background.

For example a lazy dask.array built up from many lazy calls will now be a dask.array of the same shape, dtype, chunks, etc., but now with all of those previously lazy tasks either computed in memory as many small NumPy arrays (in the single-machine case) or asynchronously running in the background on a cluster (in the distributed case).

This function operates differently if a dask.distributed.Client exists and is connected to a distributed scheduler. In this case this function will return as soon as the task graph has been submitted to the cluster, but before the computations have completed. Computations will continue asynchronously in the background. When using this function with the single machine scheduler it blocks until the computations have finished.

When using Dask on a single machine you should ensure that the dataset fits entirely within memory.

Examples

>>> df = dd.read_csv('/path/to/*.csv')  
>>> df = df[df.name == 'Alice']  
>>> df['in-debt'] = df.balance < 0  
>>> df = df.persist()  # triggers computation  
>>> df.value().min()  # future computations are now fast  
-10
>>> df.value().max()  
100
>>> from dask import persist  # use persist function on multiple collections
>>> a, b = persist(a, b)  
Parameters:
  • *args (Dask collections) –
  • get (callable, optional) – A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
  • optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
  • **kwargs – Extra keywords to forward to the scheduler get function.
Returns:

Return type:

New dask collections backed by in-memory data

prod(axis=None, dtype=None, out=None, keepdims=False)

Return the product of the array elements over the given axis

Refer to numpy.prod for full documentation.

See also

numpy.prod()
equivalent function
ravel([order])

Return a flattened array.

Refer to numpy.ravel for full documentation.

See also

numpy.ravel()
equivalent function
ndarray.flat()
a flat iterator on the array.
real
rechunk(chunks, threshold=None, block_size_limit=None)

See da.rechunk for docstring

repeat(repeats, axis=None)

Repeat elements of an array.

Refer to numpy.repeat for full documentation.

See also

numpy.repeat()
equivalent function
reshape(shape, order='C')

Returns an array containing the same data with a new shape.

Refer to numpy.reshape for full documentation.

See also

numpy.reshape()
equivalent function
round(decimals=0, out=None)

Return a with each element rounded to the given number of decimals.

Refer to numpy.around for full documentation.

See also

numpy.around()
equivalent function
shape
size

Number of elements in array

squeeze(axis=None)

Remove single-dimensional entries from the shape of a.

Refer to numpy.squeeze for full documentation.

See also

numpy.squeeze()
equivalent function
std(axis=None, dtype=None, out=None, ddof=0, keepdims=False)

Returns the standard deviation of the array elements along given axis.

Refer to numpy.std for full documentation.

See also

numpy.std()
equivalent function
store(sources, targets, lock=True, regions=None, compute=True, **kwargs)

Store dask arrays in array-like objects, overwrite data in target

This stores dask arrays into objects that support numpy-style setitem indexing. It stores values chunk by chunk so that it does not have to fill up memory. For best performance you can align the block size of the storage target with the block size of your array.

If your data fits in memory then you may prefer calling np.array(myarray) instead.

Parameters:
  • sources (Array or iterable of Arrays) –
  • targets (array-like or iterable of array-likes) – These should support setitem syntax target[10:20] = ...
  • lock (boolean or threading.Lock, optional) – Whether or not to lock the data stores while storing. Pass True (lock each file individually), False (don’t lock) or a particular threading.Lock object to be shared among all writes.
  • regions (tuple of slices or iterable of tuple of slices) – Each region tuple in regions should be such that target[region].shape = source.shape for the corresponding source and target in sources and targets, respectively.
  • compute (boolean, optional) – If True, compute immediately; otherwise return a dask.delayed.Delayed

Examples

>>> x = ...  
>>> import h5py  
>>> f = h5py.File('myfile.hdf5')  
>>> dset = f.create_dataset('/data', shape=x.shape,
...                                  chunks=x.chunks,
...                                  dtype='f8')  
>>> store(x, dset)  

Alternatively store many arrays at the same time

>>> store([x, y, z], [dset1, dset2, dset3])  
sum(axis=None, dtype=None, out=None, keepdims=False)

Return the sum of the array elements over the given axis.

Refer to numpy.sum for full documentation.

See also

numpy.sum()
equivalent function
swapaxes(axis1, axis2)

Return a view of the array with axis1 and axis2 interchanged.

Refer to numpy.swapaxes for full documentation.

See also

numpy.swapaxes()
equivalent function
to_dask_dataframe(columns=None)

Convert dask Array to dask Dataframe

Parameters:columns (list or string) – list of column names if DataFrame, single string if Series

See also

dask.dataframe.from_dask_array()

to_delayed()

Convert Array into dask Delayed objects

Returns an array of values, one value per chunk.

See also

dask.array.from_delayed()

to_hdf5(filename, datapath, **kwargs)

Store array in HDF5 file

>>> x.to_hdf5('myfile.hdf5', '/x')  

Optionally provide arguments as though to h5py.File.create_dataset

>>> x.to_hdf5('myfile.hdf5', '/x', compression='lzf', shuffle=True)  

See also

da.store(), h5py.File.create_dataset()

topk(k)

The top k elements of an array.

See da.topk for docstring

transpose(*axes)

Returns a view of the array with axes transposed.

For a 1-D array, this has no effect. (To change between column and row vectors, first cast the 1-D array into a matrix object.) For a 2-D array, this is the usual matrix transpose. For an n-D array, if axes are given, their order indicates how the axes are permuted (see Examples). If axes are not provided and a.shape = (i[0], i[1], ... i[n-2], i[n-1]), then a.transpose().shape = (i[n-1], i[n-2], ... i[1], i[0]).

Parameters:axes (None, tuple of ints, or n ints) –
  • None or no argument: reverses the order of the axes.
  • tuple of ints: i in the j-th place in the tuple means a’s i-th axis becomes a.transpose()’s j-th axis.
  • n ints: same as an n-tuple of the same ints (this form is intended simply as a “convenience” alternative to the tuple form)
Returns:out – View of a, with axes suitably permuted.
Return type:ndarray

See also

ndarray.T()
Array property returning the array transposed.

Examples

>>> a = np.array([[1, 2], [3, 4]])  
>>> a  
array([[1, 2],
       [3, 4]])
>>> a.transpose()  
array([[1, 3],
       [2, 4]])
>>> a.transpose((1, 0))  
array([[1, 3],
       [2, 4]])
>>> a.transpose(1, 0)  
array([[1, 3],
       [2, 4]])
var(axis=None, dtype=None, out=None, ddof=0, keepdims=False)

Returns the variance of the array elements, along given axis.

Refer to numpy.var for full documentation.

See also

numpy.var()
equivalent function
view(dtype, order='C')

Get a view of the array as a new data type

Parameters:
  • dtype – The dtype by which to view the array
  • order (string) – ‘C’ or ‘F’ (Fortran) ordering

This reinterprets the bytes of the array under a new dtype. If that dtype does not have the same size as the original array then the shape will change.

Beware that both numpy and dask.array can behave oddly when taking shape-changing views of arrays under Fortran ordering. Under some versions of NumPy this function will fail when taking shape-changing views of Fortran ordered arrays if the first dimension has chunks of size one.
vindex

Vectorized indexing with broadcasting.

This is equivalent to numpy’s advanced indexing, using arrays that are broadcast against each other. This allows for pointwise indexing:

>>> import numpy as np
>>> import dask.array as da
>>> x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> x = da.from_array(x, chunks=2)
>>> x.vindex[[0, 1, 2], [0, 1, 2]].compute()
array([1, 5, 9])

Mixed basic/advanced indexing with slices/arrays is also supported. The order of dimensions in the result follows those proposed for ndarray.vindex [1]_: the subspace spanned by arrays is followed by all slices.

Note: vindex provides more general functionality than standard indexing, but it also has fewer optimizations and can be significantly slower.

[1] https://github.com/numpy/numpy/pull/6256

visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:
  • filename (str or None, optional) – The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
  • format ({'png', 'pdf', 'dot', 'svg', 'jpeg', 'jpg'}, optional) – Format in which to write output file. Default is ‘png’.
  • optimize_graph (bool, optional) – If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
  • **kwargs – Additional keyword arguments to forward to to_graphviz.
Returns:

result – See dask.dot.dot_graph for more information.

Return type:

IPython.display.Image, IPython.display.SVG, or None

See also

dask.base.visualize(), dask.dot.dot_graph()

Notes

For more information on optimization see here:

http://dask.pydata.org/en/latest/optimize.html

vnorm(ord=None, axis=None, keepdims=False, split_every=None, out=None)

Vector norm

nbodykit.base.catalog.column(name=None)[source]

Decorator that defines a function as a column in a CatalogSource
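
A minimal sketch of its use in a subclass (the class and column names are illustrative; a real subclass would also need to set size in its __init__):

>>> import numpy
>>> from nbodykit.base.catalog import CatalogSource, column
>>> class MyCatalog(CatalogSource):
...     @column
...     def Mass(self):
...         # hard-coded column returning unit masses for all local particles
...         return self.make_column(numpy.ones(self.size))
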

nbodykit.base.catalog.find_column(cls, name)[source]

Find a specific column name of an input class, or raise an exception if it does not exist

Returns:column – the callable that returns the column data
Return type:callable
nbodykit.base.catalog.find_columns(cls)[source]

Find all hard-coded column names associated with the input class

Returns:hardcolumns – a set of the names of all hard-coded columns for the input class cls
Return type:set
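
For example, using CatalogSource itself as the input class:

>>> from nbodykit.base.catalog import CatalogSource, find_columns, find_column
>>> names = find_columns(CatalogSource)            # {'Selection', 'Value', 'Weight'}
>>> getter = find_column(CatalogSource, 'Weight')  # the callable that returns the Weight column
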
nbodykit.base.catalog.get_catalog_subset(parent, index)[source]

Select a subset of a CatalogSource according to a boolean index array.

Returns a CatalogCopy holding only the data that satisfies the slice criterion.

Parameters:
  • parent (CatalogSource) – the parent source that will be sliced
  • index (array_like) – either a dask or numpy boolean array; this determines which rows are included in the returned object
Returns:

subset – the particle source with the same meta-data as parent, and with the sliced data arrays

Return type:

CatalogCopy
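
A short sketch (assuming an existing catalog cat with a Mass column; the column name and threshold are illustrative):

>>> from nbodykit.base.catalog import get_catalog_subset
>>> keep = cat['Mass'] > 1e13                # a boolean dask array
>>> subset = get_catalog_subset(cat, keep)   # a CatalogCopy holding only the selected rows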