nbodykit.base.catalog

Functions

column([name]) Decorator that defines a function as a column in a CatalogSource
find_column(cls, name) Find a specific column name of an input class, or raise an exception if it does not exist
find_columns(cls) Find all hard-coded column names associated with the input class

Classes

CatalogSource(*args, **kwargs) An abstract base class representing a catalog of discrete particles.
CatalogSourceBase An abstract base class that implements most of the functionality in CatalogSource.
ColumnAccessor Provides access to a Column from a Catalog
class nbodykit.base.catalog.CatalogSource(*args, **kwargs)[source]

An abstract base class representing a catalog of discrete particles.

This object behaves like a structured numpy array: it must have a well-defined size when initialized. The size here represents the number of particles in the source on the local rank.

The information about each particle is stored as a series of columns in the format of dask arrays. These columns can be accessed in a dict-like fashion.

All subclasses of this class contain the following default columns:

  1. Weight
  2. Value
  3. Selection

For a full description of these default columns, see the documentation.

Important

Subclasses of this class must set the _size attribute.

Parameters:
  • comm – the MPI communicator to use for this object
  • use_cache (bool, optional) – whether to cache intermediate dask task results; default is False

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns A list of the hard-coded columns in the CatalogSource.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Construct and return a hard-coded column.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()[source]

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles.

Value()[source]

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles.

Weight()[source]

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles.
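The weighted average that defines the mesh field can be sketched in plain NumPy (a conceptual illustration with made-up data, not nbodykit's actual painting code):

```python
import numpy as np

# hypothetical per-particle data: the mesh cell each particle falls into,
# plus its Weight and Value columns
cells  = np.array([0, 0, 1, 2, 2, 2])
weight = np.array([1.0, 3.0, 2.0, 1.0, 1.0, 2.0])
value  = np.array([10., 20., 5., 8., 8., 2.])

# per-cell weighted average of Value, with weights given by Weight:
# sum(W * V) / sum(W)
num = np.bincount(cells, weights=weight * value)
den = np.bincount(cells, weights=weight)
field = num / den
```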

__len__()[source]

The local size of the CatalogSource on a given rank.

__setitem__(col, value)[source]

Add columns to the CatalogSource, overriding any existing columns with the name col.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

gslice(start, stop, end=1, redistribute=True)[source]

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Parameters:
  • start (int) – the start index of the global slice
  • stop (int) – the stop index of the global slice
  • step (int, optional) – the step size of the global slice
  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice
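Conceptually, gslice behaves like slicing the concatenation of all local chunks and then re-scattering the result; a NumPy sketch with a hypothetical three-rank layout:

```python
import numpy as np

# hypothetical local chunks held by 3 ranks
chunks = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

glob = np.concatenate(chunks)       # the global catalog; csize = 12
sliced = glob[2:10:2]               # global slice: start=2, stop=10, step=2
new_chunks = np.array_split(sliced, len(chunks))  # redistribute ~evenly
```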
size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)[source]

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be floating or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters:
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided
  • reverse (bool, optional) – if True, perform descending sort operations
  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource
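The single-rank semantics can be sketched with NumPy's lexsort (a conceptual illustration, not the MPI implementation):

```python
import numpy as np

# hypothetical columns to sort by: primary key 'mass', secondary key 'ident'
mass  = np.array([1.0, 2.0, 1.0, 2.0])
ident = np.array([3, 1, 0, 2])

# np.lexsort takes its *last* key as primary, so pass the keys reversed
order = np.lexsort((ident, mass))

# reverse=True would simply flip the ascending order
descending = order[::-1]
```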
class nbodykit.base.catalog.CatalogSourceBase[source]

An abstract base class that implements most of the functionality in CatalogSource.

The main difference between this class and CatalogSource is that this base class does not assume the object has a size attribute.

Note

See the docstring for CatalogSource. Most often, users should implement custom sources as subclasses of CatalogSource.

Parameters:
  • comm – the MPI communicator to use for this object
  • use_cache (bool, optional) – whether to cache intermediate dask task results; default is False

Attributes

attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
hardcolumns A list of the hard-coded columns in the CatalogSource.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Construct and return a hard-coded column.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
__delitem__(col)[source]

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)[source]

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters:other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied
Returns:return self, with the added attributes
Return type:CatalogSource
__getitem__(sel)[source]

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data
  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice
  3. slice object specifying which particles to select
  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing with a boolean array is a collective operation
  • If the base attribute is set, columns will be returned from base instead of from self.
__setitem__(col, value)[source]

Add new columns to the CatalogSource, overriding any existing columns with the name col.

Note

If the base attribute is set, columns will be added to base instead of to self.

__slice__(index)[source]

Select a subset of self according to a boolean index array.

Returns a new object of the same type as self holding only the data that satisfies the slice index.

Parameters:index (array_like) – either a dask or numpy boolean array; this determines which rows are included in the returned object
attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)[source]

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

If use_cache is True, this internally caches data, using dask’s built-in cache features.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters:args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

Notes

The default dask optimizer induces too many (unnecessary) IO calls, so we turn this feature off by default. Eventually, we will probably want our own optimizer.
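A minimal dask sketch (assumes dask is installed) of the behavior that compute() wraps: several collections evaluated in a single pass over a shared task graph.

```python
import numpy as np
import dask
import dask.array as da

x = da.from_array(np.arange(10), chunks=5)

# both results are computed at once, walking the shared task graph one time
total, mean = dask.compute(x.sum(), x.mean())
```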

copy()[source]

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy is no longer related to self.

Returns:a new CatalogSource that holds all of the data columns of self
Return type:CatalogSource
get_hardcolumn(col)[source]

Construct and return a hard-coded column.

These are usually produced by calling member functions marked by the @column decorator.

Subclasses may override this method and the hardcolumns attribute to bypass the decorator logic.

Note

If the base attribute is set, get_hardcolumn() will be called using base instead of self.

hardcolumns

A list of the hard-coded columns in the CatalogSource.

These columns are usually member functions marked by the @column decorator. Subclasses may override this attribute and use get_hardcolumn() to bypass the decorator logic.

Note

If the base attribute is set, the value of base.hardcolumns will be returned.

static make_column(array)[source]

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters:array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object
Returns:a dask array initialized from array
Return type:dask.array.Array
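A sketch of what make_column amounts to (assumes dask is installed): wrap an array-like object as a dask.array.Array with a chosen chunk size.

```python
import numpy as np
import dask.array as da

data = np.linspace(0.0, 1.0, 8)
# in nbodykit, the chunk size would come from the dask_chunk_size option
col = da.from_array(data, chunks=4)
```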
read(columns)[source]

Return the requested columns as dask arrays.

Parameters:columns (list of str) – the names of the requested columns
Returns:the list of column data, in the form of dask arrays
Return type:list of dask.array.Array
save(output, columns, datasets=None, header='Header')[source]

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved, and attrs is saved in header. The attrs of each column are stored with its respective dataset.

Parameters:
  • output (str) – the name of the file to write to
  • columns (list of str) – the names of the columns to save in the file
  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column
  • header (str, optional) – the name of the data set holding the header information, where attrs is stored
to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, window='cic', weight='Weight', value='Value', selection='Selection', position='Position')[source]

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters:
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs
  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs
  • dtype (string, optional) – the data type of the mesh array
  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh
  • compensated (bool, optional) – whether to correct for the window introduced by the grid interpolation scheme
  • window (str, optional) – the string specifying which window interpolation scheme to use; see pmesh.window.methods
  • weight (str, optional) – the name of the column specifying the weight for each particle
  • value (str, optional) – the name of the column specifying the field value for each particle
  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take
  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog
Returns:

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type:

CatalogMesh

use_cache

If set to True, use the built-in caching features of dask to cache data in memory.

view(type=None)[source]

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters:type (Python type) – the desired class type of the returned object.
class nbodykit.base.catalog.ColumnAccessor[source]

Provides access to a Column from a Catalog

This is a thin subclass of dask.array.Array to provide a reference to the catalog object, an additional attrs attribute (for recording the reproducible meta-data), and some pretty print support.

Due to a peculiarity of dask, any transformation that is not explicitly in-place will return a dask.array.Array, losing the pointer to the original catalog and the meta-data attrs.

Attributes

A
T
chunks
imag
itemsize Length of one array element in bytes
name
nbytes Number of bytes in array
ndim
npartitions
numblocks
real
shape
size Number of elements in array
vindex Vectorized indexing with broadcasting.

Methods

all([axis, out, keepdims]) Returns True if all elements evaluate to True.
any([axis, out, keepdims]) Returns True if any of the elements of a evaluate to True.
argmax([axis, out]) Return indices of the maximum values along the given axis.
argmin([axis, out]) Return indices of the minimum values along the given axis of a.
as_daskarray()
astype(dtype, **kwargs) Copy of the array, cast to a specified type.
choose(choices[, out, mode]) Use an index array to construct a new array from a set of choices.
clip([min, max, out]) Return an array whose values are limited to [min, max].
compute()
conj()
copy() Copy array.
cumprod(axis[, dtype, out]) See da.cumprod for docstring
cumsum(axis[, dtype, out]) See da.cumsum for docstring
dot(b[, out]) Dot product of two arrays.
flatten([order]) Return a flattened array.
map_blocks(func, *args, **kwargs) Map a function across all blocks of a dask array.
map_overlap(func, depth[, boundary, trim]) Map a function over blocks of the array with some overlap
max([axis, out]) Return the maximum along a given axis.
mean([axis, dtype, out, keepdims]) Returns the average of the array elements along given axis.
min([axis, out, keepdims]) Return the minimum along a given axis.
moment(order[, axis, dtype, keepdims, ddof, …]) Calculate the nth centralized moment.
nonzero() Return the indices of the elements that are non-zero.
persist(**kwargs) Persist multiple Dask collections into memory
prod([axis, dtype, out, keepdims]) Return the product of the array elements over the given axis
ravel([order]) Return a flattened array.
rechunk(chunks[, threshold, block_size_limit]) See da.rechunk for docstring
repeat(repeats[, axis]) Repeat elements of an array.
reshape(shape[, order]) Returns an array containing the same data with a new shape.
round([decimals, out]) Return a with each element rounded to the given number of decimals.
squeeze([axis]) Remove single-dimensional entries from the shape of a.
std([axis, dtype, out, ddof, keepdims]) Returns the standard deviation of the array elements along given axis.
store(sources, targets[, lock, regions, compute]) Store dask arrays in array-like objects, overwrite data in target
sum([axis, dtype, out, keepdims]) Return the sum of the array elements over the given axis.
swapaxes(axis1, axis2) Return a view of the array with axis1 and axis2 interchanged.
to_dask_dataframe([columns]) Convert dask Array to dask Dataframe
to_delayed() Convert Array into dask Delayed objects
to_hdf5(filename, datapath, **kwargs) Store array in HDF5 file
topk(k) The top k elements of an array.
transpose(*axes) Returns a view of the array with axes transposed.
var([axis, dtype, out, ddof, keepdims]) Returns the variance of the array elements, along given axis.
view(dtype[, order]) Get a view of the array as a new data type
visualize([filename, format, optimize_graph]) Render the computation of this object’s task graph using graphviz.
vnorm([ord, axis, keepdims, split_every, out]) Vector norm
nbodykit.base.catalog.column(name=None)[source]

Decorator that defines a function as a column in a CatalogSource

nbodykit.base.catalog.find_column(cls, name)[source]

Find a specific column name of an input class, or raise an exception if it does not exist

Returns:column – the callable that returns the column data
Return type:callable
nbodykit.base.catalog.find_columns(cls)[source]

Find all hard-coded column names associated with the input class

Returns:hardcolumns – a set of the names of all hard-coded columns for the input class cls
Return type:set
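The mechanics of these three helpers can be sketched in pure Python (a hypothetical reimplementation of the pattern, not nbodykit's actual code): @column tags a method, and find_columns/find_column scan a class for the tags.

```python
def column(name=None):
    """Tag a function as providing a named column (defaulting to its own name)."""
    def wrapper(func):
        func.column_name = name or func.__name__
        return func
    return wrapper

def find_columns(cls):
    """Return the set of all column names tagged on the class."""
    return {attr.column_name for attr in vars(cls).values()
            if hasattr(attr, 'column_name')}

def find_column(cls, name):
    """Return the callable for a named column, or raise if it does not exist."""
    for attr in vars(cls).values():
        if getattr(attr, 'column_name', None) == name:
            return attr
    raise ValueError("no column '%s' in %s" % (name, cls.__name__))

class Demo:
    @column()
    def Weight(self):
        return 1.0

    @column(name='Mass')
    def mass_column(self):
        return 2.0
```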