nbodykit.source.catalog.file

Functions

FileCatalogFactory(name, filetype[, examples])

Factory method to create a CatalogSource that uses a subclass of nbodykit.io.base.FileType to read data from disk.

Classes

FileCatalog(filetype, path, *args, **kwargs)

Base class to create a source of particles from a single file, or multiple files, on disk.

FileCatalogBase(filetype, path[, args, ...])

Base class to create a source of particles from a single file, or multiple files, on disk.

class nbodykit.source.catalog.file.BigFileCatalog(path, *args, **kwargs)

A CatalogSource that uses BigFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters
  • path (str) – the name of the directory holding the bigfile data

  • exclude (list of str, optional) – the data sets to exclude from loading within bigfile; the default is the header. If any list is given, the name of the header column must also be given if it is not part of the data set. The names are shell glob patterns.

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator

  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
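
For instance, a minimal, hedged sketch of loading a bigfile snapshot. The directory name 'fastpm_output' and the data set name '1/' are hypothetical, and the dataset keyword (not listed in the parameters above) is forwarded to the underlying nbodykit.io.bigfile.BigFile reader:

>>> from nbodykit.source.catalog.file import BigFileCatalog
>>> # open the hypothetical '1/' data set inside the 'fastpm_output' directory
>>> cat = BigFileCatalog('fastpm_output', dataset='1/')
>>> cat.columns   # the hard columns found in the file, plus the default columns
>>> cat.csize     # total number of objects, summed over all MPI ranks
>>> cat.attrs     # meta-data loaded from the header data set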

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.
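
A short sketch of the four indexing modes; cat is an existing CatalogSource and the column names are hypothetical:

>>> pos = cat['Position']                # 1. dask array of column data
>>> massive = cat[cat['Mass'] > 1e13]    # 2. boolean-array slice (collective)
>>> first = cat[:100]                    # 3. slice object selecting particles
>>> sub = cat[['Position', 'Velocity']]  # 4. catalog with only these columns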

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.
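
For example, a minimal sketch pairing read() with compute(); cat is an existing catalog and the column names are hypothetical:

>>> # read() returns dask arrays; compute() evaluates them to numpy arrays
>>> pos, vel = cat.read(['Position', 'Velocity'])
>>> pos, vel = cat.compute(pos, vel)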

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice
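
For example, a sketch assuming an existing catalog cat:

>>> # keep the first 1000 objects globally, re-scattered evenly across ranks
>>> subcat = cat.gslice(0, 1000)
>>> # keep only the locally-held part of the slice instead
>>> local = cat.gslice(0, 1000, redistribute=False)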

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array
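
For example, a sketch of assigning a new, hypothetical column by converting a numpy array via make_column:

>>> import numpy
>>> # __setitem__ accepts the resulting dask array directly
>>> cat['Weight1'] = cat.make_column(numpy.ones(cat.size))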

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.
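
For example, assuming a 'Position' column exists:

>>> # evaluate Position once and keep the result in memory
>>> cat2 = cat.persist(['Position'])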

query_range(start, end)

Seek to a range in the file catalog.

Parameters
  • start (int) – start of the file relative to the physical file

  • end (int) – end of the file relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function will raise ValueError, since the operation in that case is not well defined.
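
For example:

>>> # restrict the catalog to the first 100 rows of the physical file
>>> subcat = cat.query_range(0, 100)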

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored. If header is None, do not save the header.

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary of column names and a future object for each store operation. Use dask.compute() to wait for the store operations on the result.
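
For example, a sketch with a hypothetical output path and column names:

>>> # store two columns under the '1/' data set of a new bigfile directory
>>> cat.save('catalog.bigfile', columns=['Position', 'Velocity'], dataset='1/')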

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be floating or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource
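
For example, with a hypothetical 'Mass' column:

>>> # descending global sort by Mass, keeping only two columns
>>> sorted_cat = cat.sort(['Mass'], reverse=True, usecols=['Position', 'Mass'])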

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh
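
For example, a sketch using hypothetical mesh parameters:

>>> # paint to a 256^3 mesh with TSC interpolation, window compensation,
>>> # and interlacing; BoxSize is in the catalog's length units
>>> mesh = cat.to_mesh(Nmesh=256, BoxSize=1380.0, resampler='tsc',
...                    compensated=True, interlaced=True)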

to_subvolumes(domain=None, position='Position', columns=None)

Domain-decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – The domain used to distribute the catalog. If None, try to evenly divide spatially. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – column to use to compute the position.

  • columns (list of string_like) – columns to include in the new catalog; if not supplied, all columns will be exchanged.

Returns

A decomposed catalog source, where each rank only contains objects belonging to that rank, as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource
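
For example, a sketch letting the method build the default spatial decomposition:

>>> # exchange only the Position column across ranks
>>> decomposed = cat.to_subvolumes(position='Position', columns=['Position'])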

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.source.catalog.file.BinaryCatalog(path, *args, **kwargs)

A CatalogSource that uses BinaryFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters
  • path (str) – the name of the binary file to load

  • dtype (numpy.dtype or list of tuples) – the dtypes of the columns to load; this should be either a numpy.dtype or be able to be converted to one via a numpy.dtype() call

  • offsets (dict, optional) – a dictionary specifying the byte offsets of each column in the binary file; if not supplied, the offsets are inferred from the dtype size of each column, assuming a fixed header size, and contiguous storage

  • header_size (int, optional) – the size of the header in bytes

  • size (int, optional) – the number of objects in the binary file; if not provided, the value is inferred from the dtype and the total size of the file in bytes

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator

  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
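
As an illustration, a self-contained sketch that writes a small binary file with column-contiguous storage and no header, then loads it back; the file name is hypothetical:

>>> import numpy
>>> from nbodykit.source.catalog.file import BinaryCatalog
>>> pos = numpy.random.random(size=(1024, 3))
>>> vel = numpy.random.random(size=(1024, 3))
>>> with open('binary-example.dat', 'wb') as ff:
...     pos.tofile(ff)   # all positions first, then all velocities
...     vel.tofile(ff)
>>> cat = BinaryCatalog('binary-example.dat',
...                     [('Position', ('f8', 3)), ('Velocity', ('f8', 3))],
...                     size=1024)
>>> cat.columns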

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

Parameters
  • start (int) – start of the file relative to the physical file

  • end (int) – end of the file relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function will raise ValueError, since the operation in that case is not well defined.

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored. If header is None, do not save the header.

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary of column names and a future object for each store operation. Use dask.compute() to wait for the store operations on the result.

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be floating or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain-decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – The domain used to distribute the catalog. If None, try to evenly divide spatially. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – column to use to compute the position.

  • columns (list of string_like) – columns to include in the new catalog; if not supplied, all columns will be exchanged.

Returns

A decomposed catalog source, where each rank only contains objects belonging to that rank, as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.source.catalog.file.CSVCatalog(path, *args, **kwargs)

A CatalogSource that uses CSVFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters
  • path (str) – the name of the file to load

  • names (list of str) – the names of the columns of the csv file; this should give names of all the columns in the file – pass usecols to select a subset of columns

  • blocksize (int, optional) – the file will be partitioned into blocks of bytes roughly of this size

  • dtype (dict, str, optional) – if specified as a string, assume all columns have this dtype; otherwise, each column can have a dtype entry in the dict. If not specified, the data types will be inferred from the file

  • usecols (list, optional) – a pandas.read_csv keyword; a subset of names to store, ignoring all other columns

  • delim_whitespace (bool, optional) – a pandas.read_csv keyword; if the CSV file is space-separated, set this to True

  • **config – additional keyword arguments that will be passed to pandas.read_csv(); see the documentation of that function for a full list of possible options

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator

  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
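
As an illustration, a self-contained sketch that writes a small whitespace-separated text file and reads it back; the file name and column names are hypothetical:

>>> import numpy
>>> from nbodykit.source.catalog.file import CSVCatalog
>>> data = numpy.random.random(size=(100, 5))
>>> numpy.savetxt('csv-example.txt', data, fmt='%.7e')
>>> # names must be given for every column present in the file
>>> cat = CSVCatalog('csv-example.txt', names=['a', 'b', 'c', 'd', 'e'],
...                  delim_whitespace=True)
>>> cat.columns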

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

Parameters
  • start (int) – start of the file relative to the physical file

  • end (int) – end of the file relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function will raise ValueError, since the operation in that case is not well defined.

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored. If header is None, do not save the header.

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary of column names and a future object for each store operation. Use dask.compute() to wait for the store operations on the result.

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be floating or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain-decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – The domain used to distribute the catalog. If None, try to evenly divide spatially. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – column to use to compute the position.

  • columns (list of string_like) – columns to include in the new catalog; if not supplied, all columns will be exchanged.

Returns

A decomposed catalog source, where each rank only contains objects belonging to that rank, as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.source.catalog.file.FITSCatalog(path, *args, **kwargs)

A CatalogSource that uses FITSFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters
  • path (str) – the file path to load

  • ext (number or string, optional) – The extension. Either the numerical extension from zero or a string extension name. If not sent, data is read from the first HDU that has data.

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator

  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
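
For instance, a hedged sketch; the file name 'galaxies.fits' is hypothetical:

>>> from nbodykit.source.catalog.file import FITSCatalog
>>> # read from HDU 1; ext may also be a string extension name
>>> cat = FITSCatalog('galaxies.fits', ext=1)
>>> cat.columns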

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

Parameters
  • start (int) – start of the file relative to the physical file

  • end (int) – end of the file relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function will raise ValueError, since the operation in that case is not well defined.

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved and attrs are saved in header. The attrs of columns are stored in the datasets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str, optional, or None) – the name of the data set holding the header information, where attrs is stored. If header is None, do not save the header.

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary of column names and a future object for each store operation. Use dask.compute() to wait for the store operations on the result.

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be floating or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain-decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the Position.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – The domain used to distribute the catalog. If None, try to evenly divide spatially. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – column to use to compute the position.

  • columns (list of string_like) – columns to include in the new catalog; if not supplied, all columns will be exchanged.

Returns

A decomposed catalog source, where each rank only contains objects belonging to that rank, as claimed by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new empty class of type type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.source.catalog.file.FileCatalogBase(filetype, path, args=(), kwargs={}, comm=None)

Base class to create a source of particles from a single file, or multiple files, on disk.

Files of a specific type should be subclasses of this class.

Parameters
  • filetype (subclass of FileType) – the file-like class used to load the data from file; should be a subclass of nbodykit.io.base.FileType

  • path (str or list of str) – the file path(s) to load; if a string, it is expanded as a glob pattern

  • args (tuple, optional) – the arguments to pass to the filetype class when constructing each file object

  • kwargs (dict, optional) – the keyword arguments to pass to the filetype class when constructing each file object

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator
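
As a hedged sketch of using the base class directly, one of the bundled readers (here nbodykit.io.csv.CSVFile) can be supplied as filetype; the file and column names are hypothetical and reuse the CSVCatalog example above:

>>> from nbodykit.io.csv import CSVFile
>>> from nbodykit.source.catalog.file import FileCatalogBase
>>> # keyword arguments for each CSVFile object are passed via kwargs
>>> cat = FileCatalogBase(CSVFile, 'csv-example.txt',
...                       kwargs=dict(names=['a', 'b', 'c', 'd', 'e']))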

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.
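
A short sketch of the four indexing modes, using a synthetic UniformCatalog (Position and Velocity are default columns of that source):

>>> from nbodykit.lab import UniformCatalog
>>> cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
>>> pos = cat['Position']                    # 1. column name -> dask array
>>> left = cat[cat['Position'][:, 0] < 0.5]  # 2. boolean array -> sliced catalog
>>> head = cat[:10]                          # 3. slice -> the first ten objects
>>> some = cat[['Position', 'Velocity']]     # 4. list of names -> column subset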

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.
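
For example, a minimal sketch pairing read() with compute(), using a synthetic UniformCatalog:

>>> from nbodykit.lab import UniformCatalog
>>> cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
>>> # read() returns lazy dask arrays; compute() evaluates them to numpy arrays
>>> pos, vel = cat.compute(*cat.read(['Position', 'Velocity']))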

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view() in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)[source]

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice
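
A minimal sketch, using a synthetic catalog large enough for the slice:

>>> from nbodykit.lab import UniformCatalog
>>> cat = UniformCatalog(nbar=1000, BoxSize=1.0, seed=42)
>>> sub = cat.gslice(0, 50)  # objects 0-49 of the *global* catalog
>>> # with redistribute=True (the default), those objects are spread evenly over the ranks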

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array
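
A minimal sketch of attaching a numpy array as a new column via make_column(); the column name 'Mass' is illustrative only:

>>> import numpy as np
>>> from nbodykit.lab import UniformCatalog
>>> cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
>>> mass = cat.make_column(np.ones(cat.size))  # numpy array -> dask.array.Array
>>> cat['Mass'] = mass  # __setitem__ attaches it as a new column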

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)[source]

Seek to a range in the file catalog.

Parameters
  • start (int) – the start of the range, relative to the physical file

  • end (int) – the end of the range, relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function raises ValueError, since the operation in that case is not well defined.
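
A minimal sketch, assuming cat is a file-backed catalog (e.g. an instance of a FileCatalogBase subclass) with at least 500 rows and no assigned columns:

>>> part = cat.query_range(0, 500)  # a new catalog viewing only rows [0, 500) of the file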

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved, and attrs are saved in header. The attrs of the columns are stored in their data sets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str or None, optional) – the name of the data set holding the header information, where attrs is stored; if header is None, do not save the header

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary mapping each column name to a future object for its store operation. Use dask.compute() to wait for the store operations on the result.
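
A minimal sketch, with an illustrative output path:

>>> from nbodykit.lab import UniformCatalog
>>> cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
>>> # writes the two columns to a new bigfile directory; attrs go to the 'Header' data set
>>> cat.save('catalog.bigfile', columns=['Position', 'Velocity'])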

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be of floating-point or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource
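
A minimal sketch; the 'Mass' column here is hypothetical and attached only for the example:

>>> import numpy as np
>>> from nbodykit.lab import UniformCatalog
>>> cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
>>> cat['Mass'] = np.random.uniform(size=cat.size)  # hypothetical column
>>> # descending sort on Mass, keeping only two columns in the result
>>> heaviest_first = cat.sort(['Mass'], reverse=True, usecols=['Mass', 'Position'])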

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the position of each object.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – the domain over which to distribute the catalog. If None, try to divide the box evenly in space. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – the name of the column to use as the position.

  • columns (list of string_like) – the columns to include in the new catalog; if not supplied, all columns are exchanged.

Returns

A decomposed catalog source, where each rank contains only the objects assigned to it by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new, empty object of the given type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

nbodykit.source.catalog.file.FileCatalogFactory(name, filetype, examples=None)[source]

Factory method to create a CatalogSource that uses a subclass of nbodykit.io.base.FileType to read data from disk.

Parameters
  • name (str) – the name of the catalog class to create

  • filetype (subclass of nbodykit.io.base.FileType) – the subclass of the FileType that reads a specific type of data

  • examples (str, optional) – if given, a documentation cross-reference link where examples can be found

Returns

the CatalogSource object that reads data using filetype

Return type

subclass of FileCatalogBase
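
For example, a file-specific catalog class can be built from a FileType subclass as follows; the file name and column names are illustrative only:

>>> from nbodykit.source.catalog.file import FileCatalogFactory
>>> from nbodykit.io.csv import CSVFile
>>> CSVCatalog = FileCatalogFactory('CSVCatalog', CSVFile)
>>> cat = CSVCatalog('particles.csv', ['x', 'y', 'z'])  # arguments are forwarded to CSVFile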

class nbodykit.source.catalog.file.Gadget1Catalog(path, *args, **kwargs)

A CatalogSource that uses Gadget1File to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters
  • path (str) – the path to the binary file to load

  • columndefs (list) – a list of triplets (columnname, element_dtype, particle_types)

  • ptype (int) – type of particle of interest.

  • hdtype (list, dtype) – dtype of the header; must define Massarr and Npart

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator

  • attrs (dict, optional) – dictionary of meta-data to store in attrs
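
Examples

A minimal sketch, with an illustrative snapshot path; ptype=1 selects the desired particle species:

>>> from nbodykit.lab import Gadget1Catalog
>>> cat = Gadget1Catalog('snapshot_000', ptype=1)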

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view() in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

Parameters
  • start (int) – the start of the range, relative to the physical file

  • end (int) – the end of the range, relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function raises ValueError, since the operation in that case is not well defined.

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved, and attrs are saved in header. The attrs of the columns are stored in their data sets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str or None, optional) – the name of the data set holding the header information, where attrs is stored; if header is None, do not save the header

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary mapping each column name to a future object for its store operation. Use dask.compute() to wait for the store operations on the result.

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be of floating-point or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the position of each object.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – the domain over which to distribute the catalog. If None, try to divide the box evenly in space. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – the name of the column to use as the position.

  • columns (list of string_like) – the columns to include in the new catalog; if not supplied, all columns are exchanged.

Returns

A decomposed catalog source, where each rank contains only the objects assigned to it by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new, empty object of the given type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.source.catalog.file.HDFCatalog(path, *args, **kwargs)

A CatalogSource that uses HDFFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters
  • path (str) – the file path to load

  • root (str, optional) – the start path in the HDF file, loading all data below this path

  • exclude (list of str, optional) – list of path names to exclude; these can be absolute paths, or paths relative to root

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator

  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
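
As a minimal sketch, with an illustrative file name and HDF5 group:

>>> from nbodykit.lab import HDFCatalog
>>> cat = HDFCatalog('data.hdf5', root='particles')
>>> print(cat.columns)  # data sets below `root` become column names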

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view() in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

Parameters
  • start (int) – the start of the range, relative to the physical file

  • end (int) – the end of the range, relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function raises ValueError, since the operation in that case is not well defined.

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved, and attrs are saved in header. The attrs of the columns are stored in their data sets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str or None, optional) – the name of the data set holding the header information, where attrs is stored; if header is None, do not save the header

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary mapping each column name to a future object for its store operation. Use dask.compute() to wait for the store operations on the result.

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be of floating-point or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the position of each object.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – the domain over which to distribute the catalog. If None, try to divide the box evenly in space. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – the name of the column to use as the position.

  • columns (list of string_like) – the columns to include in the new catalog; if not supplied, all columns are exchanged.

Returns

A decomposed catalog source, where each rank contains only the objects assigned to it by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new, empty object of the given type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.

class nbodykit.source.catalog.file.TPMBinaryCatalog(path, *args, **kwargs)

A CatalogSource that uses TPMBinaryFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters
  • path (str) – the path to the binary file to load

  • precision ({'f4', 'f8'}, optional) – the string dtype specifying the precision

  • comm (MPI Communicator, optional) – the MPI communicator instance; default (None) sets to the current communicator

  • attrs (dict, optional) – dictionary of meta-data to store in attrs
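
Examples

A minimal sketch, with an illustrative snapshot path; precision should match how the binary file was written:

>>> from nbodykit.lab import TPMBinaryCatalog
>>> cat = TPMBinaryCatalog('tpm.bin', precision='f4')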

Attributes
Index

The attribute giving the global index rank of each particle in the list.

attrs

A dictionary storing relevant meta-data about the CatalogSource.

columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

hardcolumns

The union of the columns in the file and any transformed columns.

size

The number of objects in the CatalogSource on the local rank.

Methods

Selection()

A boolean column that selects a subset slice of the CatalogSource.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

Weight()

The column giving the weight to use for each particle on the mesh.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

get_hardcolumn(col)

Return a column from the underlying file source.

gslice(start, stop[, end, redistribute])

Execute a global slice of a CatalogSource.

make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

persist([columns])

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

read(columns)

Return the requested columns as dask arrays.

save(output[, columns, dataset, datasets, ...])

Save the CatalogSource to a bigfile.BigFile.

sort(keys[, reverse, usecols])

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

to_mesh([Nmesh, BoxSize, dtype, interlaced, ...])

Convert the CatalogSource to a MeshSource, using the specified parameters.

to_subvolumes([domain, position, columns])

Domain Decompose a catalog, sending items to the ranks according to the supplied domain object.

view([type])

Return a "view" of the CatalogSource object, with the returned type set by type.

create_instance

property Index

The attribute giving the global index rank of each particle in the list. It is an integer from 0 to self.csize.

Note that slicing changes this index value.

Selection()

A boolean column that selects a subset slice of the CatalogSource.

By default, this column is set to True for all particles, and all CatalogSource objects will contain this column.

Value()

When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

Weight()

The column giving the weight to use for each particle on the mesh.

The mesh field is a weighted average of Value, with the weights given by Weight.

By default, this array is set to unity for all particles, and all CatalogSource objects will contain this column.

__delitem__(col)

Delete a column; cannot delete a “hard-coded” column.

Note

If the base attribute is set, columns will be deleted from base instead of from self.

__finalize__(other)

Finalize the creation of a CatalogSource object by copying over any additional attributes from a second CatalogSource.

The idea here is to only copy over attributes that are similar to meta-data, so we do not copy some of the core attributes of the CatalogSource object.

Parameters

other – the second object to copy over attributes from; it needs to be a subclass of CatalogSourceBase for attributes to be copied

Returns

return self, with the added attributes

Return type

CatalogSource

__getitem__(sel)

The following types of indexing are supported:

  1. strings specifying a column in the CatalogSource; returns a dask array holding the column data

  2. boolean arrays specifying a slice of the CatalogSource; returns a CatalogSource holding only the relevant slice

  3. slice object specifying which particles to select

  4. list of strings specifying column names; returns a CatalogSource holding only the selected columns

Notes

  • Slicing is a collective operation

  • If the base attribute is set, columns will be returned from base instead of from self.

__len__()

The local size of the CatalogSource on a given rank.

__setitem__(col, value)

Add columns to the CatalogSource, overriding any existing columns with the name col.

property attrs

A dictionary storing relevant meta-data about the CatalogSource.

property columns

All columns in the CatalogSource, including those hard-coded into the class’s definition and any override columns provided by the user.

Note

If the base attribute is set, the value of base.columns will be returned.

compute(*args, **kwargs)

Our version of dask.compute() that computes multiple delayed dask collections at once.

This should be called on the return value of read() to convert any dask arrays to numpy arrays.

Note

If the base attribute is set, compute() will be called using base instead of self.

Parameters

args (object) – Any number of objects. If the object is a dask collection, it’s computed and the result is returned. Otherwise it’s passed through unchanged.

copy()

Return a shallow copy of the object, where each column is a reference of the corresponding column in self.

Note

No copy of data is made.

Note

This is different from view() in that the attributes dictionary of the copy is no longer related to self.

Returns

a new CatalogSource that holds all of the data columns of self

Return type

CatalogSource

property csize

The total, collective size of the CatalogSource, i.e., summed across all ranks.

It is the sum of size across all available ranks.

If the base attribute is set, the base.csize attribute will be returned.

get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

gslice(start, stop, end=1, redistribute=True)

Execute a global slice of a CatalogSource.

Note

After the global slice is performed, the data is scattered evenly across all ranks.

Note

The current algorithm generates an index on the root rank and does not scale well.

Parameters
  • start (int) – the start index of the global slice

  • stop (int) – the stop index of the global slice

  • step (int, optional) – the step size of the global slice

  • redistribute (bool, optional) – if True, evenly re-distribute the sliced data across all ranks, otherwise just return any local data part of the global slice

property hardcolumns

The union of the columns in the file and any transformed columns.

static make_column(array)

Utility function to convert an array-like object to a dask.array.Array.

Note

The dask array chunk size is controlled via the dask_chunk_size global option. See set_options.

Parameters

array (array_like) – an array-like object; can be a dask array, numpy array, ColumnAccessor, or other non-scalar array-like object

Returns

a dask array initialized from array

Return type

dask.array.Array

persist(columns=None)

Return a CatalogSource, where the selected columns are computed and persist in memory.

query_range(start, end)

Seek to a range in the file catalog.

Parameters
  • start (int) – the start of the range, relative to the physical file

  • end (int) – the end of the range, relative to the physical file

Returns

A new catalog that only accesses the given region of the file.

If the original catalog (self) contains any assigned columns not directly obtained from the file, the function raises ValueError, since the operation in that case is not well defined.

read(columns)

Return the requested columns as dask arrays.

Parameters

columns (list of str) – the names of the requested columns

Returns

the list of column data, in the form of dask arrays

Return type

list of dask.array.Array

save(output, columns=None, dataset=None, datasets=None, header='Header', compute=True)

Save the CatalogSource to a bigfile.BigFile.

Only the selected columns are saved, and attrs are saved in header. The attrs of the columns are stored in their data sets.

Parameters
  • output (str) – the name of the file to write to

  • columns (list of str) – the names of the columns to save in the file, or None to use all columns

  • dataset (str, optional) – dataset to store the columns under.

  • datasets (list of str, optional) – names for the data set where each column is stored; defaults to the name of the column (deprecated)

  • header (str or None, optional) – the name of the data set holding the header information, where attrs is stored; if header is None, do not save the header

  • compute (boolean, default True) – if True, wait until the store operations finish; if False, return a dictionary mapping each column name to a future object for its store operation. Use dask.compute() to wait for the store operations on the result.

property size

The number of objects in the CatalogSource on the local rank.

If the base attribute is set, the base.size attribute will be returned.

Important

This property must be defined for all subclasses.

sort(keys, reverse=False, usecols=None)

Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.

Sort columns must be of floating-point or integer type.

Note

After the sort operation, the data is scattered evenly across all ranks.

Parameters
  • keys (list, tuple) – the names of columns to sort by. If multiple columns are provided, the data is sorted consecutively in the order provided

  • reverse (bool, optional) – if True, perform descending sort operations

  • usecols (list, optional) – the name of the columns to include in the returned CatalogSource

to_mesh(Nmesh=None, BoxSize=None, dtype='f4', interlaced=False, compensated=False, resampler='cic', weight='Weight', value='Value', selection='Selection', position='Position', window=None)

Convert the CatalogSource to a MeshSource, using the specified parameters.

Parameters
  • Nmesh (int, optional) – the number of cells per side on the mesh; must be provided if not stored in attrs

  • BoxSize (scalar, 3-vector, optional) – the size of the box; must be provided if not stored in attrs

  • dtype (string, optional) – the data type of the mesh array

  • interlaced (bool, optional) – use the interlacing technique of Sefusatti et al. 2015 to reduce the effects of aliasing on Fourier space quantities computed from the mesh

  • compensated (bool, optional) – whether to correct for the resampler window introduced by the grid interpolation scheme

  • resampler (str, optional) – the string specifying which resampler interpolation scheme to use; see pmesh.resampler.methods

  • weight (str, optional) – the name of the column specifying the weight for each particle

  • value (str, optional) – the name of the column specifying the field value for each particle

  • selection (str, optional) – the name of the column that specifies which (if any) slice of the CatalogSource to take

  • position (str, optional) – the name of the column that specifies the position data of the objects in the catalog

  • window (str, deprecated) – use resampler instead.

Returns

mesh – a mesh object that provides an interface for gridding particle data onto a specified mesh

Return type

CatalogMesh

to_subvolumes(domain=None, position='Position', columns=None)

Domain decompose a catalog, sending items to the ranks according to the supplied domain object, using the position column as the position of each object.

This will read in the full position array and all of the requested columns.

Parameters
  • domain (pmesh.domain.GridND object, or None) – the domain over which to distribute the catalog. If None, try to divide the box evenly in space. The easiest way to obtain a domain object is to use pm.domain, where pm is a pmesh.pm.ParticleMesh object.

  • position (string_like) – the name of the column to use as the position.

  • columns (list of string_like) – the columns to include in the new catalog; if not supplied, all columns are exchanged.

Returns

A decomposed catalog source, where each rank contains only the objects assigned to it by the domain object.

self.attrs are carried over as a shallow copy to the returned object.

Return type

CatalogSource

view(type=None)

Return a “view” of the CatalogSource object, with the returned type set by type.

This initializes a new, empty object of the given type and attaches attributes to it via the __finalize__() mechanism.

Parameters

type (Python type) – the desired class type of the returned object.