nbodykit.source.catalog.file

Functions

FileCatalogFactory(name, filetype[, examples]) Factory method to create a CatalogSource that uses a subclass of nbodykit.io.base.FileType to read data from disk.

Classes

BigFileCatalog(*args, **kwargs) A CatalogSource that uses BigFile to read data from disk.
BinaryCatalog(*args, **kwargs) A CatalogSource that uses BinaryFile to read data from disk.
CSVCatalog(*args, **kwargs) A CatalogSource that uses CSVFile to read data from disk.
FITSCatalog(*args, **kwargs) A CatalogSource that uses FITSFile to read data from disk.
FileCatalogBase(filetype[, args, kwargs, …]) Base class to create a source of particles from a single file, or multiple files, on disk.
Gadget1Catalog(*args, **kwargs) A CatalogSource that uses Gadget1File to read data from disk.
HDFCatalog(*args, **kwargs) A CatalogSource that uses HDFFile to read data from disk.
TPMBinaryCatalog(*args, **kwargs) A CatalogSource that uses TPMBinaryFile to read data from disk.
nbodykit.source.catalog.file.FileCatalogFactory(name, filetype, examples=None)

Factory method to create a CatalogSource that uses a subclass of nbodykit.io.base.FileType to read data from disk.

Parameters:
  • name (str) – the name of the catalog class to create
  • filetype (subclass of nbodykit.io.base.FileType) – the subclass of the FileType that reads a specific type of data
  • examples (str, optional) – if given, a documentation cross-reference link where examples can be found
Returns:

the CatalogSource object that reads data using filetype

Return type:

subclass of FileCatalogBase
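
Examples

A sketch of using the factory with a toy FileType subclass; MyFile, its single column 'x', and the placeholder path are illustrative and not part of nbodykit:

    import numpy
    from nbodykit.io.base import FileType
    from nbodykit.source.catalog.file import FileCatalogFactory

    class MyFile(FileType):
        """A toy reader: each "file" holds 100 rows of a single column 'x'."""
        def __init__(self, path):
            self.path = path
            self.dtype = numpy.dtype([('x', 'f8')])  # column layout
            self.size = 100                          # rows in this file

        def read(self, columns, start, stop, step=1):
            # return the requested columns of rows [start:stop:step]
            values = numpy.arange(start, stop, step, dtype='f8')
            data = numpy.empty(len(values), dtype=self.dtype)
            data['x'] = values
            return data[columns]

    MyCatalog = FileCatalogFactory('MyCatalog', MyFile)

    # the path argument is globbed, so a placeholder file must exist on disk
    open('dummy.dat', 'w').close()
    cat = MyCatalog('dummy.dat')
    print(cat['x'])  # a dask array backed by MyFile.read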

class nbodykit.source.catalog.file.FileCatalogBase(filetype, args=(), kwargs={}, comm=None, use_cache=False)

Base class to create a source of particles from a single file, or multiple files, on disk.

Files of a specific type should be subclasses of this class.

Parameters:
  • filetype (subclass of FileType) – the file-like class used to load the data from file; should be a subclass of nbodykit.io.base.FileType
  • args (tuple, optional) – the arguments to pass to the filetype class when constructing each file object
  • kwargs (dict, optional) – the keyword arguments to pass to the filetype class when constructing each file object
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
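
Examples

Although the concrete catalog classes below are normally generated with FileCatalogFactory(), constructing the base class directly illustrates how filetype, args, and kwargs fit together. A minimal sketch, assuming a plain-text file data.csv with three columns exists on disk (the file and column names are illustrative):

    from nbodykit.io.csv import CSVFile
    from nbodykit.source.catalog.file import FileCatalogBase

    # args and kwargs are forwarded to CSVFile for each file matching the path
    cat = FileCatalogBase(filetype=CSVFile,
                          args=('data.csv',),
                          kwargs={'names': ['x', 'y', 'z']})
    print(cat.hardcolumns)  # -> ['x', 'y', 'z']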

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
get_hardcolumn(col)

Return a column from the underlying file source.

Columns are returned as dask arrays.

hardcolumns

The union of the columns in the file and any transformed columns.
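
For example, with a hypothetical plain-text file data.csv (see CSVCatalog below), the hard columns can be listed and fetched lazily:

    from nbodykit.source.catalog.file import CSVCatalog

    cat = CSVCatalog('data.csv', names=['x', 'y', 'z'])
    print(cat.hardcolumns)       # the columns provided by the file itself
    x = cat.get_hardcolumn('x')  # a dask array; no data is read yet
    print(x.compute())           # evaluating triggers the actual disk read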

class nbodykit.source.catalog.file.CSVCatalog(*args, **kwargs)

A CatalogSource that uses CSVFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters:
  • path (str) – the name of the file to load
  • names (list of str) – the names of the columns of the csv file; this should give names of all the columns in the file – pass usecols to select a subset of columns
  • blocksize (int, optional) – the file will be partitioned into blocks of bytes roughly of this size
  • dtype (dict, str, optional) – if specified as a string, assume all columns have this dtype; otherwise, each column can have a dtype entry in the dict; if not specified, the data types will be inferred from the file
  • usecols (list, optional) – a pandas.read_csv keyword; a subset of names to store, ignoring all other columns
  • delim_whitespace (bool, optional) – a pandas.read_csv keyword; if the CSV file is space-separated, set this to True
  • **config – additional keyword arguments that will be passed to pandas.read_csv(); see the documentation of that function for a full list of possible options
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
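
A minimal sketch, assuming a whitespace-separated text file data.csv with five columns exists on disk (the file and column names are illustrative):

    from nbodykit.source.catalog.file import CSVCatalog
    from nbodykit import transform

    # every column in the file must be named; usecols selects a subset
    names = ['x', 'y', 'z', 'vx', 'vy']
    cat = CSVCatalog('data.csv', names, delim_whitespace=True,
                     usecols=['x', 'y', 'z'])

    # stack the scalar columns into a (N, 3) Position column
    cat['Position'] = transform.StackColumns(cat['x'], cat['y'], cat['z'])
    print(cat.csize)  # the total number of rows across all MPI ranks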

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
class nbodykit.source.catalog.file.BinaryCatalog(*args, **kwargs)

A CatalogSource that uses BinaryFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters:
  • path (str) – the name of the binary file to load
  • dtype (numpy.dtype or list of tuples) – the dtypes of the columns to load; this should be either a numpy.dtype or be able to be converted to one via a numpy.dtype() call
  • offsets (dict, optional) – a dictionary specifying the byte offsets of each column in the binary file; if not supplied, the offsets are inferred from the dtype size of each column, assuming a fixed header size and contiguous storage
  • header_size (int, optional) – the size of the header in bytes
  • size (int, optional) – the number of objects in the binary file; if not provided, the value is inferred from the dtype and the total size of the file in bytes
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
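
A minimal sketch, assuming a headerless binary file particles.bin that stores all Position values followed by all Velocity values as float32 (the file name is illustrative):

    from nbodykit.source.catalog.file import BinaryCatalog

    dtype = [('Position', ('f4', 3)), ('Velocity', ('f4', 3))]
    cat = BinaryCatalog('particles.bin', dtype, header_size=0)
    print(cat.hardcolumns)  # -> ['Position', 'Velocity']
    print(cat['Position'])  # a dask array of shape (N, 3)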

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
class nbodykit.source.catalog.file.BigFileCatalog(*args, **kwargs)

A CatalogSource that uses BigFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters:
  • path (str) – the name of the directory holding the bigfile data
  • exclude (list of str, optional) – the data sets to exclude from loading within bigfile; default is the header
  • header (str, optional) – the path to the header; the default is to use a column named ‘Header’. The path is relative to the file, not the dataset.
  • dataset (str) – load a specific dataset from the bigfile; default is to start from the root.
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
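
A minimal sketch, assuming a bigfile directory fastpm_1.0000 that contains a dataset '1/' and a header block 'Header', as written by e.g. FastPM (the names are illustrative):

    from nbodykit.source.catalog.file import BigFileCatalog

    cat = BigFileCatalog('fastpm_1.0000', dataset='1/', header='Header')
    print(cat.attrs)        # meta-data read from the Header block
    print(cat['Position'])  # a dask array backed by the bigfile column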

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
class nbodykit.source.catalog.file.HDFCatalog(*args, **kwargs)

A CatalogSource that uses HDFFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters:
  • path (str) – the file path to load
  • root (str, optional) – the start path in the HDF file, loading all data below this path
  • exclude (list of str, optional) – list of path names to exclude; these can be absolute paths, or paths relative to root
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
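
A minimal sketch, assuming an HDF5 file data.hdf5 whose datasets live under a group named Particles (the file and group names are illustrative):

    from nbodykit.source.catalog.file import HDFCatalog

    # load everything below root; dataset names become column names
    cat = HDFCatalog('data.hdf5', root='Particles')
    print(cat.hardcolumns)              # e.g. ['Position', 'Velocity']
    pos = cat.compute(cat['Position'])  # evaluate the dask array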

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
class nbodykit.source.catalog.file.TPMBinaryCatalog(*args, **kwargs)

A CatalogSource that uses TPMBinaryFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters:
  • path (str) – the path to the binary file to load
  • precision ({'f4', 'f8'}, optional) – the string dtype specifying the precision
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
  • attrs (dict, optional) – dictionary of meta-data to store in attrs
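
Examples

A minimal sketch, assuming a TPM snapshot tpm.bin written in single precision (the file name is illustrative):

    from nbodykit.source.catalog.file import TPMBinaryCatalog

    cat = TPMBinaryCatalog('tpm.bin', precision='f4')
    # the TPM layout defines Position, Velocity, and ID columns
    print(cat.hardcolumns)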

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
class nbodykit.source.catalog.file.Gadget1Catalog(*args, **kwargs)

A CatalogSource that uses Gadget1File to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters:
  • path (str) – the path to the binary file to load
  • columndefs (list) – a list of triplets (columnname, element_dtype, particle_types)
  • ptype (int) – the type of particle of interest
  • hdtype (list, dtype) – dtype of the header; must define Massarr and Npart
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
  • attrs (dict, optional) – dictionary of meta-data to store in attrs
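
Examples

A minimal sketch, assuming a legacy Gadget-1 snapshot file snapshot_000 and selecting the dark matter particles via ptype=1 (the file name is illustrative):

    from nbodykit.source.catalog.file import Gadget1Catalog

    cat = Gadget1Catalog('snapshot_000', ptype=1)
    print(cat.attrs)        # header fields such as Massarr and Npart
    print(cat.hardcolumns)  # the columns defined by the default columndefs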

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.
class nbodykit.source.catalog.file.FITSCatalog(*args, **kwargs)

A CatalogSource that uses FITSFile to read data from disk.

Multiple files can be read at once by supplying a list of file names or a glob asterisk pattern as the path argument. See Reading Multiple Data Files at Once for examples.

Parameters:
  • path (str) – the file path to load
  • ext (number or string, optional) – the extension to read, either the numerical extension counted from zero or a string extension name; if not given, data is read from the first HDU that has data
  • comm (MPI Communicator, optional) – the MPI communicator instance; the default (None) uses the current communicator
  • use_cache (bool, optional) – whether to cache data read from disk; default is False
  • attrs (dict, optional) – dictionary of meta-data to store in attrs

Examples

Please see the documentation for examples.
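
A minimal sketch, assuming a FITS file data.fits whose first extension holds a binary table with a Position column (the file and column names are illustrative):

    from nbodykit.source.catalog.file import FITSCatalog

    cat = FITSCatalog('data.fits', ext=1)  # read the table in extension 1
    print(cat.hardcolumns)                 # the columns of the FITS table

    # interpolate onto a mesh; to_mesh() requires a Position column, and the
    # BoxSize is given here explicitly
    mesh = cat.to_mesh(Nmesh=128, BoxSize=1000.0)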

Attributes

Index The attribute giving the global index rank of each particle in the list.
attrs A dictionary storing relevant meta-data about the CatalogSource.
columns All columns in the CatalogSource, including those hard-coded into the class’s definition and override columns provided by the user.
csize The total, collective size of the CatalogSource, i.e., summed across all ranks.
hardcolumns The union of the columns in the file and any transformed columns.
size The number of objects in the CatalogSource on the local rank.
use_cache If set to True, use the built-in caching features of dask to cache data in memory.

Methods

Selection() A boolean column that selects a subset slice of the CatalogSource.
Value() When interpolating a CatalogSource on to a mesh, the value of this array is used as the Value that each particle contributes to a given mesh cell.
Weight() The column giving the weight to use for each particle on the mesh.
compute(*args, **kwargs) Our version of dask.compute() that computes multiple delayed dask collections at once.
copy() Return a shallow copy of the object, where each column is a reference of the corresponding column in self.
get_hardcolumn(col) Return a column from the underlying file source.
gslice(start, stop[, end, redistribute]) Execute a global slice of a CatalogSource.
make_column(array) Utility function to convert an array-like object to a dask.array.Array.
read(columns) Return the requested columns as dask arrays.
save(output, columns[, datasets, header]) Save the CatalogSource to a bigfile.BigFile.
sort(keys[, reverse, usecols]) Return a CatalogSource, sorted globally across all MPI ranks in ascending order by the input keys.
to_mesh([Nmesh, BoxSize, dtype, interlaced, …]) Convert the CatalogSource to a MeshSource, using the specified parameters.
view([type]) Return a “view” of the CatalogSource object, with the returned type set by type.