A Brief Introduction

In this section, we provide a brief overview of the major functionality of nbodykit, as well as an introduction to some of the technical jargon needed to get up and running quickly. The aim is to familiarize the user with the various aspects of nbodykit needed to take full advantage of its computing power. This section also serves as an outline of the documentation, with links to more detailed descriptions included throughout.

The lab framework

A core design goal of nbodykit is maintaining an interactive user experience, allowing the user to quickly experiment and play around with data sets and statistics, while still leveraging the power of parallel processing when necessary. Motivated by the power of Jupyter notebooks, we adopt a “lab” framework for nbodykit, where all of the necessary data containers and algorithms can be imported from a single module:

from nbodykit.lab import *

# [insert cool science here]

See the documentation for nbodykit.lab for a full list of the imported members in this module.

With all of the necessary tools in hand, the user can easily load a data set, compute statistics of that data via one of the built-in algorithms, and save the results, all in just a few lines. The end product is a reproducible scientific result, generated from clear and concise code that flows from step to step.
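For example, the sketch below shows this flow, using a catalog of uniformly distributed particles as a stand-in for real data (the file name and mesh size here are arbitrary):

from nbodykit.lab import *

# a stand-in for a real data set: uniformly distributed objects in a unit box
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# compute the 1D power spectrum via a built-in algorithm
result = FFTPower(cat, mode='1d', Nmesh=64)

# save the result and its meta-data to disk
result.save('power.json')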

Setting up logging

We use the logging module throughout nbodykit to provide the user with output as scripts progress. This is especially helpful for diagnosing problems encountered when running nbodykit in parallel. Users can turn on logging via the nbodykit.setup_logging() function, optionally passing the “debug” argument to increase the logging level.

We typically begin our nbodykit scripts using:

from nbodykit.lab import *
from nbodykit import setup_logging

setup_logging() # log output to stdout; setup_logging("debug") increases the verbosity

Parallel computation with MPI

The nbodykit package is fully parallelized using the Python bindings of the Message Passing Interface (MPI) available in mpi4py. While we aim to hide most of the complexities of MPI from the top-level user interface, it is helpful to know some basic aspects of the MPI framework to understand how nbodykit computes its results. If you are unfamiliar with MPI, a good place to start is the documentation for mpi4py. Briefly, MPI allows nbodykit to use a specified number of CPUs, which work independently toward a common goal, passing messages back and forth to coordinate their work.

We provide a more in-depth discussion of the key MPI-related features of nbodykit in the Parallel Computation with nbodykit section. This section also includes a guide on how to execute nbodykit scripts in parallel using MPI.
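As a minimal sketch, the communicator that nbodykit uses internally can be accessed via the CurrentMPIComm class:

from nbodykit import CurrentMPIComm

# the global MPI communicator shared by nbodykit catalogs and algorithms
comm = CurrentMPIComm.get()
print("this is rank %d of %d" % (comm.rank, comm.size))

Launching this script with, e.g., mpirun -n 4 python script.py (the script name is arbitrary) runs it on 4 CPUs, with nbodykit coordinating the work among them.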

Cosmology and units

nbodykit includes a cosmology calculator Cosmology, as well as several built-in cosmologies, in the nbodykit.cosmology module. This class relies on the CLASS CMB Boltzmann code for the majority of its cosmology calculations, via the Python binding of CLASS provided by the classylss package. As such, the syntax used in the Cosmology class largely follows that of the CLASS code.

To best interface with CLASS, and avoid unnecessary confusion, nbodykit assumes a default set of units:

  • distance: \(h^{-1} \ \mathrm{Mpc}\)

  • wavenumber: \(h \ \mathrm{Mpc}^{-1}\)

  • velocity: \(\mathrm{km} \ \mathrm{s}^{-1}\)

  • temperature: \(\mathrm{K}\)

  • power: \(h^{-3} \ \mathrm{Mpc}^3\)

  • density: \(10^{10} (h^{-1} \ M_\odot) (h^{-1} \ \mathrm{Mpc})^{-3}\)

  • neutrino mass: \(\mathrm{eV}\)

  • time: \(\mathrm{Gyr}\)

  • \(H_0\): \((\mathrm{km} \ \mathrm{s^{-1}}) / (h^{-1} \ \mathrm{Mpc})\)

We choose to define quantities with respect to the dimensionless Hubble parameter \(h\) when appropriate. Users should always take care when loading data to verify that the units follow the conventions defined here. Also, note that when simulated data is generated by nbodykit, e.g., in HODCatalog, the units of quantities such as position and velocity will follow the above conventions.

The nbodykit.cosmology module also includes functionality for computing the theoretical linear power spectrum (using CLASS or analytic transfer functions), correlation functions, and the Zel’dovich power spectrum. See the Cosmological Calculations section for more details.
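As a brief sketch of this functionality, the snippet below evaluates the linear power spectrum for the built-in Planck 2015 cosmology (the wavenumber range chosen here is arbitrary):

import numpy
from nbodykit.lab import cosmology

# one of the built-in cosmologies
cosmo = cosmology.Planck15

# the linear power spectrum at z = 0, using the CLASS transfer function
Plin = cosmology.LinearPower(cosmo, redshift=0.0, transfer='CLASS')

k = numpy.logspace(-2, 0, 100)  # wavenumbers, in units of h/Mpc
Pk = Plin(k)                    # power, in units of (Mpc/h)^3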

Interacting with data in nbodykit

The algorithms in nbodykit interface with user data in two main ways: “object catalogs” and “mesh fields”.

Catalogs

Catalogs hold columns of data for a set of discrete objects, typically galaxies. The columns typically include the three-dimensional positions of the objects, as well as their properties, e.g., mass or luminosity. The catalog container represents the attributes of the objects as columns in the catalog. A catalog object behaves much like a structured NumPy array, with a fixed size and named data type fields, except that the underlying data is provided via a random-read interface.

Catalog objects are subclasses of the CatalogSource base class and live in the nbodykit.source.catalog module. We provide several different subclasses that are capable of loading data from a variety of file formats on disk. We also provide catalog classes that can generate a simulated set of particles. Users can find a more in-depth discussion of catalog data in Discrete Data Catalogs. For a full list of available catalogs, see the API docs.
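For example, a plain-text file can be read with the CSVCatalog class; the file name and column names below are hypothetical:

from nbodykit.lab import CSVCatalog, transform

# a hypothetical plain-text file with columns: x, y, z, mass
f = CSVCatalog('data.csv', names=['x', 'y', 'z', 'mass'])

# stack the individual coordinate columns into the (N, 3) 'Position'
# column expected by most nbodykit algorithms
f['Position'] = transform.StackColumns(f['x'], f['y'], f['z'])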

Meshes

The mesh container is fundamentally different from the catalog object. It stores a discrete representation of a continuous fluid field on a uniform mesh. The array values on the mesh are generated via a process referred to as “painting” in nbodykit. During the painting step, the positions of the discrete objects in a catalog are interpolated onto a uniform mesh. The fluid field on the mesh is often the density field, as sampled by the discrete galaxy positions.

Mesh objects are subclasses of the MeshSource base class and live in the nbodykit.source.mesh module. We provide subclasses that are capable of loading mesh data from disk or from a NumPy array, as well as classes that can generate simulated meshes.

Furthermore, any catalog object can be converted to a mesh object via the to_mesh() function. This function returns a CatalogMesh object, which is a view of a CatalogSource as a MeshSource. A CatalogMesh “knows” how to generate the mesh data from the catalog data, using the parameters, such as the desired mesh size, that the user specified in the to_mesh() call.
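As a short sketch, the snippet below converts a catalog of uniformly distributed particles to a mesh and paints the density field (the mesh size here is arbitrary):

from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# view the catalog as a density field on a 64^3 mesh
mesh = cat.to_mesh(Nmesh=64)

# interpolate the particles onto the mesh, returning a RealField
rfield = mesh.paint(mode='real')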

The Data on a Mesh section describes mesh objects in more detail. In particular, more details regarding the creation of mesh objects from catalogs can be found in Creating a Mesh. See the API docs for a full list of available meshes.

A component-based approach

The design of nbodykit focuses on a component-based approach. The components are exposed to the Python language as a set of classes and interfaces, and users can combine these components to construct complex applications. This design differs from the more commonly used alternative in cosmology software, which is a monolithic application controlled by a single configuration file (e.g., as in CLASS, CAMB, Gadget). From experience, we have found that a component-based approach offers the user greater freedom and flexibility to build complex applications with nbodykit.

[Figure: the important interfaces and components of nbodykit]

In the figure above, we diagram the important interfaces and components of nbodykit. There are a few items worth highlighting in more detail:

  • Catalog: as discussed in the previous section, catalog objects derive from the CatalogSource class and hold information about discrete objects. Catalogs also implement a random-read interface that allows the user to access individual columns of data. The random-read nature of the column access makes use of the high throughput of a parallel file system when nbodykit is executed in parallel.

    However, the backend of the random-read interface does not have to be a file on disk at all. As an example, the ArrayCatalog simply converts a dictionary or a NumPy array object to a CatalogSource.

  • Mesh: as discussed in the previous section, mesh objects derive from the MeshSource class and store a discrete representation of a continuous quantity on a uniform mesh. These objects expose a “paintable” interface to the user via the paint() function. Calling this function re-samples the fluid field represented by the mesh object to a distributed three-dimensional array (returning either a RealField or ComplexField, as implemented by the pmesh package). See the Dealing with Data on a Mesh section for more details.

  • Serialization: most objects in nbodykit are serializable via a save() function. For a more in-depth discussion of serialization, see Saving your Results.

    Algorithm classes not only save the result of the algorithm but also input parameters and meta-data stored in the attrs dictionary. Algorithms typically implement both a save() and load() function, such that the algorithm result can be de-serialized into an object of the same type. For example, the result of the FFTPower algorithm can be serialized with the save() function and the algorithm re-initialized with the load() function.

    The two main data containers, catalogs and meshes, can be serialized using nbodykit’s intrinsic format, which relies on bigfile. The relevant functions are CatalogSource.save() for catalogs and MeshSource.save() for meshes. These serialized results can later be loaded from disk by nbodykit as a BigFileCatalog or BigFileMesh object, as sketched below.
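As a minimal sketch of this round trip for catalogs, the snippet below saves two columns of a random catalog to nbodykit’s bigfile format and loads them back (the file name is arbitrary, and depending on the nbodykit version, the dataset and header names may need to be given explicitly):

from nbodykit.lab import UniformCatalog, BigFileCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# save the Position and Velocity columns in the bigfile-based format
cat.save('catalog.bigfile', columns=['Position', 'Velocity'])

# ... and later load the catalog back from disk
cat2 = BigFileCatalog('catalog.bigfile', header='Header')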

Catalogs and dask

The data columns of catalog objects are stored as dask arrays rather than as the more traditional NumPy arrays. Users unfamiliar with the dask package should start with the On Demand IO via dask.array section of the docs.

Briefly, there are two main features to keep in mind when dealing with dask arrays:

1. Unlike operations on NumPy arrays, operations on a dask array are not evaluated immediately; instead, they are stored internally in a task graph. Thus, the usual array manipulations on dask arrays return nearly immediately.

2. A dask array can be evaluated, returning a NumPy array, via a call to the compute() function of the dask array. This operation can be time-consuming, as it evaluates all of the operations in the array’s task graph.

In most situations, users should manipulate catalog columns as they would NumPy arrays and allow the nbodykit internals to call the necessary compute() function to get the final result. When possible, users should opt to use the functions defined in the dask.array module instead of the equivalent function defined in numpy. The dask.array module is designed to provide the same functionality as the numpy package but for dask arrays.
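As a short illustration, again using a random catalog as a stand-in for real data, the snippet below builds a lazy dask expression and only evaluates it when compute() is called:

import dask.array as da
from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# this builds a task graph; nothing is evaluated yet
radius = da.sum(cat['Position']**2, axis=-1)**0.5

# this evaluates the task graph, returning a NumPy array
result = radius.compute()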

Running your favorite algorithm

nbodykit aims to implement a canonical set of algorithms in the field of large-scale structure. The goal is to provide open source, state-of-the-art implementations of the most well-known algorithms used in the analysis of large-scale structure data. We have a wide and growing range of algorithms implemented so far. Briefly, nbodykit includes functionality for:

  • generating density fields via the painting operation

  • computing the power spectrum of density fields for both simulations and observational surveys

  • calculating two-point and three-point correlation functions

  • computing groups of objects using a Friends-of-Friends method or a cylindrical radius method

  • generating HOD catalogs of galaxies from catalogs of dark matter halos

  • running quasi N-body simulations using the FastPM scheme

For a full list of the available algorithms, see this section of the docs. We also aim to provide examples of many of the algorithms in The Cookbook.

The algorithms in nbodykit couple to data through the catalog and mesh objects described in the previous sections. Algorithms in nbodykit are implemented as Python classes. When the class is initialized, the algorithm is run, and the returned instance holds the corresponding results via attributes. The specific attributes that hold the results vary from algorithm to algorithm; we direct users to the API docs to determine the specifics for a particular algorithm. Furthermore, the algorithm result can be serialized to disk for archiving. We also ensure that the appropriate meta-data is serialized to disk in order to sufficiently describe the input parameters for reproducibility.
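For example, the FFTPower algorithm stores the measured power spectrum in the power attribute of the returned instance; a sketch, again using a random catalog:

from nbodykit.lab import UniformCatalog, FFTPower

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# initializing the class runs the algorithm
r = FFTPower(cat, mode='1d', Nmesh=64)

# for FFTPower, the result lives in the 'power' attribute
print(r.power['k'])      # the mean wavenumber of each bin
print(r.power['power'])  # the measured power in each bin
print(r.attrs)           # the input parameters and meta-data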

As open source software, we hope community contributions will help to maximize the utility of the nbodykit package for its users. We believe community contributions and review can help increase scientific productivity for all researchers. If your favorite algorithm isn’t yet implemented, we encourage contributions and feature requests from the community (see our contributing guidelines).

The Cookbook

We’ve created a cookbook of recipes to help users learn nbodykit by example. These recipes illustrate interesting and common uses of nbodykit. The goal is to have working examples for most of the algorithms in nbodykit, as well as some of the more common data tasks.

The recipes are provided as Jupyter notebooks. Each notebook is available for download by clicking the “Source” link in the navigation bar at the top of the page.

We welcome contributions of new recipes! See our contributing guidelines.

Questions, feedback, and contributions

If you’ve run into problems with nbodykit, do not hesitate to get in touch with us. See our Contact and Support section for details on how to best contact us.

User contributions are also very welcome! Please see our contributing guidelines if you’d like to help grow the nbodykit project!