

a massively parallel, large-scale structure toolkit

nbodykit is an open source project written in Python that provides a set of state-of-the-art, large-scale structure algorithms useful in the analysis of cosmological datasets from N-body simulations and observational surveys. All algorithms are massively parallel and run using the Message Passing Interface (MPI).

Driven by optimism regarding the abundance and availability of large-scale computing resources in the future, the development of nbodykit distinguishes itself from other similar software packages (e.g., nbodyshop, pynbody, yt, xi) by focusing on:

  • a unified treatment of simulation and observational datasets by insulating algorithms from data containers

  • support for a wide variety of data formats, as well as large volumes of data

  • the ability to reduce wall-clock time by scaling to thousands of cores

  • deployment and availability on large, supercomputing facilities

  • an interactive user interface that performs as well in a Jupyter notebook as on a supercomputing machine


Learning By Example

For users who wish to dive right in, an interactive environment containing our cookbook recipes is available to users via the BinderHub service. Just click the launch button below to get started!

[Binder launch button]

See The Cookbook for descriptions of the various notebook recipes.


Getting nbodykit

To get up and running with your own copy of nbodykit, please follow the installation instructions. nbodykit is currently supported on macOS and Linux architectures. The recommended installation method uses the Anaconda Python distribution.

nbodykit is compatible with Python versions 2.7, 3.5, and 3.6, and the source code is publicly available at https://github.com/bccp/nbodykit.


A 1 minute introduction to nbodykit

To start, we initialize the nbodykit “lab”:

from nbodykit.lab import *

There are two core data structures in nbodykit: catalogs and meshes. These represent the two main ways astronomers interact with data in large-scale structure analysis. Catalogs hold information describing a set of discrete objects, storing the data in columns. nbodykit includes functionality for initializing catalogs from a variety of file formats as well as more advanced techniques for generating catalogs of simulated particles.

Below, we create a very simple catalog of uniformly distributed particles in a box of side length \(L = 1 \ h^{-1} \mathrm{Mpc}\):

catalog = UniformCatalog(nbar=100, BoxSize=1.0)

Catalogs have a fixed size and a set of columns describing the particle data. In this case, our catalog has “Position” and “Velocity” columns. Users can easily manipulate the existing column data or add new columns:

BoxSize = 2500.
catalog['Position'] *= BoxSize # re-normalize units of Position
catalog['Mass'] = 10**(numpy.random.uniform(12, 15, size=len(catalog))) # add some random mass values

We can generate a representation of the density field on a mesh using our catalog of objects. Here, we interpolate the particles onto a mesh of size \(64^3\):

mesh = catalog.to_mesh(Nmesh=64, BoxSize=BoxSize)

We can save our mesh to disk to later re-load using nbodykit:

mesh.save('mesh.bigfile')

or preview a low-resolution, 2D projection of the mesh to make sure everything looks as expected:

import matplotlib.pyplot as plt
plt.imshow(mesh.preview(axes=[0,1], Nmesh=32))

Finally, we can feed our density field mesh into one of the nbodykit algorithms. For example, below we use the FFTPower algorithm to compute the power spectrum \(P(k,\mu)\) of the density mesh using a fast Fourier transform via

result = FFTPower(mesh, mode='2d', Nmu=5)

with the measured power stored as the power attribute of the result variable. The algorithm result and meta-data, input parameters, etc. can then be saved to disk as a JSON file:

result.save("power-result.json")
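
To give a flavor of how to work with the output (a minimal sketch; the field names below follow the FFTPower documentation, and the shot noise value is stored in the result's attrs dictionary), the measured \(P(k,\mu)\) can be accessed and the saved JSON re-loaded as:

# access the measured P(k,mu); "power" is complex, so take the real part
Pkmu = result.power
print(Pkmu['k'], Pkmu['mu'])
print(Pkmu['power'].real - Pkmu.attrs['shotnoise'])

# re-load the serialized result at a later time
result = FFTPower.load("power-result.json")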

It is important to remember that nbodykit is fully parallelized using MPI. This means that the above code snippets can be executed in a Jupyter notebook with only a single CPU or in a standalone Python script with an arbitrary number of MPI workers. We aim to hide as much of the parallel abstraction from users as possible. When executing in parallel, the data is automatically divided amongst the available MPI workers, each worker computes its own smaller portion of the result, and these partial calculations are then combined into the final result.


Getting Started

We also provide detailed overviews of the two main data containers in nbodykit, catalogs and meshes, and we walk through the necessary background information for each of the available algorithms in nbodykit. The main areas of the documentation can be broken down into the following sub-sections:


Getting Help

Install nbodykit

Getting nbodykit

nbodykit is currently supported on macOS and Linux architectures. The recommended installation method is to install nbodykit and its dependencies as part of the Anaconda Python distribution. For more advanced users, or those without an Anaconda distribution, the software can also be installed using the pip utility or directly from the source. In these latter cases, additional difficulties may arise when trying to compile and install some of nbodykit’s dependencies.

The package is available for Python versions 2.7, 3.5, and 3.6.

Installing nbodykit with Anaconda

The easiest installation method uses the conda utility, as part of the Anaconda package manager. The distribution can be downloaded and installed for free from https://www.continuum.io/downloads.

We have pre-built binaries for nbodykit and all of its dependencies available via Anaconda that are compatible with Linux and macOS platforms. To avoid dependency conflicts, we recommend installing nbodykit into a fresh environment. This can be achieved with the following commands:

$ conda create --name nbodykit-env python=3 # or use python=2 for python 2.7*
$ source activate nbodykit-env
$ conda install -c bccp nbodykit
Updating nbodykit and its dependencies

To keep nbodykit and its dependencies up to date, use

$ conda update -c bccp --all
Installing From Source for Conda-based Installs

Users can also install nbodykit from source from within a conda environment, given that git is installed. First, clone the GitHub repository,

$ git clone http://github.com/bccp/nbodykit
$ cd nbodykit

Then, install all of the required dependencies using

$ conda install -c bccp --file requirements.txt
$ conda install -c bccp --file requirements-extras.txt

Now, the desired git branch can be installed easily. By default, the master branch is active, and the latest development state of nbodykit can be installed using

$ pip install -e .

Note that the -e flag installs nbodykit in “develop” mode, allowing the installed version to reflect the latest changes in the source code. The latest changes to the master branch can be incorporated from GitHub via

$ git checkout master
$ git pull origin master

Installing nbodykit with pip

Warning

The easiest and recommended method to install nbodykit and its dependencies is using the Anaconda package. See Installing nbodykit with Anaconda for more details.

To build nbodykit from source, you will need to make sure all of the dependencies are properly installed on your system. To start, the following dependencies should be installed first:

$ pip install numpy cython mpi4py

Next, we must compile the remaining dependencies; the exact steps depend on the user's machine.

Linux

To install nbodykit as well as all of its external dependencies on a Linux machine into the default Python installation directory:

$ pip install nbodykit[extras]

A different installation directory can be specified via the --user or --root <dir> options of the pip install command.

macOS

More care is required to properly build the dependencies on macOS machines. The autotools software is required and can be installed with the MacPorts package manager:

$ sudo port install autoconf automake libtool

Using recent versions of MacPorts, we also need to tell mpicc to use gcc rather than the default clang compiler, which does not compile fftw correctly due to the lack of OpenMP support. Additionally, the LDSHARED environment variable must be explicitly set.

In bash, the installation command is:

$ export OMPI_CC=gcc
$ export LDSHARED="mpicc -bundle -undefined dynamic_lookup -DOMPI_IMPORTS"; pip install nbodykit[extras]

This command will compile and install the dependencies of nbodykit and then install nbodykit. Again, a different installation directory can be specified via the --user or --root <dir> options of the pip install command.

nbodykit on NERSC

Note

This section covers using nbodykit on the computing nodes of NERSC. The computing nodes require special care because they do not work with the MPI installation provided by Anaconda.

If instead you wish to use nbodykit on the login nodes of NERSC or the Jupyter Hub services (available at https://jupyter.nersc.gov and https://jupyter-dev.nersc.gov/), you should follow the Anaconda installation instructions to install nbodykit. The login nodes and JupyterHub machines are very similar to standard computers. For more information on the JupyterHub services, see the official NERSC guide.

Development and testing of nbodykit was performed on the NERSC supercomputing machines at Lawrence Berkeley National Laboratory. We maintain a daily build of the latest stable version of nbodykit on NERSC systems for Python versions 2.7, 3.5, and 3.6 and provide a tool to automatically load the appropriate environment when running jobs on either the Edison or Cori machines.

To load the latest stable version of nbodykit on NERSC, the following line should be added to the beginning of the user’s job script:

# load python 3.6 with latest stable nbodykit
# can also specify 2.7 or 3.5 here
source /usr/common/contrib/bccp/conda-activate.sh 3.6

If instead the user wishes to install the latest development version of nbodykit, the following lines should be added to the job script:

# first load python 3.6 with latest stable nbodykit
# can also specify 2.7 or 3.5 here
source /usr/common/contrib/bccp/conda-activate.sh 3.6

# overwrite nbodykit with the latest version from the tip of master
bcast-pip git+git://github.com/bccp/nbodykit.git

In the nbodykit source directory, we include an example Python script and job script for users. To run this example on NERSC, first download the necessary files:

# download the example script
$ wget https://raw.githubusercontent.com/bccp/nbodykit/master/nersc/example.py

# download the job script
$ wget https://raw.githubusercontent.com/bccp/nbodykit/master/nersc/example-job.slurm

and then if on the Cori machine, the job can be submitted using

$ sbatch -C haswell example-job.slurm

or if on the Edison machine, use

$ sbatch example-job.slurm

The example job script simply loads the nbodykit environment and executes the Python script in parallel, in this case, using 16 CPUs.

#!/bin/bash
#SBATCH -p debug
#SBATCH -o nbkit-example
#SBATCH -n 16

# load nbodykit
source /usr/common/contrib/bccp/conda-activate.sh 3.6

# run the main nbodykit example
srun -n 16 python example.py

If successful, this will save a file nbkit_example_power.json to the current working directory.

A Brief Introduction

In this section, we provide a brief overview of the major functionality of nbodykit, as well as an introduction to some of the technical jargon needed to get up and running quickly. We try to familiarize the user with the various aspects of nbodykit needed to take full advantage of its computing power. This section also serves as an outline of the documentation, with links to more detailed descriptions included throughout.

The lab framework

A core design goal of nbodykit is maintaining an interactive user experience, allowing the user to quickly experiment and play around with data sets and statistics, while still leveraging the power of parallel processing when necessary. Motivated by the power of Jupyter notebooks, we adopt a “lab” framework for nbodykit, where all of the necessary data containers and algorithms can be imported from a single module:

from nbodykit.lab import *

[insert cool science here]

See the documentation for nbodykit.lab for a full list of the imported members in this module.

With all of the necessary tools now in hand, the user can easily load a data set, compute statistics of that data via one of the built-in algorithms, and save the results in just a few lines. The end result is a reproducible scientific result, generated from clear and concise code that flows from step to step.

Setting up logging

We use the logging module throughout nbodykit to provide the user with output as scripts progress. This is especially helpful when problems are encountered when using nbodykit in parallel. Users can turn on logging via the nbodykit.setup_logging() function, optionally providing the “debug” argument to increase the logging level.

We typically begin our nbodykit scripts using:

from nbodykit.lab import *
from nbodykit import setup_logging

setup_logging() # log output to stdout

Parallel computation with MPI

The nbodykit package is fully parallelized using the Python bindings of the Message Passing Interface (MPI) available in mpi4py. While we aim to hide most of the complexities of MPI from the top-level user interface, it is helpful to know some basic aspects of the MPI framework to understand how nbodykit computes its results. If you are unfamiliar with MPI, a good place to start is the documentation for mpi4py. Briefly, MPI allows nbodykit to use a specified number of CPUs, which work independently to achieve a common goal and pass messages back and forth to coordinate their work.

We provide a more in-depth discussion of the key MPI-related features of nbodykit in the Parallel Computation with nbodykit section. This section also includes a guide on how to execute nbodykit scripts in parallel using MPI.
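
As an illustrative sketch (the script name and the number of ranks below are arbitrary choices), an nbodykit script is executed in parallel simply by launching it with an MPI runner; every rank runs the same code, and nbodykit divides the work internally:

# file: measure_power.py (hypothetical name)
from nbodykit.lab import *
from nbodykit import setup_logging

setup_logging()

# each MPI rank holds its own portion of the catalog
catalog = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
mesh = catalog.to_mesh(Nmesh=64)

# the algorithm runs collectively across all ranks
result = FFTPower(mesh, mode='1d')
result.save("power.json")

# launch with, e.g.: mpirun -n 8 python measure_power.py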

Cosmology and units

nbodykit includes a cosmology calculator Cosmology, as well as several built-in cosmologies, in the nbodykit.cosmology module. This class relies on the CLASS CMB Boltzmann code for the majority of its cosmology calculations, via the Python binding of CLASS provided by the classylss package. As such, the syntax used in the Cosmology class largely follows that of the CLASS code.

To best interface with CLASS, and avoid unnecessary confusion, nbodykit assumes a default set of units:

  • distance: \(h^{-1} \ \mathrm{Mpc}\)

  • wavenumber: \(h \ \mathrm{Mpc}^{-1}\)

  • velocity: \(\mathrm{km} \ \mathrm{s}^{-1}\)

  • temperature: \(\mathrm{K}\)

  • power: \(h^{-3} \ \mathrm{Mpc}^3\)

  • density: \(10^{10} (h^{-1} \ M_\odot) (h^{-1} \ \mathrm{Mpc})^{-3}\)

  • neutrino mass: \(\mathrm{eV}\)

  • time: \(\mathrm{Gyr}\)

  • \(H_0\): \((\mathrm{km} \ \mathrm{s^{-1}}) / (h^{-1} \ \mathrm{Mpc})\)

We choose to define quantities with respect to the dimensionless Hubble parameter \(h\) when appropriate. Users should always take care when loading data to verify that the units follow the conventions defined here. Also, note that when simulated data is generated by nbodykit, e.g., in HODCatalog, the units of quantities such as position and velocity will follow the above conventions.
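
For instance, a quick way to verify the distance convention is to evaluate a background quantity directly (a minimal sketch using the builtin Planck15 cosmology; comoving_distance is a method of the Cosmology class):

from nbodykit.lab import cosmology

cosmo = cosmology.Planck15

# the comoving distance to z = 0.5, returned in Mpc/h per the conventions above
print(cosmo.comoving_distance(0.5))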

The nbodykit.cosmology module also includes functionality for computing the theoretical linear power spectrum (using CLASS or analytic transfer functions), correlation functions, and the Zel’dovich power spectrum. See the Cosmological Calculations section for more details.

Interacting with data in nbodykit

The algorithms in nbodykit interface with user data in two main ways: “object catalogs” and “mesh fields”.

Catalogs

Catalogs hold columns of data for a set of discrete objects, typically galaxies. The columns typically include the three-dimensional positions of the objects, as well as other properties of each object, e.g., mass, luminosity, etc. The catalog container represents these attributes as named columns. A catalog object behaves much like a structured NumPy array, with a fixed size and named data type fields, except that the underlying data is accessed through a random-read interface.

Catalog objects are subclasses of the CatalogSource base class and live in the nbodykit.source.catalog module. We provide several different subclasses that are capable of loading data from a variety of file formats on disk. We also provide catalog classes that can generate a simulated set of particles. Users can find a more in depth discussion of catalog data in Discrete Data Catalogs. For a full list of available catalogs, see the API docs.
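
For example, a plain-text catalog can be read with the CSVCatalog subclass and the individual coordinate columns combined into a Position column (a sketch with a hypothetical file name and column names):

from nbodykit.lab import CSVCatalog, transform

# hypothetical whitespace-separated file with columns x, y, z, vx, vy, vz
names = ['x', 'y', 'z', 'vx', 'vy', 'vz']
cat = CSVCatalog('particles.dat', names)

# stack the individual coordinate columns into a single (N,3) Position column
cat['Position'] = transform.StackColumns(cat['x'], cat['y'], cat['z'])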

Meshes

The mesh container is fundamentally different from the catalog object. It stores a discrete representation of a continuous fluid field on a uniform mesh. The array values on the mesh are generated via a process referred to as “painting” in nbodykit. During the painting step, the positions of the discrete objects in a catalog are interpolated onto a uniform mesh. The fluid field on the mesh is often the density field, as sampled by the discrete galaxy positions.

Mesh objects are subclasses of the MeshSource base class and live in the nbodykit.source.mesh module. We provide subclasses that are capable of loading mesh data from disk or from a Numpy array, as well as classes that can generate simulated meshes.

Furthermore, any catalog object can be converted to a mesh object via the to_mesh() function. This function returns a CatalogMesh object, which is a view of a CatalogSource as a MeshSource. A CatalogMesh "knows" how to generate the mesh data from the catalog data, i.e., it stores the parameters, such as the desired mesh size, that the user specified in the to_mesh() call.

The Data on a Mesh section describes mesh objects in more detail. In particular, more details regarding the creation of mesh objects from catalogs can be found in Creating a Mesh. See the API docs for a full list of available meshes.
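
As a concrete sketch of the painting step (re-using the UniformCatalog from the introduction; the painted field is normalized to \(1+\delta\) by default):

from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# view the catalog as a mesh; no interpolation is performed yet
mesh = cat.to_mesh(Nmesh=64)

# "paint" the density field, returning a distributed RealField
rfield = mesh.paint(mode='real')
print(rfield.cmean())  # the mean of 1 + delta should be close to 1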

A component-based approach

The design of nbodykit focuses on a component-based approach. The components are exposed to the Python language as a set of classes and interfaces, and users can combine these components to construct complex applications. This design differs from the more commonly used alternative in cosmology software, which is a monolithic application controlled by a single configuration file (e.g., as in CLASS, CAMB, Gadget). From experience, we have found that a component-based approach offers the user greater freedom and flexibility to build complex applications with nbodykit.

[Figure: a diagram of the important interfaces and components of nbodykit]

In the figure above, we diagram the important interfaces and components of nbodykit. There are a few items worth highlighting in more detail:

  • Catalog: as discussed in the previous section, catalog objects derive from the CatalogSource class and hold information about discrete objects. Catalogs also implement a random-read interface that allows the user to access individual columns of data. The random-read nature of the column access makes use of the high throughput of a parallel file system when nbodykit is executed in parallel.

    However, the backend of the random-read interface does not have to be a file on disk at all. As an example, the ArrayCatalog simply converts a dictionary or a NumPy array object to a CatalogSource.

  • Mesh: as discussed in the previous section, mesh objects derive from the MeshSource class and store a discrete representation of a continuous quantity on a uniform mesh. These objects provide a “paintable” interface, exposed to the user via the paint() function. Calling this function re-samples the fluid field represented by the mesh object onto a distributed three-dimensional array (returning either a RealField or ComplexField, as implemented by the pmesh package). See Dealing with Data on a Mesh for more details.

  • Serialization: most objects in nbodykit are serializable via a save() function. For a more in-depth discussion of serialization, see Saving your Results.

    Algorithm classes not only save the result of the algorithm but also input parameters and meta-data stored in the attrs dictionary. Algorithms typically implement both a save() and load() function, such that the algorithm result can be de-serialized into an object of the same type. For example, the result of the FFTPower algorithm can be serialized with the save() function and the algorithm re-initialized with the load() function.

    The two main data containers, catalogs and meshes, can be serialized using nbodykit’s intrinsic format, which relies on bigfile. The relevant functions are CatalogSource.save() for catalogs and MeshSource.save() for meshes. These serialized results can later be loaded from disk by nbodykit as a BigFileCatalog or BigFileMesh object.

Catalogs and dask

The data columns of catalog objects are stored as dask arrays rather than the similar, more traditional NumPy arrays. Users unfamiliar with the dask package should start with the On Demand IO via dask.array section of the docs.

Briefly, there are two main features to keep in mind when dealing with dask arrays:

1. Operations on a dask array are not evaluated immediately, as they would be for NumPy arrays, but are instead stored internally in a task graph. As a result, the usual array manipulations on dask arrays return almost immediately.

2. A dask array can be evaluated, returning a NumPy array, via a call to the compute() function of the dask array. This operation can be time-consuming, since it evaluates all of the operations in the array’s task graph.

In most situations, users should manipulate catalog columns as they would NumPy arrays and allow the nbodykit internals to call the necessary compute() function to get the final result. When possible, users should opt to use the functions defined in the dask.array module instead of the equivalent function defined in numpy. The dask.array module is designed to provide the same functionality as the numpy package but for dask arrays.
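
A minimal sketch of this workflow, applying dask.array functions to a catalog column and only calling compute() at the very end:

import dask.array as da
from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# column access returns a dask array; these operations only build a task graph
pos = cat['Position']
radius = da.sqrt((pos**2).sum(axis=-1))

# evaluation happens only here, returning a NumPy array
print(radius.compute()[:5])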

Running your favorite algorithm

nbodykit aims to implement a canonical set of algorithms in the field of large-scale structure. The goal is to provide open source, state-of-the-art implementations of the most well-known algorithms used in the analysis of large-scale structure data. We have a wide and growing range of algorithms implemented so far. Briefly, nbodykit includes functionality for:

  • generating density fields via the painting operation

  • computing the power spectrum of density fields for both simulations and observational surveys.

  • calculating two-point and three-point correlation functions

  • computing groups of objects using a Friends-of-Friends method or a cylindrical radius method

  • generating HOD catalogs of galaxies from catalogs of dark matter halos

  • running quasi N-body simulations using the FastPM scheme.

For a full list of the available algorithms, see this section of the docs. We also aim to provide examples of many of the algorithms in The Cookbook.

The algorithms in nbodykit couple to data through the catalog and mesh objects described in the previous sections. Algorithms in nbodykit are implemented as Python classes. When the class is initialized, the algorithm is run, and the returned instance holds the corresponding results via attributes. The specific attributes that hold the results vary from algorithm to algorithm; we direct users to the API docs to determine the specifics for a particular algorithm. Furthermore, the algorithm result can be serialized to disk for archiving. We also ensure that the appropriate meta-data is serialized in order to sufficiently describe the input parameters for reproducibility.

As open source software, we hope community contributions will help to maximize the utility of the nbodykit package for its users. We believe community contributions and review can help increase scientific productivity for all researchers. If your favorite algorithm isn’t yet implemented, we encourage contributions and feature requests from the community (see our contributing guidelines).

The Cookbook

We’ve created a cookbook of recipes for users to learn nbodykit by example. These recipes are designed to illustrate interesting and common uses of nbodykit for users to learn from. The goal is to have working examples for most of the algorithms in nbodykit, as well as some of the more common data tasks.

The recipes are provided as Jupyter notebooks. Each notebook is available for download by clicking the “Source” link in the navigation bar at the top of the page.

We welcome contributions of new recipes! See our contributing guidelines.

Questions, feedback, and contributions

If you’ve run in to problems with nbodykit, do not hesitate to get in touch with us. See our Contact and Support section for details on how to best contact us.

User contributions are also very welcome! Please see our contributing guidelines if you’d like to help grow the nbodykit project!

Cosmological Calculations

The nbodykit.cosmology module includes functionality for representing cosmological parameter sets and computing various theoretical quantities prevalent in large-scale structure that depend on the background cosmological model. The module relies on the CLASS CMB Boltzmann code as the underlying engine for its cosmological calculations via the Python wrapper provided by the classylss package. For consistency, the cosmology syntax used in nbodykit largely follows that of the CLASS code. A useful set of links for reference material on CLASS is available here.

The Cosmology class

nbodykit provides the Cosmology class for representing cosmological parameter sets. Given an input set of parameters, this class uses the CLASS code to evaluate various quantities relevant to large-scale structure. The class can compute background quantities, such as distances, as a function of redshift, as well as linear and nonlinear power spectra and transfer functions.

Note

For a full list of the attributes and functions available, please see the API docs.

The CLASS code supports multiple sets of input parameters, e.g., the user can specify H0 or h. For simplicity, we use the following parameters to initialize a Cosmology object:

  • h : the dimensionless Hubble parameter

  • T0_cmb : the temperature of the CMB in Kelvins

  • Omega0_b : the current baryon density parameter, \(\Omega_{b,0}\)

  • Omega0_cdm : the current cold dark matter density parameter, \(\Omega_{cdm,0}\)

  • N_ur : the number of ultra-relativistic (massless neutrino) species

  • m_ncdm : the masses (in eV) for all massive neutrino species

  • P_k_max : the maximum wavenumber to compute power spectrum results to, in units of \(h \mathrm{Mpc}^{-1}\)

  • P_z_max : the maximum redshift to compute power spectrum results to

  • gauge : the General Relativity gauge, either “synchronous” or “newtonian”

  • n_s : the tilt of the primordial power spectrum

  • nonlinear : whether to compute nonlinear power spectrum results via HaloFit

Any valid CLASS parameters can be additionally specified as keywords to the Cosmology constructor. If these additional parameters conflict with the base set of parameters above, an exception will be raised. For a guide to the valid input CLASS parameters, the “explanatory.ini” parameter file in the CLASS GitHub is a good place to start.
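
For example, a Cosmology object can be constructed by overriding a handful of the base parameters (the values below are purely illustrative):

from nbodykit.lab import cosmology

# override a few of the base parameters; everything else keeps its default value
cosmo = cosmology.Cosmology(h=0.7, Omega0_b=0.048, Omega0_cdm=0.26,
                            n_s=0.96, m_ncdm=[0.06])
print(cosmo.Omega0_m)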

Notes and Caveats

  • The default configuration assumes a flat cosmology, \(\Omega_{k,0}=0\). Pass Omega0_k as a keyword to specify the desired non-flat curvature.

  • By default, the value for N_ur is inferred from the number of massive neutrinos using the following logic: if you have respectively 1,2,3 massive neutrinos and use the default T_ncdm value (0.71611 K), which is designed to give m/omega of 93.14 eV, and you wish to have N_eff=3.046 in the early universe, then N_ur is set to 2.0328, 1.0196, 0.00641, respectively.

  • For consistency of variable names, the present day values can be passed with or without the “0” postfix, e.g., Omega0_cdm is translated to Omega_cdm as CLASS always uses the names without “0” as input parameters.

  • By default, a cosmological constant (Omega0_lambda) is assumed, with its density value inferred by the curvature condition.

  • Non-cosmological constant dark energy can be used by specifying the w0_fld, wa_fld, and/or Omega0_fld values.

  • To pass in CLASS parameters that are not valid Python argument names, use the dictionary/keyword arguments trick, e.g., Cosmology(..., **{'temperature contributions': 'y'})

Builtin Cosmologies

We include several builtin cosmologies for users, accessible from the nbodykit.cosmology module:

Name       Source                           \(H_0\)   \(\Omega_{m,0}\)   Flat
WMAP5      Komatsu et al. 2009              70.2      0.277              Yes
WMAP7      Komatsu et al. 2011              70.4      0.272              Yes
WMAP9      Hinshaw et al. 2013              69.3      0.287              Yes
Planck13   Planck Collab 2013, Paper XVI    67.8      0.307              Yes
Planck15   Planck Collab 2015, Paper XIII   67.7      0.307              Yes

Updating cosmological parameters

The Cosmology class is immutable. To update cosmological parameters, we provide the following functions:

  • clone(): provides a new Cosmology object with only the specified parameters changed; this function accepts any keyword parameters that can be passed to the Cosmology constructor

  • match(): provides a new Cosmology object that has matched a specific, derived parameter value. The derived parameters that can be matched include:

    • sigma8: re-scale the scalar amplitude A_s to match sigma8

    • Omega0_cb: match the total sum of cdm and baryons density, \(\Omega_{cdm,0}+\Omega_{b,0}\)

    • Omega0_m: match the total density of matter-like components, \(\Omega_{cdm,0}+\Omega_{b,0}+\Omega_{ncdm,0}-\Omega_{pncdm,0}+\Omega_{dcdm,0}\)

For example, we can match a specific sigma8 value using:

[2]:
from nbodykit.lab import cosmology

cosmo = cosmology.Cosmology()
print("original sigma8 = %.4f" % cosmo.sigma8)

new_cosmo = cosmo.match(sigma8=0.82)
print("new sigma8 = %.4f" % new_cosmo.sigma8)
original sigma8 = 0.8380
new sigma8 = 0.8200

And we can clone a Cosmology using

[3]:
cosmo = cosmo.clone(h=0.7, nonlinear=True)
Converting to a dictionary

The keyword parameters passed to the constructor of a Cosmology object are not passed directly to the underlying CLASS engine, but must be edited first to be compatible with CLASS. Users can access the full set of parameters passed directly to CLASS by converting the Cosmology object to a dict. This can be accomplished simply using:

[4]:
cosmo_dict = dict(cosmo)
print(cosmo_dict)
{'output': 'vTk dTk mPk', 'extra metric transfer functions': 'y', 'n_s': 0.9667, 'gauge': 'synchronous', 'N_ur': 2.0328, 'T_cmb': 2.7255, 'Omega_cdm': 0.26377065934278865, 'Omega_b': 0.0482754208891869, 'N_ncdm': 1, 'P_k_max_h/Mpc': 10.0, 'z_max_pk': 100.0, 'h': 0.7, 'm_ncdm': [0.06], 'non linear': 'halofit'}

Each of the parameters in this dictionary is a valid CLASS parameter. For example, we see that the CLASS parameter “non linear” has been set to “halofit”, since we have set the nonlinear parameter to True.

Note

To create a new Cosmology object from a dictionary, users should use Cosmology.from_dict(dict(c)). Because of the differences in syntax between the Cosmology constructor and CLASS, the following will not work: Cosmology(**dict(c)).
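
A short sketch of the round trip described in the note above:

# serialize to a CLASS-compatible dictionary and reconstruct the object
pars = dict(cosmo)
cosmo2 = cosmology.Cosmology.from_dict(pars)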

Interfacing with astropy

We provide some compatibility for interfacing with the astropy.cosmology module. The following functions of the Cosmology class allow users to convert between the nbodykit and astropy cosmology implementations:

to_astropy(self)

Initialize and return a subclass of astropy.cosmology.FLRW from the Cosmology class.

from_astropy(cosmo, **kwargs)

Initialize and return a Cosmology object from a subclass of astropy.cosmology.FLRW.

We also provide support for some of the syntax used by astropy for attributes in its cosmology classes by aliasing the astropy names to the appropriate nbodykit attributes. For example,

Om0

Returns Omega0_m.

Odm0

Returns Omega0_cdm.

Ode0

Returns the sum of Omega0_lambda and Omega0_fld.

Ob0

Returns Omega0_b.
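
A brief sketch of converting back and forth (using the builtin Planck15 cosmology; the aliased attributes should roughly agree, subject to the caveats in the warning below):

from nbodykit.lab import cosmology

c = cosmology.Planck15

# convert to a subclass of astropy.cosmology.FLRW
ac = c.to_astropy()
print(ac.Om0, c.Om0)

# and back to an nbodykit Cosmology
c2 = cosmology.Cosmology.from_astropy(ac)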

Warning

The conversion between astropy and nbodykit cosmology objects is not one-to-one, as the classes are implemented using different underlying engines. For example, the neutrino calculations performed in nbodykit using CLASS differ from those in astropy. We have attempted to provide as much compatibility as possible, but these differences will always remain when converting between the two implementations.

Theoretical Power Spectra

Linear power

The linear power spectrum can be computed for a given Cosmology and redshift using the LinearPower class. This object can compute the linear power spectrum using one of several transfer functions:

  • “CLASS”: calls CLASS to evaluate the total density transfer function at the specified redshift

  • “EisensteinHu”: uses the analytic transfer function from Eisenstein & Hu, 1998

  • “NoWiggleEisensteinHu”: uses the no-BAO analytic transfer function from Eisenstein & Hu, 1998

The LinearPower object is a callable function, taking wavenumber in units of \(h \mathrm{Mpc}^{-1}\) as input:

[5]:
import matplotlib.pyplot as plt
import numpy as np

c = cosmology.Planck15
Plin = cosmology.LinearPower(c, redshift=0., transfer='CLASS')

k = np.logspace(-3, 0, 100)
plt.loglog(k, Plin(k), c='k')

plt.xlabel(r"$k$ $[h \mathrm{Mpc}^{-1}]$")
plt.ylabel(r"$P$ $[h^{-3} \mathrm{Mpc}^{3}]$")
plt.show()
_images/getting-started_cosmology_9_0.png

The redshift and power spectrum normalization can easily be updated by adjusting the redshift and sigma8 attributes of the LinearPower instance. For example,

[6]:
# original power
plt.loglog(k, Plin(k), label=r"$z=0, \sigma_8=%.2f$" % Plin.sigma8)

# update the redshift and sigma8
Plin.redshift = 0.55
Plin.sigma8 = 0.80
plt.loglog(k, Plin(k), label=r"$z=0.55, \sigma_8=0.8$")

# format the axes
plt.legend()
plt.xlabel(r"$k$ $[h \mathrm{Mpc}^{-1}]$")
plt.ylabel(r"$P$ $[h^{-3} \mathrm{Mpc}^{3}]$")
plt.show()
_images/getting-started_cosmology_11_0.png
Nonlinear and Zel’dovich power

We also provide the HalofitPower and ZeldovichPower classes to compute the nonlinear and Zel’dovich power spectra, respectively. The nonlinear power uses the HaloFit prescription that is available in CLASS. These objects behave very similarly to the LinearPower class. For example,

[7]:
# initialize the power objects
Plin = cosmology.LinearPower(c, redshift=0, transfer='CLASS')
Pnl = cosmology.HalofitPower(c, redshift=0)
Pzel = cosmology.ZeldovichPower(c, redshift=0)

# plot each kind
plt.loglog(k, Plin(k), label='linear')
plt.loglog(k, Pnl(k), label='nonlinear')
plt.loglog(k, Pzel(k), label="Zel'dovich")

# format the axes
plt.legend()
plt.xlabel(r"$k$ $[h \mathrm{Mpc}^{-1}]$")
plt.ylabel(r"$P$ $[h^{-3} \mathrm{Mpc}^{3}]$")
plt.show()
_images/getting-started_cosmology_13_0.png

Correlation functions

nbodykit also includes functionality for computing theoretical correlation functions by computing the Fourier transform of a power spectrum. We use the mcfit package, which provides a Python version of the FFTLog algorithm. The main class to compute correlation functions is CorrelationFunction. This class takes a power spectrum instance (one of LinearPower, HalofitPower, or ZeldovichPower) as input. Below, we show the correlation function for each of these classes:

[8]:
# initialize the correlation objects
cf_lin = cosmology.CorrelationFunction(Plin)
cf_nl = cosmology.CorrelationFunction(Pnl)
cf_zel = cosmology.CorrelationFunction(Pzel)

# plot each kind
r = np.logspace(-1, np.log10(150), 1000)
plt.plot(r, r**2 * cf_lin(r), label='linear')
plt.plot(r, r**2 * cf_nl(r), label='nonlinear')
plt.plot(r, r**2 * cf_zel(r), label="Zel'dovich")

# format the axes
plt.legend()
plt.xlabel(r"$r$ $[h^{-1} \mathrm{Mpc}]$")
plt.ylabel(r"$r^2 \xi \ [h^{-2} \mathrm{Mpc}^2]$")
plt.show()
_images/getting-started_cosmology_15_0.png

We also include utility functions for performing the \(P(k) \leftrightarrow \xi(r)\) transformation:

xi_to_pk(r, xi[, ell, extrap])

Return a callable function returning the power spectrum multipole of degree \(\ell\), as computed from the Fourier transform of the input \(r\) and \(\xi_\ell(r)\) arrays.

pk_to_xi(k, Pk[, ell, extrap])

Return a callable function returning the correlation function multipole of degree \(\ell\), as computed from the Fourier transform of the input \(k\) and \(P_\ell(k)\) arrays.

These functions take discretely evaluated \((r, \xi(r))\) or \((k, P(k))\) arrays and return a function that evaluates the appropriate transformed quantity.
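
A quick sketch of using these helpers (assuming pk_to_xi can be imported from the nbodykit.cosmology module, as listed above, and re-using the Plin object defined earlier):

import numpy as np
from nbodykit.cosmology import pk_to_xi

# evaluate the linear power on a k grid and build a callable xi(r)
k = np.logspace(-4, 1, 512)
xi_func = pk_to_xi(k, Plin(k))

r = np.logspace(0, 2, 10)
print(xi_func(r))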

The Cookbook

Here, we provide a set of recipes detailing a broad selection of the functionality available in nbodykit. Ranging from simple tasks to more complex workflows, we hope that these recipes help users become acquainted with nbodykit and illustrate its power for large-scale structure data analysis.

For users who wish to dive right into the examples, an interactive environment containing the cookbook recipes is available to users via the BinderHub service. Just click the launch button below to get started!

[Binder launch button]

Angular Pair Counting

In this notebook, we explore the functionality of the AngularPairCount algorithm, which computes the number of pairs in angular separation bins for survey-like input data, assumed to be angular sky coordinates (i.e., right ascension and declination).

Note

The data used in this notebook is not realistic – rather, we choose the simplicity of generating mock data to help users get up and running quickly. Although the end results are not cosmologically interesting, we use the mock data to help illustrate the various steps necessary to use the AngularPairCount algorithm properly.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing Mock Data

In this notebook, we construct our fake “data” catalog simply by generating 10,000 objects that have uniformly distributed right ascension and declination values within a region of the sky.

[4]:
NDATA = 10000

# randomly distributed RA/DEC on the sky
data = RandomCatalog(NDATA, seed=42)
data['RA']  = data.rng.uniform(low=110, high=260)
data['DEC'] = data.rng.uniform(low=-3.6, high=60)
Adding Weights

The AngularPairCount algorithm also supports computing weighted pair counts. The total weight for a pair of objects is the product of the individual weights. Users can specify the name of the column they would like to treat as a weight by passing the weight keyword to the algorithm constructor.

Here, we choose randomly distributed weights between 0 and 1.

[5]:
data['Weight'] = numpy.random.random(size=len(data))
Auto-correlation Pair Counts

To compute the pair counts, we simply specify the edges of the angular separation bins (in degrees), as well as the names of the columns holding the sky coordinates and weight values.

[6]:
# the angular bin edges in degrees
edges = numpy.linspace(0.1, 10., 20+1) # 20 total bins

# run the algorithm
r_auto = AngularPairCount(data, edges, ra='RA', dec='DEC', weight='Weight')
[ 000000.06 ]   0: 10-30 20:54  AngularPairCount INFO     using cpu grid decomposition: (1, 1, 1)
[ 000000.08 ]   0: 10-30 20:54  AngularPairCount INFO     correlating 10000 x 10000 objects in total
[ 000000.08 ]   0: 10-30 20:54  AngularPairCount INFO     correlating A x B = 10000 x 10000 objects (median) per rank
[ 000000.08 ]   0: 10-30 20:54  AngularPairCount INFO     min A load per rank = 10000
[ 000000.08 ]   0: 10-30 20:54  AngularPairCount INFO     max A load per rank = 10000
[ 000000.08 ]   0: 10-30 20:54  AngularPairCount INFO     (even distribution would result in 10000 x 10000)
[ 000000.08 ]   0: 10-30 20:54  AngularPairCount INFO     calling function 'Corrfunc.mocks.DDtheta_mocks.DDtheta_mocks'
[ 000000.38 ]   0: 10-30 20:54  AngularPairCount INFO     10% done
[ 000000.70 ]   0: 10-30 20:54  AngularPairCount INFO     20% done
[ 000000.98 ]   0: 10-30 20:54  AngularPairCount INFO     30% done
[ 000001.26 ]   0: 10-30 20:54  AngularPairCount INFO     40% done
[ 000001.58 ]   0: 10-30 20:54  AngularPairCount INFO     50% done
[ 000001.94 ]   0: 10-30 20:54  AngularPairCount INFO     60% done
[ 000002.22 ]   0: 10-30 20:54  AngularPairCount INFO     70% done
[ 000002.52 ]   0: 10-30 20:54  AngularPairCount INFO     80% done
[ 000002.83 ]   0: 10-30 20:54  AngularPairCount INFO     90% done
[ 000003.13 ]   0: 10-30 20:54  AngularPairCount INFO     100% done

The measured pair counts are stored in the result attribute. The number of pairs in a given angular bin is stored as the npairs field. Below, we plot the number of pairs \(DD(\theta)\) as a function of angular separation.

[7]:
pc = r_auto.result
plt.plot(pc['theta'], pc['npairs'])

# format the axes
plt.xlabel(r"$\theta$ [$\mathrm{degrees}$]")
plt.ylabel(r"$DD(\theta)$")
[7]:
<matplotlib.text.Text at 0x112e080b8>
_images/cookbook_angular-paircount_12_1.png

The total weight for a given pair is the product of their individual weights, and the average total weight is computed in each angular bin. This value is stored in the weightavg field of the result attribute. Below, we plot the average weight as a function of angular separation.

[8]:
plt.plot(r_auto.result['theta'], r_auto.result['weightavg'])

# format the axes
plt.xlabel(r"$\theta$ [$\mathrm{degrees}$]")
plt.ylabel(r"$\mathrm{average \ weight}$")
plt.ylim(0, 1.0)
[8]:
(0, 1.0)
_images/cookbook_angular-paircount_14_1.png

Here, we chose to weight our objects with random values uniformly distributed between 0 and 1. The total weight of a pair is the product of two independent weights, each with mean 0.5, so the average pair weight is 0.25, which we see reproduced in the figure above.

Cross-correlation Pair Counts

We can also compute the cross-correlation of pair counts between two sources. Below, we initialize a second randomly distributed catalog, and compute the pair counts between the two catalogs.

[9]:
# randomly distributed RA/DEC on the sky
data2 = RandomCatalog(NDATA, seed=84)
data2['RA']  = data2.rng.uniform(low=110, high=260, size=len(data2))
data2['DEC'] = data2.rng.uniform(low=-3.6, high=60., size=len(data2))
data2['Weight'] = numpy.random.random(size=len(data2))
[10]:
# compute the cross correlation pair counts
r_cross = AngularPairCount(data, edges, ra='RA', dec='DEC', weight='Weight', second=data2)
[ 000006.07 ]   0: 10-30 20:54  AngularPairCount INFO     using cpu grid decomposition: (1, 1, 1)
[ 000006.12 ]   0: 10-30 20:54  AngularPairCount INFO     correlating 10000 x 10000 objects in total
[ 000006.12 ]   0: 10-30 20:54  AngularPairCount INFO     correlating A x B = 10000 x 10000 objects (median) per rank
[ 000006.12 ]   0: 10-30 20:54  AngularPairCount INFO     min A load per rank = 10000
[ 000006.13 ]   0: 10-30 20:54  AngularPairCount INFO     max A load per rank = 10000
[ 000006.13 ]   0: 10-30 20:54  AngularPairCount INFO     (even distribution would result in 10000 x 10000)
[ 000006.14 ]   0: 10-30 20:54  AngularPairCount INFO     calling function 'Corrfunc.mocks.DDtheta_mocks.DDtheta_mocks'
[ 000006.44 ]   0: 10-30 20:54  AngularPairCount INFO     10% done
[ 000006.72 ]   0: 10-30 20:54  AngularPairCount INFO     20% done
[ 000007.00 ]   0: 10-30 20:54  AngularPairCount INFO     30% done
[ 000007.29 ]   0: 10-30 20:54  AngularPairCount INFO     40% done
[ 000007.59 ]   0: 10-30 20:54  AngularPairCount INFO     50% done
[ 000007.88 ]   0: 10-30 20:54  AngularPairCount INFO     60% done
[ 000008.16 ]   0: 10-30 20:54  AngularPairCount INFO     70% done
[ 000008.45 ]   0: 10-30 20:54  AngularPairCount INFO     80% done
[ 000008.74 ]   0: 10-30 20:54  AngularPairCount INFO     90% done
[ 000009.03 ]   0: 10-30 20:54  AngularPairCount INFO     100% done
[11]:
# plot the auto and cross pair counts
pc_auto = r_auto.result
pc_cross = r_cross.result
plt.plot(pc_auto['theta'], pc_auto['npairs'], label='auto correlation')
plt.plot(pc_cross['theta'], pc_cross['npairs'], label='cross correlation')

# format the axes
plt.legend(loc=0)
plt.xlabel(r"$\theta$ [$\mathrm{degrees}$]")
plt.ylabel(r"$DD(\theta)$")
[11]:
<matplotlib.text.Text at 0x1198cff28>
_images/cookbook_angular-paircount_19_1.png

The Multipoles of the BOSS DR12 Dataset

In this notebook, we use the ConvolvedFFTPower algorithm to compute the monopole, quadrupole, and hexadecapole of the BOSS DR12 LOWZ South Galactic Cap (SGC) galaxy sample. This data set is described in detail in Reid et al. 2016, and the cosmological results for the BOSS DR12 data are presented in Alam et al. 2017.

Here, we measure the power spectrum multipoles for the SGC LOWZ sample due to its relatively small size, such that this notebook can be executed in reasonable time on a laptop. The analysis presented in this notebook can easily be applied to the BOSS DR12 CMASS sample and samples in both the North and South Galactic Cap regions.

The BOSS dataset includes both the “data” and the “randoms” catalogs, each of which includes FKP weights, completeness weights, and \(n(z)\) values. This notebook illustrates how to incorporate these analysis steps into the power spectrum analysis using nbodykit.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style

import os
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Getting the Data

The BOSS DR12 galaxy data sets are available for download on the SDSS DR12 homepage. In this notebook, we use the “data” and “randoms” catalogs for the LOWZ South Galactic Cap sample (the URLs appear in the download function below).

Below, we’ve included a function to download these data files to a directory of your choosing. The files will only be downloaded if they have not yet been downloaded.

[4]:
def print_download_progress(count, block_size, total_size):
    import sys
    pct_complete = float(count * block_size) / total_size
    msg = "\r- Download progress: {0:.1%}".format(pct_complete)
    sys.stdout.write(msg)
    sys.stdout.flush()

def download_data(download_dir):
    """
    Download the FITS data needed for this notebook to the specified directory.

    Parameters
    ----------
    download_dir : str
        the data will be downloaded to this directory
    """
    from six.moves import urllib
    import shutil
    import gzip

    urls = ['https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_LOWZ_South.fits.gz',
            'https://data.sdss.org/sas/dr12/boss/lss/random0_DR12v5_LOWZ_South.fits.gz']
    filenames = ['galaxy_DR12v5_LOWZ_South.fits', 'random0_DR12v5_LOWZ_South.fits']

    # download both files
    for i, url in enumerate(urls):

        # the download path
        filename = url.split('/')[-1]
        file_path = os.path.join(download_dir, filename)
        final_path = os.path.join(download_dir, filenames[i])

        # do not re-download
        if not os.path.exists(final_path):
            print("Downloading %s" % url)

            # Check if the download directory exists, otherwise create it.
            if not os.path.exists(download_dir):
                os.makedirs(download_dir)

            # Download the file from the internet.
            file_path, _ = urllib.request.urlretrieve(url=url,
                                                      filename=file_path,
                                                      reporthook=print_download_progress)

            print()
            print("Download finished. Extracting files.")

            # unzip the file
            with gzip.open(file_path, 'rb') as f_in, open(final_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
            os.remove(file_path)
            print("Done.")
        else:
            print("Data has already been downloaded.")
[5]:
# download the data to the current directory
download_dir = "."
download_data(download_dir)
Data has already been downloaded.
Data has already been downloaded.
Loading the Data from Disk

The “data” and “randoms” catalogs are stored on disk as FITS objects, and we can use the FITSCatalog object to read the data from disk.

Note that the IO operations are performed on-demand, so no data is read from disk until an algorithm requires it to be read. For more details, see the On-Demand IO section of the documentation.

Note

To specify the directory where the BOSS catalogs were downloaded, change the download_dir variable used below.

[6]:
# NOTE: change this path if you downloaded the data somewhere else!
data_path = os.path.join(download_dir, 'galaxy_DR12v5_LOWZ_South.fits')
randoms_path = os.path.join(download_dir, 'random0_DR12v5_LOWZ_South.fits')

# initialize the FITS catalog objects for data and randoms
data = FITSCatalog(data_path)
randoms = FITSCatalog(randoms_path)
[ 000001.32 ]   0: 08-03 17:15  CatalogSource   INFO     Extra arguments to FileType: ('./galaxy_DR12v5_LOWZ_South.fits',)
[ 000001.32 ]   0: 08-03 17:15  CatalogSource   INFO     Extra arguments to FileType: ('./random0_DR12v5_LOWZ_South.fits',)

We can analyze the available columns in the catalogs via the columns attribute:

[7]:
print('data columns = ', data.columns)
data columns =  ['AIRMASS', 'CAMCOL', 'COMP', 'DEC', 'DEVFLUX', 'EB_MINUS_V', 'EXPFLUX', 'EXTINCTION', 'FIBER2FLUX', 'FIBERID', 'FIELD', 'FINALN', 'FRACPSF', 'ICHUNK', 'ICOLLIDED', 'ID', 'IMAGE_DEPTH', 'IMATCH', 'INGROUP', 'IPOLY', 'ISECT', 'MJD', 'MODELFLUX', 'MULTGROUP', 'NZ', 'PLATE', 'PSFFLUX', 'PSF_FWHM', 'RA', 'RERUN', 'RUN', 'R_DEV', 'SKYFLUX', 'SPECTILE', 'Selection', 'TILE', 'Value', 'WEIGHT_CP', 'WEIGHT_FKP', 'WEIGHT_NOZ', 'WEIGHT_SEEING', 'WEIGHT_STAR', 'WEIGHT_SYSTOT', 'Weight', 'Z']
[8]:
print('randoms columns = ', randoms.columns)
randoms columns =  ['AIRMASS', 'DEC', 'EB_MINUS_V', 'IMAGE_DEPTH', 'IPOLY', 'ISECT', 'NZ', 'PSF_FWHM', 'RA', 'SKYFLUX', 'Selection', 'Value', 'WEIGHT_FKP', 'Weight', 'Z', 'ZINDX']
Select the Correct Redshift Range

The LOWZ galaxy sample is defined over a redshift range of \(0.15 < z < 0.43\), in order not to overlap with the CMASS sample. First, we will slice our input catalogs to the proper redshift range using the Z column.

When manipulating loaded data, it is usually best to trim the catalogs via slice operations before adding new columns or performing any additional data cleaning.

[9]:
ZMIN = 0.15
ZMAX = 0.43

# slice the randoms
valid = (randoms['Z'] > ZMIN)&(randoms['Z'] < ZMAX)
randoms = randoms[valid]

# slice the data
valid = (data['Z'] > ZMIN)&(data['Z'] < ZMAX)
data = data[valid]
Adding the Cartesian Coordinates

Both the “data” and “randoms” catalogs include positions of objects in terms of right ascension, declination, and redshift. Next, we add the Position column to both catalogs by converting from these sky coordinates to Cartesian coordinates, using the helper function transform.SkyToCartesian.

To convert from redshift to comoving distance, we use the fiducial DR12 BOSS cosmology, as defined in Alam et al. 2017.

[10]:
# the fiducial BOSS DR12 cosmology
cosmo = cosmology.Cosmology(h=0.676).match(Omega0_m=0.31)

# add Cartesian position column
data['Position'] = transform.SkyToCartesian(data['RA'], data['DEC'], data['Z'], cosmo=cosmo)
randoms['Position'] = transform.SkyToCartesian(randoms['RA'], randoms['DEC'], randoms['Z'], cosmo=cosmo)
Add the Completeness Weights

Next, we specify the completeness weights for the “data” and “randoms”. By construction, there are no systematic variations in the number density of the “randoms”, so the completeness weights are set to unity for all objects. For the “data” catalog, the completeness weights are computed as defined in eq. 48 of Reid et al. 2016. These weights account for systematic issues, redshift failures, and missing objects due to close pair collisions on the fiber plate.

[11]:
randoms['WEIGHT'] = 1.0
data['WEIGHT'] = data['WEIGHT_SYSTOT'] * (data['WEIGHT_NOZ'] + data['WEIGHT_CP'] - 1.0)
Computing the Multipoles

To compute the multipoles, first we combine the “data” and “randoms” catalogs into a single FKPCatalog, which provides a common interface to the data in both catalogs. Then, we convert this FKPCatalog to a mesh object, specifying the number of mesh cells per side, as well as the names of the \(n(z)\) and weight columns.

[12]:
# combine the data and randoms into a single catalog
fkp = FKPCatalog(data, randoms)

We initialize a \(256^3\) mesh to paint the density field. Most likely, users will want to increase this number on machines with enough memory in order to avoid the effects of aliasing on the measured multipoles. We set the value to \(256^3\) to ensure this notebook runs on most machines.

We also tell the mesh object the name of the \(n(z)\) column via the nbar keyword, the FKP weight column via the fkp_weight keyword, and the completeness weight column via the comp_weight keyword.

[13]:
mesh = fkp.to_mesh(Nmesh=256, nbar='NZ', fkp_weight='WEIGHT_FKP', comp_weight='WEIGHT', window='tsc')
[ 000017.34 ]   0: 08-03 17:15  FKPCatalog      INFO     BoxSize = [ 868. 1635.  919.]
[ 000017.34 ]   0: 08-03 17:15  FKPCatalog      INFO     cartesian coordinate range: [ 304.15007706 -787.61963937 -219.56693168] : [1154.5008732   815.19825493  680.73166297]
[ 000017.34 ]   0: 08-03 17:15  FKPCatalog      INFO     BoxCenter = [729.32547513  13.78930778 230.58236564]

Users can also pass a BoxSize keyword to the to_mesh() function in order to specify the size of the Cartesian box that the mesh is embedded in. By default, the maximum extent of the “randoms” catalog sets the size of the box.

Now, we are able to compute the desired multipoles. Here, we compute the \(\ell=0, 2,\) and \(4\) multipoles using a wavenumber bin spacing of \(dk = 0.005 \ h \ \mathrm{Mpc}^{-1}\). The maximum \(k\) value computed is set by the Nyquist frequency of the mesh, \(k_\mathrm{max} = k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\).

[14]:
# compute the multipoles
r = ConvolvedFFTPower(mesh, poles=[0,2,4], dk=0.005, kmin=0.)
[ 000017.36 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     using compensation function CompensateTSCShotnoise for source 'first'
[ 000017.36 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     using compensation function CompensateTSCShotnoise for source 'second'
[ 000027.32 ]   0: 08-03 17:15  CatalogMesh     INFO     painted 5549994 out of 5549994 objects to mesh
[ 000027.33 ]   0: 08-03 17:15  CatalogMesh     INFO     painted 5549994 out of 5549994 objects to mesh
[ 000027.33 ]   0: 08-03 17:15  CatalogMesh     INFO     mean particles per cell is 0.0776352
[ 000027.33 ]   0: 08-03 17:15  CatalogMesh     INFO     sum is 1.3025e+06
[ 000027.33 ]   0: 08-03 17:15  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000027.74 ]   0: 08-03 17:15  CatalogMesh     INFO     painted 113525 out of 113525 objects to mesh
[ 000027.74 ]   0: 08-03 17:15  CatalogMesh     INFO     painted 113525 out of 113525 objects to mesh
[ 000027.75 ]   0: 08-03 17:15  CatalogMesh     INFO     mean particles per cell is 0.0016465
[ 000027.75 ]   0: 08-03 17:15  CatalogMesh     INFO     sum is 27623.7
[ 000027.75 ]   0: 08-03 17:15  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000027.80 ]   0: 08-03 17:15  FKPCatalogMesh  INFO     field: (FKPCatalog(species=['data', 'randoms']) as CatalogMesh) painting done
[ 000027.80 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     tsc painting of 'first' done
[ 000028.17 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     ell = 0 done; 1 r2c completed
[ 000031.15 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     normalized power spectrum with `randoms.norm = 2.076940`
[ 000033.27 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     ell = 2 done; 5 r2c completed
[ 000038.22 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     ell = 4 done; 9 r2c completed
[ 000039.10 ]   0: 08-03 17:15  ConvolvedFFTPower INFO     higher order multipoles computed in elapsed time 00:00:07.94

The meta-data computed during the calculation is stored in the attrs dictionary. See the documentation for more information.

[15]:
for key in r.attrs:
    print("%s = %s" % (key, str(r.attrs[key])))
poles = [0, 2, 4]
dk = 0.005
kmin = 0.0
use_fkp_weights = False
P0_FKP = None
Nmesh = [256 256 256]
BoxSize = [ 868. 1635.  919.]
BoxPad = [0.02 0.02 0.02]
BoxCenter = [729.32547513  13.78930778 230.58236564]
mesh.window = tsc
mesh.interlaced = False
alpha = 0.021211734643316733
data.norm = 2.0767224
randoms.norm = 2.076939564818603
shotnoise = 3631.26854558135
randoms.N = 5549994
randoms.W = 5549994.0
randoms.num_per_cell = 0.0776351608830188
data.N = 113525
data.W = 117725.0
data.num_per_cell = 0.001646501710638404
data.ext = 1
randoms.ext = 1

The measured multipoles are stored in the poles attribute. Below, we plot the monopole, quadrupole, and hexadecapole, making sure to subtract out the shot noise value from the monopole.

[16]:
poles = r.poles

for ell in [0, 2, 4]:
    label = r'$\ell=%d$' % (ell)
    P = poles['power_%d' %ell].real
    if ell == 0: P = P - r.attrs['shotnoise']
    plt.plot(poles['k'], poles['k']*P, label=label)

# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$k \ P_\ell$ [$h^{-2} \ \mathrm{Mpc}^2$]")
plt.xlim(0.01, 0.25)
[16]:
(0.01, 0.25)
_images/cookbook_boss-dr12-data_30_1.png

For the LOWZ SGC sample, with \(N = 113525\) objects, the measured multipoles are noisier than for the other DR12 galaxy samples. Nonetheless, we have measured a clear monopole and quadrupole signal, with the hexadecapole remaining largely consistent with zero.

Note that the results here are measured up to the 1D Nyquist frequency, \(k_\mathrm{max} = k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\). Users can increase the Nyquist frequency and decrease the effects of aliasing on the measured power by increasing the mesh size. Using interlacing (by setting interlaced=True) can also reduce the effects of aliasing on the measured results.
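As a hedged sketch of both options (the mesh size here is illustrative, and the computation would take correspondingly longer):

# sketch: a finer, interlaced mesh raises the Nyquist frequency and suppresses aliasing
mesh_hires = fkp.to_mesh(Nmesh=512, nbar='NZ', fkp_weight='WEIGHT_FKP',
                         comp_weight='WEIGHT', window='tsc', interlaced=True)
r_hires = ConvolvedFFTPower(mesh_hires, poles=[0,2,4], dk=0.005, kmin=0.)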

Converting from \(P_\ell(k)\) to \(P(k,\mu)\)

The ConvolvedFFTPower.to_pkmu function allows users to rotate the measured multipoles, stored as the poles attribute, into \(P(k,\mu)\) wedges. Below, we convert our measurements of \(P_0\), \(P_2\), and \(P_4\) into 3 \(\mu\) wedges.

[17]:
# use the same number of mu wedges and number of multipoles
Nmu = Nell = 3
mu_edges = numpy.linspace(0, 1, Nmu+1)

# get a BinnedStatistic holding the P(k,mu) wedges
Pkmu = r.to_pkmu(mu_edges, 4)
[18]:
# plot each mu bin
for i in range(Pkmu.shape[1]):
    Pk = Pkmu[:,i] # select the ith mu bin
    label = r'$\mu$=%.2f' % (Pkmu.coords['mu'][i])
    plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)

# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.25)
[18]:
(0.01, 0.25)
_images/cookbook_boss-dr12-data_34_1.png

The Power Spectrum of Survey Data

In this notebook, we explore the functionality of the ConvolvedFFTPower algorithm, which computes the power spectrum multipoles \(P_\ell(k)\) for data from a survey that includes non-trivial selection effects. The output of the algorithm is the true multipoles of the data, convolved with the window function of the survey.

The input data for this algorithm is assumed to be from an observational survey, with the position coordinates specified in terms of right ascension, declination, and redshift.

Note

The data used in this notebook is not realistic – rather, we choose the simplicity of generating mock data to help users get up and running quickly. Although the end results are not cosmologically interesting, we use the mock data to help illustrate the various steps necessary to use the ConvolvedFFTPower algorithm properly.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style

from scipy.interpolate import InterpolatedUnivariateSpline
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing Mock Data

We start by generating mock catalogs to mimic the “data” and “randoms” catalogs needed by the ConvolvedFFTPower algorithm. Here, the “data” catalog usually gives the information about the galaxy objects, and the “randoms” catalog is a catalog of synthetic objects without any cosmological clustering signal. The “randoms” usually have a higher number density than the “data” and are a Monte Carlo representation of the survey volume.

In this notebook, we construct our fake “data” and “randoms” catalogs simply by generating uniformly distributed right ascension and declination values within a region of the sky, with redshifts drawn from a Gaussian distribution with \(\mu=0.5\) and \(\sigma=0.1\).

[4]:
NDATA = 50000

# initialize data and randoms catalogs
data = RandomCatalog(NDATA, seed=42)
randoms = RandomCatalog(NDATA*10, seed=84)

# add the (ra, dec, z) columns
for s in [data, randoms]:
    s['z']   = s.rng.normal(loc=0.5, scale=0.1)
    s['ra']  = s.rng.uniform(low=110, high=260)
    s['dec'] = s.rng.uniform(low=-3.6, high=60.)
Adding the Cartesian Coordinates

Next, we add the Position column to both the “data” and “randoms” by converting from sky coordinates to Cartesian coordinates, using the helper function transform.SkyToCartesian. The redshift to comoving distance transformation requires a cosmology instance, so we first initialize our desired cosmology parameters.

[5]:
# specify our cosmology
cosmo = cosmology.Cosmology(h=0.7).match(Omega0_m=0.31)

# add Cartesian position column
data['Position'] = transform.SkyToCartesian(data['ra'], data['dec'], data['z'], cosmo=cosmo)
randoms['Position'] = transform.SkyToCartesian(randoms['ra'], randoms['dec'], randoms['z'], cosmo=cosmo)
Specifying the “data” \(n(z)\)

The ConvolvedFFTPower algorithm requires the number density as a function of redshift for the “data” catalog. Here, we use the RedshiftHistogram algorithm to compute the redshift histogram of the “randoms” catalog, and then re-normalize the number density to that of the “data” catalog.

[6]:
# the sky fraction, used to compute volume in n(z)
FSKY = 0.15 # a made-up value

# compute n(z) from the randoms
zhist = RedshiftHistogram(randoms, FSKY, cosmo, redshift='z')

# re-normalize to the total size of the data catalog
alpha = 1.0 * data.csize / randoms.csize

# add n(z) from randoms to the FKP source
nofz = InterpolatedUnivariateSpline(zhist.bin_centers, alpha*zhist.nbar)

# plot
plt.plot(zhist.bin_centers, alpha*zhist.nbar)
plt.xlabel(r"$z$", fontsize=16)
plt.ylabel(r"$n(z)$ $[h^{3} \mathrm{Mpc}^{-3}]$", fontsize=16)
[ 000001.11 ]   0: 08-03 17:16  RedshiftHistogram INFO     using Scott's rule to determine optimal binning; h = 4.40e-03, N_bins = 207
[ 000001.14 ]   0: 08-03 17:16  RedshiftHistogram INFO     using cosmology {'output': 'vTk dTk mPk', 'extra metric transfer functions': 'y', 'n_s': 0.9667, 'gauge': 'synchronous', 'N_ur': 2.0328, 'h': 0.7, 'T_cmb': 2.7255, 'N_ncdm': 1, 'P_k_max_h/Mpc': 10.0, 'z_max_pk': 100.0, 'Omega_b': 0.04775550899153668, 'Omega_cdm': 0.2609299279412303, 'm_ncdm': [0.06]} to compute volume in units of (Mpc/h)^3
[ 000001.14 ]   0: 08-03 17:16  RedshiftHistogram INFO     sky fraction used in volume calculation: 0.1500
[6]:
Text(0,0.5,'$n(z)$ $[h^{3} \\mathrm{Mpc}^{-3}]$')
_images/cookbook_convpower_10_2.png

In this figure, note that the measured \(n(z)\) for the data closely resembles the input distribution we used: a Gaussian distribution with \(\mu=0.5\) and \(\sigma=0.1\).

Next, we initialize the FKPCatalog, which combines the “data” and “randoms” catalogs into a single object. Columns are now available in the FKPCatalog prefixed by either “data/” or “randoms/”.

[7]:
# initialize the FKP source
fkp = FKPCatalog(data, randoms)

# print out the columns
print("columns in FKPCatalog = ", fkp.columns)
columns in FKPCatalog =  ['data/FKPWeight', 'data/Position', 'data/Selection', 'data/Value', 'data/Weight', 'data/dec', 'data/ra', 'data/z', 'randoms/FKPWeight', 'randoms/Position', 'randoms/Selection', 'randoms/Value', 'randoms/Weight', 'randoms/dec', 'randoms/ra', 'randoms/z']

And we add the \(n(z)\) column to both the “data” and “randoms”, using the appropriate redshift column to compute the results.

[8]:
# add the n(z) columns to the FKPCatalog
fkp['randoms/NZ'] = nofz(randoms['z'])
fkp['data/NZ'] = nofz(data['z'])
Adding FKP Weights

Here, we add a column FKPWeight that gives the appropriate FKP weight for each catalog. The FKP weights are given by:

\[w_\mathrm{FKP} = \frac{1}{1 + n(z)P_0}.\]

Here, we use a value of \(P_0 = 10^4 \ h^{-3} \mathrm{Mpc}^3\).

[9]:
fkp['data/FKPWeight'] = 1.0 / (1 + fkp['data/NZ'] * 1e4)
fkp['randoms/FKPWeight'] = 1.0 / (1 + fkp['randoms/NZ'] * 1e4)
Adding Completeness Weights

The ConvolvedFFTPower algorithm also supports the use of completeness weights, which weight the number density fields of the “data” and “randoms” catalogs. Here, we add random weights to both catalogs as the Weight column.

Completeness weights change the number density field such that the weighted number density field on the mesh is equal to \(n'(\mathbf{r}) = w_c(\mathbf{r}) n(\mathbf{r})\), where \(w_c\) represents the completeness weights.

[10]:
fkp['data/Weight'] = numpy.random.random(size=data.size)
fkp['randoms/Weight'] = numpy.random.random(size=randoms.size)
Computing the Multipoles

To compute the multipoles, first we convert our FKPCatalog to a mesh object, specifying the number of mesh cells per side, as well as the names of the \(n(z)\) and weight columns.

If a Cartesian box size is not specified by the user, the size will be computed from the maximum extent of the Position column automatically.

[11]:
mesh = fkp.to_mesh(Nmesh=256, nbar='NZ', comp_weight='Weight', fkp_weight='FKPWeight')
[ 000003.45 ]   0: 08-03 17:16  FKPCatalog      INFO     BoxSize = [2126. 4028. 2042.]
[ 000003.45 ]   0: 08-03 17:16  FKPCatalog      INFO     cartesian coordinate range: [-2114.09611013 -1983.75342661  -126.22300371] : [ -30.15508148 1964.60785817 1875.18039017]
[ 000003.45 ]   0: 08-03 17:16  FKPCatalog      INFO     BoxCenter = [-1072.1255958     -9.57278422   874.47869323]

Now, we are able to compute the desired multipoles. Here, we compute the \(\ell=0,2,\) and \(4\) multipoles using a wavenumber bin spacing of \(\Delta k = 0.005\) \(h/\mathrm{Mpc}\). The maximum \(k\) value computed is set by the Nyquist frequency of the mesh, \(k_\mathrm{max} = k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\).

[12]:
# compute the multipoles
r = ConvolvedFFTPower(mesh, poles=[0,2,4], dk=0.005, kmin=0.01)
[ 000003.59 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     using compensation function CompensateCICShotnoise for source 'first'
[ 000003.59 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     using compensation function CompensateCICShotnoise for source 'second'
[ 000004.55 ]   0: 08-03 17:16  CatalogMesh     INFO     painted 500000 out of 500000 objects to mesh
[ 000004.56 ]   0: 08-03 17:16  CatalogMesh     INFO     painted 500000 out of 500000 objects to mesh
[ 000004.56 ]   0: 08-03 17:16  CatalogMesh     INFO     mean particles per cell is 0.0124907
[ 000004.56 ]   0: 08-03 17:16  CatalogMesh     INFO     sum is 209559
[ 000004.56 ]   0: 08-03 17:16  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000004.64 ]   0: 08-03 17:16  CatalogMesh     INFO     painted 50000 out of 50000 objects to mesh
[ 000004.65 ]   0: 08-03 17:16  CatalogMesh     INFO     painted 50000 out of 50000 objects to mesh
[ 000004.65 ]   0: 08-03 17:16  CatalogMesh     INFO     mean particles per cell is 0.00125849
[ 000004.65 ]   0: 08-03 17:16  CatalogMesh     INFO     sum is 21114
[ 000004.65 ]   0: 08-03 17:16  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000004.70 ]   0: 08-03 17:16  FKPCatalogMesh  INFO     field: (FKPCatalog(species=['data', 'randoms']) as CatalogMesh) painting done
[ 000004.70 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     cic painting of 'first' done
[ 000005.07 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     ell = 0 done; 1 r2c completed
[ 000005.67 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     normalized power spectrum with `randoms.norm = 0.333066`
[ 000007.68 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     ell = 2 done; 5 r2c completed
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/algorithms/fftpower.py:616: RuntimeWarning: invalid value encountered in sqrt
  xslab **= 0.5
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/meshtools.py:136: RuntimeWarning: divide by zero encountered in true_divide
  return sum(self.coords(i) * los[i] for i in range(self.ndim)) / self.norm2()**0.5
[ 000012.58 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     ell = 4 done; 9 r2c completed
[ 000013.47 ]   0: 08-03 17:16  ConvolvedFFTPower INFO     higher order multipoles computed in elapsed time 00:00:07.81

The meta-data computed during the calculation is stored in the attrs dictionary. See the documentation for more information.

[13]:
for key in r.attrs:
    print("%s = %s" % (key, str(r.attrs[key])))
poles = [0, 2, 4]
dk = 0.005
kmin = 0.01
use_fkp_weights = False
P0_FKP = None
Nmesh = [256 256 256]
BoxSize = [2126. 4028. 2042.]
BoxPad = [0.02 0.02 0.02]
BoxCenter = [-1072.1255958     -9.57278422   874.47869323]
mesh.window = cic
mesh.interlaced = False
alpha = 0.10068100439630336
data.norm = 0.3319168209225001
randoms.norm = 0.3330663605881116
shotnoise = 39294.08527131943
randoms.N = 500000
randoms.W = 250038.26038984343
randoms.num_per_cell = 0.012490674008300421
data.N = 50000
data.W = 25174.10319355387
data.num_per_cell = 0.0012584947277048816
data.seed = 42
randoms.seed = 84

The measured multipoles are stored in the poles attribute. Below, we plot the monopole, quadrupole, and hexadecapole, making sure to subtract out the shot noise value from the monopole.

[14]:
poles = r.poles

for ell in [0, 2, 4]:
    label = r'$\ell=%d$' % (ell)
    P = poles['power_%d' %ell].real
    if ell == 0: P = P - poles.attrs['shotnoise']
    plt.plot(poles['k'], P, label=label)

# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P_\ell$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.3)
[14]:
(0.01, 0.3)
_images/cookbook_convpower_28_1.png

Note that, as expected, there is no measurable cosmological signal, since the input catalogs were simply uniformly distributed objects on the sky.

The Power Spectrum of Data in a Simulation Box

In this notebook, we explore the functionality of the FFTPower algorithm, which can compute the 1D power spectrum \(P(k)\), 2D power spectrum \(P(k,\mu)\), and multipoles \(P_\ell(k)\). The algorithm is suitable for use on data sets in periodic simulation boxes, as the power spectrum is computed via a single FFT of the density mesh.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing a Log-normal Mock

We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-4} \ h^{3} \mathrm{Mpc}^{-3}\).

[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
b1 = 2.0

cat = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=1380., Nmesh=256, bias=b1, seed=42)

We update the Position column to add redshift-space distortions along the z axis of the box using the VelocityOffset column.

[5]:
# add RSD
line_of_sight = [0,0,1]
cat['RSDPosition'] = cat['Position'] + cat['VelocityOffset'] * line_of_sight
Computing the 1D Power, \(P(k)\)

In this section, we compute and plot the 1D power spectrum \(P(k)\) of the log-normal mock.

We must first convert our CatalogSource object to a MeshSource, by setting up the mesh and specifying which interpolation kernel we wish to use. Here, we use “TSC” interpolation, and specify via compensated=True that we wish to correct for the effects of the interpolation window in Fourier space.

[6]:
# convert to a MeshSource, using TSC interpolation on 256^3 mesh
mesh = cat.to_mesh(window='tsc', Nmesh=256, compensated=True, position='RSDPosition')

We compute the 1D power spectrum by specifying mode as “1d”. We also choose the desired linear k binning by specifying the bin spacing via dk and the minimum k value via kmin.

[7]:
# compute the power, specifying desired linear k-binning
r = FFTPower(mesh, mode='1d', dk=0.005, kmin=0.01)
[ 000010.90 ]   0: 10-30 20:44  CatalogMesh     INFO     painted 787121 out of 787121 objects to mesh
[ 000010.90 ]   0: 10-30 20:44  CatalogMesh     INFO     mean particles per cell is 0.0469161
[ 000010.91 ]   0: 10-30 20:44  CatalogMesh     INFO     sum is 787120
[ 000010.91 ]   0: 10-30 20:44  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000011.25 ]   0: 10-30 20:44  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done

The result is computed when the FFTPower class is initialized, and the power spectrum results are stored as a BinnedStatistic object in the power attribute.

[8]:
# the result is stored at "power" attribute
Pk = r.power
print(Pk)
<BinnedStatistic: dims: (k: 115), variables: ('k', 'power', 'modes')>

The coords attribute of the BinnedStatistic object specifies the coordinate grid for the binned result, which in this case, is just the center values of the k bins.

By default, the power is computed up to the 1D Nyquist frequency, defined as \(k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\), which in this case is \(k_\mathrm{Nyq} = \pi \cdot 256 / 1380 \simeq 0.583 \ h \ \mathrm{Mpc}^{-1}\).
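This value can be checked directly from the mesh parameters:

# Nyquist frequency for a 256^3 mesh in a 1380 Mpc/h box
k_nyq = numpy.pi * 256 / 1380.
print(k_nyq)   # roughly 0.583 h/Mpc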

[9]:
print(Pk.coords)
{'k': array([ 0.0125,  0.0175,  0.0225,  0.0275,  0.0325,  0.0375,  0.0425,
        0.0475,  0.0525,  0.0575,  0.0625,  0.0675,  0.0725,  0.0775,
        0.0825,  0.0875,  0.0925,  0.0975,  0.1025,  0.1075,  0.1125,
        0.1175,  0.1225,  0.1275,  0.1325,  0.1375,  0.1425,  0.1475,
        0.1525,  0.1575,  0.1625,  0.1675,  0.1725,  0.1775,  0.1825,
        0.1875,  0.1925,  0.1975,  0.2025,  0.2075,  0.2125,  0.2175,
        0.2225,  0.2275,  0.2325,  0.2375,  0.2425,  0.2475,  0.2525,
        0.2575,  0.2625,  0.2675,  0.2725,  0.2775,  0.2825,  0.2875,
        0.2925,  0.2975,  0.3025,  0.3075,  0.3125,  0.3175,  0.3225,
        0.3275,  0.3325,  0.3375,  0.3425,  0.3475,  0.3525,  0.3575,
        0.3625,  0.3675,  0.3725,  0.3775,  0.3825,  0.3875,  0.3925,
        0.3975,  0.4025,  0.4075,  0.4125,  0.4175,  0.4225,  0.4275,
        0.4325,  0.4375,  0.4425,  0.4475,  0.4525,  0.4575,  0.4625,
        0.4675,  0.4725,  0.4775,  0.4825,  0.4875,  0.4925,  0.4975,
        0.5025,  0.5075,  0.5125,  0.5175,  0.5225,  0.5275,  0.5325,
        0.5375,  0.5425,  0.5475,  0.5525,  0.5575,  0.5625,  0.5675,
        0.5725,  0.5775,  0.5825])}

The input parameters to the algorithm, as well as the meta-data computed during the calculation, are stored in the attrs dictionary attribute. A key attribute is the Poisson shot noise, stored as the “shotnoise” key.

[10]:
# print out the meta-data
for k in Pk.attrs:
    print("%s = %s" %(k, str(Pk.attrs[k])))
Nmesh = [256 256 256]
BoxSize = [ 1380.  1380.  1380.]
dk = 0.005
kmin = 0.01
Lx = 1380.0
Ly = 1380.0
Lz = 1380.0
volume = 2628072000.0
mode = 1d
los = [0, 0, 1]
Nmu = 1
poles = []
N1 = 787121
N2 = 787121
shotnoise = 3338.84116927

Now, we plot the 1D power, first subtracting out the shot noise. Note that the power is complex, so we only plot the real part.

[11]:
# print the shot noise subtracted P(k)
plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'])

# format the axes
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[11]:
(0.01, 0.6)
_images/cookbook_fftpower_20_1.png
Computing the 2D Power, \(P(k,\mu)\)

In this section, we compute and plot the 2D power spectrum \(P(k,\mu)\). For illustration, we compute results using a line-of-sight that is both parallel and perpendicular to the direction that we added redshift-space distortions.

Here, we compute \(P(k,\mu)\) where \(\mu\) is defined with respect to the z axis (los=[0,0,1]), using 5 \(\mu\) bins ranging from \(\mu=0\) to \(\mu=1\).

[12]:
# compute the 2D power
r = FFTPower(mesh, mode='2d', dk=0.005, kmin=0.01, Nmu=5, los=[0,0,1])
Pkmu = r.power
print(Pkmu)
[ 000014.28 ]   0: 10-30 20:44  CatalogMesh     INFO     painted 787121 out of 787121 objects to mesh
[ 000014.29 ]   0: 10-30 20:44  CatalogMesh     INFO     mean particles per cell is 0.0469161
[ 000014.29 ]   0: 10-30 20:44  CatalogMesh     INFO     sum is 787120
[ 000014.29 ]   0: 10-30 20:44  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000014.71 ]   0: 10-30 20:44  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
<BinnedStatistic: dims: (k: 115, mu: 5), variables: ('k', 'mu', 'power', 'modes')>

The coords attribute of the result now gives the centers of both the k and mu bins.

[13]:
print(Pkmu.coords)
{'k': array([ 0.0125,  0.0175,  0.0225,  0.0275,  0.0325,  0.0375,  0.0425,
        0.0475,  0.0525,  0.0575,  0.0625,  0.0675,  0.0725,  0.0775,
        0.0825,  0.0875,  0.0925,  0.0975,  0.1025,  0.1075,  0.1125,
        0.1175,  0.1225,  0.1275,  0.1325,  0.1375,  0.1425,  0.1475,
        0.1525,  0.1575,  0.1625,  0.1675,  0.1725,  0.1775,  0.1825,
        0.1875,  0.1925,  0.1975,  0.2025,  0.2075,  0.2125,  0.2175,
        0.2225,  0.2275,  0.2325,  0.2375,  0.2425,  0.2475,  0.2525,
        0.2575,  0.2625,  0.2675,  0.2725,  0.2775,  0.2825,  0.2875,
        0.2925,  0.2975,  0.3025,  0.3075,  0.3125,  0.3175,  0.3225,
        0.3275,  0.3325,  0.3375,  0.3425,  0.3475,  0.3525,  0.3575,
        0.3625,  0.3675,  0.3725,  0.3775,  0.3825,  0.3875,  0.3925,
        0.3975,  0.4025,  0.4075,  0.4125,  0.4175,  0.4225,  0.4275,
        0.4325,  0.4375,  0.4425,  0.4475,  0.4525,  0.4575,  0.4625,
        0.4675,  0.4725,  0.4775,  0.4825,  0.4875,  0.4925,  0.4975,
        0.5025,  0.5075,  0.5125,  0.5175,  0.5225,  0.5275,  0.5325,
        0.5375,  0.5425,  0.5475,  0.5525,  0.5575,  0.5625,  0.5675,
        0.5725,  0.5775,  0.5825]), 'mu': array([ 0.1,  0.3,  0.5,  0.7,  0.9])}

We plot the power in each of the 5 \(\mu\) bins, and we see the effects of redshift-space distortions as a function of \(\mu\).

[14]:
# plot each mu bin
for i in range(Pkmu.shape[1]):
    Pk = Pkmu[:,i] # select the ith mu bin
    label = r'$\mu$=%.1f' % (Pkmu.coords['mu'][i])
    plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)

# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[14]:
(0.01, 0.6)
_images/cookbook_fftpower_27_1.png

Now, we specify the line-of-sight as the x axis and again compute the 2D power.

[15]:
r = FFTPower(mesh, mode='2d', dk=0.005, kmin=0.01, Nmu=5, los=[1,0,0])
Pkmu = r.power
[ 000018.03 ]   0: 10-30 20:44  CatalogMesh     INFO     painted 787121 out of 787121 objects to mesh
[ 000018.03 ]   0: 10-30 20:44  CatalogMesh     INFO     mean particles per cell is 0.0469161
[ 000018.03 ]   0: 10-30 20:44  CatalogMesh     INFO     sum is 787120
[ 000018.03 ]   0: 10-30 20:44  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000018.45 ]   0: 10-30 20:44  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done

Again we plot the 2D power for each of the \(\mu\) bins, but now that \(\mu\) is not defined along the same axis as the redshift-space distortions, we measure isotropic power as a function of \(\mu\).

[16]:
# plot each mu bin
for i in range(Pkmu.shape[1]):
    Pk = Pkmu[:,i] # select the ith mu bin
    label = r'$\mu$=%.1f' % (Pkmu.coords['mu'][i])
    plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)

# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[16]:
(0.01, 0.6)
_images/cookbook_fftpower_31_1.png
Computing the Multipoles, \(P_\ell(k)\)

In this section, we also measure the power spectrum multipoles, \(P_\ell(k)\), which project the 2D power onto a basis defined by the Legendre polynomials. The desired multipole numbers \(\ell\) should be specified via the poles keyword.

[17]:
# compute the 2D power AND ell=0,2,4 multipoles
r = FFTPower(mesh, mode='2d', dk=0.005, kmin=0.01, Nmu=5, los=[0,0,1], poles=[0,2,4])
[ 000021.70 ]   0: 10-30 20:44  CatalogMesh     INFO     painted 787121 out of 787121 objects to mesh
[ 000021.70 ]   0: 10-30 20:44  CatalogMesh     INFO     mean particles per cell is 0.0469161
[ 000021.70 ]   0: 10-30 20:44  CatalogMesh     INFO     sum is 787120
[ 000021.70 ]   0: 10-30 20:44  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000022.04 ]   0: 10-30 20:44  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done

Now, there is an additional attribute poles which stores the \(P_\ell(k)\) result.

[18]:
poles = r.poles
print(poles)
print("variables = ", poles.variables)
<BinnedStatistic: dims: (k: 115), variables: 5 total>
variables =  ['k', 'power_0', 'power_2', 'power_4', 'modes']

We plot each multipole, subtracting the shot noise only from the monopole \(P_0\). The multipoles are stored using the variable names “power_0”, “power_2”, etc for \(\ell=0,2,\) etc.

[19]:
for ell in [0, 2, 4]:
    label = r'$\ell=%d$' % (ell)
    P = poles['power_%d' %ell].real
    if ell == 0: P = P - poles.attrs['shotnoise']
    plt.plot(poles['k'], poles['k'] * P, label=label)

# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$k \ P_\ell$ [$h^{-2} \mathrm{Mpc}^2$]")
plt.xlim(0.01, 0.6)
[19]:
(0.01, 0.6)
_images/cookbook_fftpower_37_1.png

Halo Occupation Distribution (HOD) Mocks

In this notebook, we will generate a mock halo catalog by running a FOF finder on a LogNormalCatalog. We will then populate those halos with galaxies via the HOD technique and compare the galaxy and halo power spectra in both real and redshift space.

Note

When viewing results in this notebook, keep in mind that the FOF catalog generated from the LogNormalCatalog object is only semi-realistic. We choose to use the LogNormalCatalog rather than a more realistic data set due to the ease with which we can load the mock data.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Generating Halos via the FOF Algorithm

We’ll start by generating a mock galaxy catalog at \(z=0.55\) from a log-normal density field.

[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
b1 = 2.0

input_cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=b1, seed=42)

And then, we run a friends-of-friends (FOF) algorithm on the mock galaxies to identify halos, using a linking length of 0.2 times the mean separation between objects. Here, we only keep halos with at least 5 particle members.

[5]:
fof = FOF(input_cat, linking_length=0.2, nmin=5)

Now, we set the particle mass to \(10^{12} \ M_\odot/h\) and create a HaloCatalog using the Planck 2015 parameters at a redshift of \(z = 0.55\). Here, ‘vir’ specifies that we wish to create the Mass column as the virial mass. This mass definition is necessary for computing the analytic concentration, which is needed when generating the HOD galaxies.

[6]:
halos = fof.to_halos(cosmo=cosmo, redshift=redshift, particle_mass=1e12, mdef='vir')

Finally, we convert the HaloCatalog object to a format recognized by the halotools package and initialize our HOD catalog, using the default HOD parameters.

[7]:
halotools_halos = halos.to_halotools()
hod = HODCatalog(halotools_halos)
[ 000048.48 ]   0: 10-30 20:53  HOD             INFO     satellite fraction: 0.05
[ 000048.48 ]   0: 10-30 20:53  HOD             INFO     mean log10 halo mass: 13.10
[ 000048.49 ]   0: 10-30 20:53  HOD             INFO     std log10 halo mass: 0.31

Using the gal_type column, we can create subsets of the HOD catalog that contain only centrals and only satellites.

[8]:
cen_idx = hod['gal_type'] == 0
sat_idx = hod['gal_type'] == 1

cens = hod[cen_idx]
sats = hod[sat_idx]
Real-space Power Spectra

In this section, we compute the real-space power spectrum of the FOF halos and the HOD galaxies and compare.

[9]:
# compute P(k) for the FOF halos
halo_power = FFTPower(halos, mode='1d', Nmesh=512)
[ 000049.20 ]   0: 10-30 20:53  CatalogMesh     INFO     painted 24720 out of 24720 objects to mesh
[ 000049.20 ]   0: 10-30 20:53  CatalogMesh     INFO     mean particles per cell is 0.000184178
[ 000049.20 ]   0: 10-30 20:53  CatalogMesh     INFO     sum is 24720
[ 000049.20 ]   0: 10-30 20:53  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000054.30 ]   0: 10-30 20:53  CatalogMesh     INFO     field: (HaloCatalog(size=24720) as CatalogMesh) painting done

And we can compute P(k) for all galaxies, centrals only, satellites only, and the cross power between centrals and satellites.

[10]:
gal_power = FFTPower(hod, mode='1d', Nmesh=256)
cen_power = FFTPower(cens, mode='1d', Nmesh=256)
sat_power = FFTPower(sats, mode='1d', Nmesh=256)
cen_sat_power = FFTPower(cens, second=sats, mode='1d', Nmesh=256)
[ 000063.02 ]   0: 10-30 20:53  CatalogMesh     INFO     painted 9551 out of 9551 objects to mesh
[ 000063.02 ]   0: 10-30 20:53  CatalogMesh     INFO     mean particles per cell is 0.000569284
[ 000063.02 ]   0: 10-30 20:53  CatalogMesh     INFO     sum is 9551
[ 000063.02 ]   0: 10-30 20:53  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000063.60 ]   0: 10-30 20:53  CatalogMesh     INFO     field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000064.60 ]   0: 10-30 20:53  CatalogMesh     INFO     painted 9119 out of 9119 objects to mesh
[ 000064.60 ]   0: 10-30 20:53  CatalogMesh     INFO     mean particles per cell is 0.000543535
[ 000064.60 ]   0: 10-30 20:53  CatalogMesh     INFO     sum is 9119
[ 000064.60 ]   0: 10-30 20:53  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000065.10 ]   0: 10-30 20:53  CatalogMesh     INFO     field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000066.14 ]   0: 10-30 20:53  CatalogMesh     INFO     painted 432 out of 432 objects to mesh
[ 000066.14 ]   0: 10-30 20:53  CatalogMesh     INFO     mean particles per cell is 2.57492e-05
[ 000066.15 ]   0: 10-30 20:53  CatalogMesh     INFO     sum is 432
[ 000066.15 ]   0: 10-30 20:53  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000067.08 ]   0: 10-30 20:53  CatalogMesh     INFO     field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000069.09 ]   0: 10-30 20:53  CatalogMesh     INFO     painted 9119 out of 9119 objects to mesh
[ 000069.09 ]   0: 10-30 20:53  CatalogMesh     INFO     mean particles per cell is 0.000543535
[ 000069.10 ]   0: 10-30 20:53  CatalogMesh     INFO     sum is 9119
[ 000069.10 ]   0: 10-30 20:53  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000069.93 ]   0: 10-30 20:53  CatalogMesh     INFO     field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000070.06 ]   0: 10-30 20:53  CatalogMesh     INFO     painted 432 out of 432 objects to mesh
[ 000070.07 ]   0: 10-30 20:53  CatalogMesh     INFO     mean particles per cell is 2.57492e-05
[ 000070.07 ]   0: 10-30 20:53  CatalogMesh     INFO     sum is 432
[ 000070.07 ]   0: 10-30 20:53  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000070.74 ]   0: 10-30 20:53  CatalogMesh     INFO     field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[11]:
# plot galaxy auto power, centrals auto power, and sats auto power
labels = [r"$P^{gg}$", r"$P^{cc}$", r"$P^{ss}$"]
for i, r in enumerate([gal_power, cen_power, sat_power]):
    Pk = r.power
    plt.loglog(Pk['k'], Pk['power'].real-Pk.attrs['shotnoise'], label=labels[i])

# central-satellite power
Pk = cen_sat_power.power
plt.loglog(Pk['k'], Pk['power'].real, label=r"$P^{cs}$")

# the halo power
Phalo = halo_power.power
plt.loglog(Phalo['k'], Phalo['power'].real-Phalo.attrs['shotnoise'], c='k', label=r"$P^\mathrm{halo}$")

# add a legend and axis labels
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
plt.ylim(1e4, 5e6)
[11]:
(10000.0, 5000000.0)
_images/cookbook_hod-mocks_19_1.png

Here, we see the high bias of the satellite power spectrum (green). Because the satellite fraction is relatively low (\(f_\mathrm{sat} \simeq 0.04\)), the amplitude of the total galaxy power \(P^{gg}\) is only slightly larger than the centrals auto power \(P^{cc}\) – the centrals dominate the clustering signal.

Redshift-space Power Spectra

First, we convert HOD galaxy positions from real to redshift space.

[12]:
LOS = [0, 0, 1]
hod['RSDPosition'] = hod['Position'] + hod['VelocityOffset'] * LOS

And then compute \(P(k,\mu)\) for the full HOD galaxy catalog in five \(\mu\) bins.

[13]:
mesh = hod.to_mesh(position='RSDPosition', Nmesh=256, compensated=True)
rsd_gal_power = FFTPower(mesh, mode='2d', Nmu=5)
[ 000074.18 ]   0: 10-30 20:53  CatalogMesh     INFO     painted 9551 out of 9551 objects to mesh
[ 000074.18 ]   0: 10-30 20:53  CatalogMesh     INFO     mean particles per cell is 0.000569284
[ 000074.18 ]   0: 10-30 20:53  CatalogMesh     INFO     sum is 9551
[ 000074.18 ]   0: 10-30 20:53  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000074.77 ]   0: 10-30 20:53  CatalogMesh     INFO     field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done

Finally, we plot the galaxy \(P(k,\mu)\) and compare it to the measured halo power in real space.

[14]:
pkmu = rsd_gal_power.power

# plot each mu bin
for i in range(pkmu.shape[1]):
    Pk = pkmu[:,i]
    label = r'$P^{gg}(\mu$=%.1f)' %pkmu.coords['mu'][i]
    plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)

# the halo power
Phalo = halo_power.power
plt.loglog(Phalo['k'], Phalo['power'].real-Phalo.attrs['shotnoise'], c='k', label=r"$P^\mathrm{halo}$")

# add a legend and axis labels
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
plt.ylim(1e4, 2e6)
[14]:
(10000.0, 2000000.0)
_images/cookbook_hod-mocks_27_1.png

In this figure, we see the effects of RSD on the HOD catalog, with the power spectrum damped at higher \(\mu\) on small scales (larger \(k\)).

The Effects of Interlaced Painting

In this notebook, we’ll outline the effects of using the “interlacing” technique when painting a density field to a mesh. Interlacing is designed to reduce the effects of aliasing in Fourier space, which is caused by the finite sampling of the mesh. See Section 3.1 of Sefusatti et al. 2015 for an introduction to the interlacing technique. Aliasing leads to spurious contributions to Fourier space statistics, and the effects become worse as one approaches the Nyquist frequency, defined as:

\[k_\mathrm{nyq} = \pi \frac{N_\mathrm{mesh}}{L_\mathrm{box}}\]

Interlacing is designed to remove the odd, high-frequency image multiples, thus reducing the overall aliasing contribution at \(k < k_\mathrm{nyq}\).

Here, we will consider the effects of interlacing on the power spectrum when using a mesh of size \(N_\mathrm{mesh}=256\), as compared to the “truth”, computed using a mesh with 2x the resolution, i.e., \(N_\mathrm{mesh} = 512\). We consider the effects when using the Cloud in Cell (CIC) and Triangular Shaped Cloud (TSC) interpolation methods.

We correct the measured power spectrum for the effects of the interpolation window using equation 18 of Jing et al. 2005. This amounts to dividing the power spectrum by:

\[W^2(k_x, k_y, k_z) = \prod_{i=x,y,z} \left[ \frac{\sin[\pi k_i / (2 k_\mathrm{nyq})]}{\pi k_i / (2 k_\mathrm{nyq})} \right]^p,\]

where \(p = 4,6\) for CIC and TSC windows, respectively. To accomplish this correction in nbodykit, we use the CompensateTSC and CompensateCIC functions.
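As a minimal sketch, this correction factor can be written directly from the equation above (a stand-alone illustration, not nbodykit’s internal implementation):

import numpy

def window_squared(kx, ky, kz, k_nyq, p):
    """Squared interpolation window of Jing et al. 2005 (eq. 18); p=4 for CIC, p=6 for TSC."""
    W2 = 1.0
    for ki in (kx, ky, kz):
        x = ki / (2 * k_nyq)
        W2 = W2 * numpy.sinc(x)**p    # numpy.sinc(x) = sin(pi x) / (pi x)
    return W2

# compensating amounts to dividing the measured power by this factor, e.g.
# P_corrected = P_measured / window_squared(kx, ky, kz, k_nyq, p=6)   # TSC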

Note

We do not apply any additional corrections for aliasing outside of turning on interlacing. An additional approximation to correct aliasing is often used (see equation 20 of Jing et al. 2005), but we do not use that correction in order to focus on the effects of interlacing. See the functions CompensateTSCAliasing and CompensateCICAliasing for more info on these additional corrections.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style
from scipy.interpolate import InterpolatedUnivariateSpline as spline

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
Initializing a Log-normal Mock

We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-4} \ h^{3} \mathrm{Mpc}^{-3}\).

[3]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
BoxSize = 1380.
Nmesh = 256

cat = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=BoxSize, Nmesh=Nmesh, bias=2.0, seed=42)
Generating the “Truth”

We generate the “truth” power spectrum by using a higher resolution mesh with Nmesh=512, using the TSC window interpolation method. With a higher resolution mesh, the Nyquist frequency is larger, decreasing the effects of aliasing at a given \(k\) value.

[4]:
# use a high-resolution mesh to get the truth
mesh = cat.to_mesh(window='tsc', Nmesh=512, compensated=True)

# compute the 1D power of this mesh
r = FFTPower(mesh, mode='1d')

# create a smooth interpolation
truth = r.power
truth = spline(truth['k'], truth['power'].real - truth.attrs['shotnoise'])
The Effects of Interlacing

In this section, we use a mesh of size \(256^3\) to compute the 1D power \(P(k)\) using the CIC and TSC windows, both with and without the interlacing technique. We then compare this \(P(k)\) to the “truth”, measured using the higher-resolution mesh of size \(512^3\).

[5]:
for interlaced in [True, False]:
    for window in ['CIC', 'TSC']:

        # convert catalog to a mesh with desired window and interlacing
        mesh = cat.to_mesh(Nmesh=256, window=window, compensated=False, interlaced=interlaced)

        # apply correction for the window to the mesh
        compensation = mesh.CompensateCIC if window == 'CIC' else mesh.CompensateTSC
        mesh = mesh.apply(compensation, kind='circular', mode='complex')

        # compute the 1D power P(k)
        r = FFTPower(mesh, mode='1d')
        Pk = r.power

        # compare P(k) to the hi-resolution mesh P(k)
        label = 'interlaced=%s, window=%s' %(interlaced, window)
        plt.plot(Pk['k'], (Pk['power'].real - Pk.attrs['shotnoise']) / truth(Pk['k']), label=label)


# plot Nyquist frequency
k_ny = numpy.pi * Nmesh / BoxSize
plt.axvline(x=k_ny, c='k', label="Nyquist frequency for Nmesh=256")

# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k) / P(k)^\mathrm{truth}$")
plt.ylim(0.9, 1.2)
[5]:
(0.9, 1.2)
_images/cookbook_interlacing_8_1.png

In this figure, we clearly see the effects of aliasing on the lower-resolution mesh results, causing the results to deviate from unity. There are two trends to note in this figure. First, TSC interpolation performs better than the CIC window, since it is a higher-order interpolation scheme. Second, for a fixed window, turning on interlacing drastically reduces the contributions from aliasing. When using TSC and interlacing, \(P(k)\) is recovered almost perfectly up to the Nyquist frequency.

Mesh Interpolation Windows

In this notebook, we explore the different window options for interpolating discrete objects onto a mesh. We compute the 1D power spectrum \(P(k)\) of a density field interpolated from a log-normal mock using different windows. The windows we test are:

  • Cloud in Cell: cic (default in nbodykit)

  • Triangular Shaped Cloud: tsc

  • Lanczos Kernel with \(a=2\) and \(a=3\): lanczos2 and lanczos3

  • Symmetric Daubechies wavelet with sizes 6, 12, and 20: sym6, sym12, and sym20

We also include timing tests when using each of these windows. The CIC window (default) is the fastest and can be considerably faster than the other kernels, especially the wavelet windows with large sizes.

When computing the power spectrum \(P(k)\), we de-convolve the effects of the interpolation on the measured power (compensated=True) for the CIC and TSC windows, but do not apply any corrections for the other windows.

For more information on using the Daubechies wavelets for power spectrum measurements, see Cui et al. 2008.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style
from scipy.interpolate import InterpolatedUnivariateSpline as spline

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
Initializing a Log-normal Mock

We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-4} \ h^{3} \mathrm{Mpc}^{-3}\).

[3]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')

cat = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=1380., Nmesh=256, bias=2.0, seed=42)
cat_noise = UniformCatalog(nbar=3e-4, BoxSize=1380., seed=42)
Generating the “Truth”

We generate the “truth” power spectrum by using a higher resolution mesh with Nmesh=512, using the TSC window interpolation method. With a higher resolution mesh, the Nyquist frequency is larger, decreasing the effects of aliasing at a given \(k\) value.

[4]:
r = FFTPower(cat.to_mesh(window='tsc', Nmesh=512, compensated=True),
            mode='1d') # hi-resolution mesh
truth = r.power
truth = spline(truth['k'], truth['power'].real - truth.attrs['shotnoise'])

The “truth” power for the Poisson shot noise is simply \(\frac{1}{\bar{n}}\).

[5]:
truth_noise = lambda k: 0.0 * k + 1 / 3e-4
Comparing Different Windows / LCDM-like

We compare the results of measuring an LCDM-like \(P(k)\) with different windows to the high-resolution “truth” measured in the previous section. The goal is to identify which windows perform better or worse at minimizing these differences. We compute results using a lower-resolution mesh, Nmesh=256. The effects of aliasing are clear: the measured power differs from the “truth”. For CIC and TSC, there is a residual bias at small scales even after the resampling window has been compensated. This is because the true power spectrum is LCDM-like, while the compensation window is derived for a shot-noise-like signal.

[6]:
for window in ['cic', 'tsc', 'sym6', 'sym12', 'sym20', 'lanczos2', 'lanczos3']:

    print("computing power for window '%s'\n" %window + "-"*32)
    compensated = True if window in ['cic', 'tsc'] else False
    mesh = cat.to_mesh(Nmesh=256, window=window, compensated=compensated, interlaced=False)

    # compute the power
    %time r = FFTPower(mesh, mode='1d')
    Pk = r.power

    plt.plot(Pk['k'], (Pk['power'].real - Pk.attrs['shotnoise']) / truth(Pk['k']), label=window)

plt.legend(loc=0, ncol=2, fontsize=16)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k) / P(k)^\mathrm{truth}$")
plt.xlim(0.01, 0.6)
plt.ylim(0.9, 1.1)
computing power for window 'cic'
--------------------------------
CPU times: user 1.54 s, sys: 89.1 ms, total: 1.63 s
Wall time: 1.64 s
computing power for window 'tsc'
--------------------------------
CPU times: user 1.58 s, sys: 88.4 ms, total: 1.67 s
Wall time: 1.68 s
computing power for window 'sym6'
--------------------------------
CPU times: user 5.21 s, sys: 159 ms, total: 5.36 s
Wall time: 5.48 s
computing power for window 'sym12'
--------------------------------
CPU times: user 12.3 s, sys: 253 ms, total: 12.5 s
Wall time: 13 s
computing power for window 'sym20'
--------------------------------
CPU times: user 19.5 s, sys: 195 ms, total: 19.7 s
Wall time: 19.8 s
computing power for window 'lanczos2'
--------------------------------
CPU times: user 2.13 s, sys: 93.4 ms, total: 2.22 s
Wall time: 2.23 s
computing power for window 'lanczos3'
--------------------------------
CPU times: user 4.01 s, sys: 143 ms, total: 4.15 s
Wall time: 4.3 s
[6]:
(0.9, 1.1)
_images/cookbook_interpolation-windows_10_2.png

In this figure, we see that each window kernel has a specific \(k_\mathrm{max}\) below which we can reasonably trust the measured results. The wavelet kernels sym12 and sym20 perform the best but take over an order of magnitude longer to compute than the default kernel (CIC). The Triangular Shaped Cloud window performs next best, followed by the sym6 kernel and the CIC kernel. The Lanczos kernels do not perform well and should not be used without additional compensation corrections (currently not implemented by default in nbodykit).

Comparing Different Windows / Shotnoise-like

We compare the results of measuring a shot-noise-like \(P(k)\) with different windows to the analytic shot-noise “truth” defined in the previous section. We compute results using a lower-resolution mesh, Nmesh=256. The effects of aliasing and resampling are clear: for the CIC and TSC windows, compensation is effective at restoring the “missing” power, recovering the flat shot-noise spectrum. The wavelet-inspired resampling windows also recover the shot-noise power excellently. The Lanczos-based windows perform relatively badly due to the lack of a suitable compensation function.

[7]:
for window in ['cic', 'tsc', 'sym6', 'sym12', 'sym20', 'lanczos2', 'lanczos3']:

    print("computing power for window '%s'\n" %window + "-"*32)
    compensated = True if window in ['cic', 'tsc'] else False
    mesh = cat_noise.to_mesh(Nmesh=256, window=window, compensated=compensated, interlaced=False)

    # compute the power
    %time r = FFTPower(mesh, mode='1d')
    Pk = r.power

    plt.plot(Pk['k'], (Pk['power'].real) / truth_noise(Pk['k']), label=window)

plt.legend(loc=0, ncol=2, fontsize=16)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k) / P(k)^\mathrm{truth}$")
plt.xlim(0.01, 0.6)
plt.ylim(0.9, 1.1)
computing power for window 'cic'
--------------------------------
CPU times: user 1.71 s, sys: 122 ms, total: 1.83 s
Wall time: 1.86 s
computing power for window 'tsc'
--------------------------------
CPU times: user 1.99 s, sys: 151 ms, total: 2.14 s
Wall time: 2.21 s
computing power for window 'sym6'
--------------------------------
CPU times: user 7.99 s, sys: 231 ms, total: 8.22 s
Wall time: 8.59 s
computing power for window 'sym12'
--------------------------------
CPU times: user 18.5 s, sys: 395 ms, total: 18.9 s
Wall time: 19.5 s
computing power for window 'sym20'
--------------------------------
CPU times: user 28.1 s, sys: 473 ms, total: 28.6 s
Wall time: 29.4 s
computing power for window 'lanczos2'
--------------------------------
CPU times: user 3.41 s, sys: 205 ms, total: 3.61 s
Wall time: 3.8 s
computing power for window 'lanczos3'
--------------------------------
CPU times: user 5.42 s, sys: 138 ms, total: 5.56 s
Wall time: 5.57 s
[7]:
(0.9, 1.1)
_images/cookbook_interpolation-windows_13_2.png

Recommendations

Given the increased computational cost of the wavelet windows, we recommend that users use either the CIC or TSC window. To obtain the most precise results in the least amount of time, users should test the effects of setting interlaced=True (see this tutorial) while also trying various mesh sizes by changing the Nmesh parameter. The most robust results are obtained by running a series of convergence tests against a high-resolution mesh, which reduces the contributions of aliasing at fixed \(k\); this allows the user to robustly determine the best configuration for their desired accuracy.
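A minimal sketch of such a convergence test, re-using the catalog defined in this notebook (the Nmesh values are illustrative):

# sketch: measure P(k) on successively finer meshes; where the curves agree,
# the measurement is converged at that resolution
for nmesh in [128, 256, 512]:
    mesh = cat.to_mesh(Nmesh=nmesh, window='tsc', compensated=True, interlaced=True)
    Pk = FFTPower(mesh, mode='1d').power
    plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label="Nmesh=%d" % nmesh)

plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")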

Log-normal Mocks

In this notebook, we generate a LogNormalCatalog at \(z=0.55\) using the Planck 2015 cosmology parameters and the Eisenstein-Hu linear power spectrum fitting formula. We then plot the power spectra in both real and redshift space and compare to the input linear power spectrum.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style, setup_logging

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Initializing the Log-normal Mock

Here, we generate a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula.

We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-3} \ h^{3} \mathrm{Mpc}^{-3}\).

[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
b1 = 2.0

cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=b1, seed=42)
Real-space Power Spectrum

In this section, we compute and plot the power spectrum, \(P(k,\mu)\), in real space (no redshift distortions) using five \(\mu\) bins.

[5]:
# convert the catalog to the mesh, with CIC interpolation
real_mesh = cat.to_mesh(compensated=True, window='cic', position='Position')

# compute the 2d P(k,mu) power, with 5 mu bins
r = FFTPower(real_mesh, mode='2d', Nmu=5)
pkmu = r.power
[ 000017.54 ]   0: 10-30 20:51  CatalogMesh     INFO     painted 7887677 out of 7887677 objects to mesh
[ 000017.54 ]   0: 10-30 20:51  CatalogMesh     INFO     mean particles per cell is 0.470142
[ 000017.54 ]   0: 10-30 20:51  CatalogMesh     INFO     sum is 7.88768e+06
[ 000017.55 ]   0: 10-30 20:51  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000017.91 ]   0: 10-30 20:51  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
[6]:
# plot each mu bin
for i in range(pkmu.shape[1]):
    Pk = pkmu[:,i]
    label = r'$\mu$=%.1f' %pkmu.coords['mu'][i]
    plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)

# plot the biased linear power spectrum
k = numpy.logspace(-2, 0, 512)
plt.loglog(k, b1**2 * Plin(k), c='k', label=r'$b_1^2 P_\mathrm{lin}$')

# add a legend and axes labels
plt.legend(loc=0, ncol=2, fontsize=16)
plt.xlabel(r"$k$ [$h \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
plt.ylim(500, 2e5)
[6]:
(500, 200000.0)
_images/cookbook_lognormal-mocks_8_1.png

In this figure, we see that the power spectrum in real space is independent of \(\mu\), as expected. At low \(k\), the measured power agrees with the biased linear power \(b_1^2 P_\mathrm{lin}\). But at high \(k\), we see the effects of applying the Zel’dovich approximation to the density field, mirroring the effects of nonlinear evolution.

Redshift-space Power Spectrum

In this section, we compute and plot the power spectrum, \(P(k,\mu)\), in redshift space and see the effects of redshift-space distortions on the 2D power spectrum.

[7]:
def kaiser_pkmu(k, mu):
    """
    Returns the Kaiser linear P(k,mu) in redshift space

    .. math::

        P(k,mu) = (1 + f/b_1 mu^2)^2 b_1^2 P_\mathrm{lin}(k)
    """
    beta = cosmo.scale_independent_growth_rate(redshift) / b1
    return (1 + beta*mu**2)**2 * b1**2 * Plin(k)

Here, we add redshift space distortions (RSD) along the \(z\)-axis of the box. We use the VelocityOffset column which is equal to the Velocity column re-normalized properly for RSD.

[8]:
LOS = [0,0,1] # the z-axis

# add RSD to the Position
cat['RSDPosition'] = cat['Position'] + cat['VelocityOffset'] * LOS
[9]:
# from catalog to mesh
rsd_mesh = cat.to_mesh(compensated=True, window='cic', position='RSDPosition')

# compute the 2D power
r = FFTPower(rsd_mesh, mode='2d', Nmu=5, los=LOS)
pkmu = r.power

# plot each mu power bin, normalized by the expected Kaiser power
for i in range(pkmu.shape[1]):
    Pk = pkmu[:,i]
    mu = pkmu.coords['mu'][i]
    label = r'$\mu$=%.1f' %mu

    P = Pk['power'].real-Pk.attrs['shotnoise']
    plt.plot(Pk['k'], P / kaiser_pkmu(Pk['k'], mu), label=label)

# add a legend and axis labels
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu) / P^\mathrm{Kaiser}(k,\mu)$")
plt.xlim(0.01, 0.6)
plt.ylim(0, 2.5)
[ 000026.10 ]   0: 10-30 20:51  CatalogMesh     INFO     painted 7887677 out of 7887677 objects to mesh
[ 000026.10 ]   0: 10-30 20:51  CatalogMesh     INFO     mean particles per cell is 0.470142
[ 000026.10 ]   0: 10-30 20:51  CatalogMesh     INFO     sum is 7.88768e+06
[ 000026.10 ]   0: 10-30 20:51  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000026.46 ]   0: 10-30 20:51  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
/Users/nhand/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:15: RuntimeWarning: divide by zero encountered in true_divide
  from ipykernel import kernelapp as app
[9]:
(0, 2.5)
_images/cookbook_lognormal-mocks_14_2.png

In this figure, we clearly see the effects of RSD, which causes the higher \(\mu\) bins to be damped at high \(k\), relative to the lower \(\mu\) bins. Also, the measured power agrees with the expected Kaiser formula for linear RSD at low \(k\) (where linear theory holds): \(P^\mathrm{Kaiser} = (1 + f/b_1 \mu^2)^2 b_1^2 P_\mathrm{lin}(k)\).

Painting a Catalog to a Mesh

In this notebook, we outline the process of painting a discrete catalog of objects to a mesh. We discuss some of the most common examples, including the default behavior, which paints \(1+\delta\), painting the line-of-sight momentum field, and painting multiple species of particles to the same mesh.

Each section in this notebook is designed to be stand-alone, providing a full tutorial of the specific topic for users.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style, setup_logging

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Painting the Overdensity Field

When painting a catalog to a mesh, the painted density field is equal to \(1+\delta(\mathbf{x})\) when using the default configuration of nbodykit. In this section, we will paint a log-normal mock catalog of objects to a mesh and take a look at some of the properties of the resulting \(1+\delta(\mathbf{x})\) field.

To start, we initialize a log-normal mock catalog of unbiased (\(b_1=1\)) objects at a redshift of \(z=0.55\) in a box of side length \(L_\mathrm{box} = 1380 \ \mathrm{Mpc}/h\).

[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')

cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=1.0, seed=42)

Next, we convert the catalog to a mesh object, using the Triangular Shaped Cloud interpolation window.

[5]:
mesh = cat.to_mesh(window='tsc')

The paint() function returns the painted density field on the mesh, as a RealField object.

[6]:
one_plus_delta = mesh.paint(mode='real')
print(type(one_plus_delta))
[ 000020.17 ]   0: 10-30 20:42  CatalogMesh     INFO     painted 7886161 out of 7886161 objects to mesh
[ 000020.17 ]   0: 10-30 20:42  CatalogMesh     INFO     mean particles per cell is 0.470052
[ 000020.17 ]   0: 10-30 20:42  CatalogMesh     INFO     sum is 7.88616e+06
[ 000020.17 ]   0: 10-30 20:42  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000020.22 ]   0: 10-30 20:42  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done
<class 'pmesh.pm.RealField'>

The underlying array data of the RealField object is stored in the value attribute. Below, we compute the mean, maximum, and minimum of the \(1+\delta\) field. As expected, these values are consistent with the properties of the overdensity field \(\delta\), specifically \(\mathrm{min}(\delta) = -1\) and \(\langle \delta \rangle = 0\).

[7]:
print("mean of 1+delta = ", one_plus_delta.value.mean())
print("min of 1+delta = ", one_plus_delta.value.min())
print("max of 1+delta = ", one_plus_delta.value.max())
mean of 1+delta =  1.0
min of 1+delta =  0.0
max of 1+delta =  17.8592

We can preview a 2D projection of the density field with the preview function.

[8]:
plt.imshow(one_plus_delta.preview(axes=[0,1]))
[8]:
<matplotlib.image.AxesImage at 0x117a1bac8>
_images/cookbook_painting_14_1.png

Finally, we take a look at a 1D histogram of the \(1+\delta\) field values in each cell on the mesh. The distribution of values in the \(1+\delta\) field very roughly follows a log-normal distribution, providing a semi-realistic representation of the distribution of mass in the true Universe.

[9]:
# histogram of 1+delta in log-spaced bins
bins = numpy.logspace(-7, numpy.log10(30.), 100)
_ = plt.hist(one_plus_delta.value.flatten(), bins=bins)

# format the axes
plt.xscale('log')
plt.xlabel(r"$1+\delta$")
plt.ylabel(r"$N_\mathrm{cells}$")
plt.xlim(1e-2, 20)
[9]:
(0.01, 20)
_images/cookbook_painting_16_1.png
Painting the Line-of-sight Momentum Field

In this section, we describe how to use nbodykit to paint the line-of-sight momentum field. The paint() function paints mass-weighted (or equivalently, number-weighted) quantities to the mesh, as

\[F(\mathbf{x}) = \left [ 1 + \delta(\mathbf{x}) \right] V(\mathbf{x}),\]

where \(\delta\) is the usual overdensity field and \(V(\mathbf{x})\) is the field value. By default, \(V(\mathbf{x})\) is equal to unity, such that the field painted is \(1+\delta\) (see the previous section). Here, we set the field value \(V(\mathbf{x})\) to be the line-of-sight velocity field, \(V(\mathbf{x}) = v_\parallel(\mathbf{x})\), and paint the line-of-sight momentum field (mass-weighted velocity).

In this section, we compute the cross power spectrum of the momentum field with the density field, the auto spectrum of the momentum field, and the density auto spectrum for a log-normal mock and compare to the corresponding linear predictions for these terms. For more details on these momentum correlators, see Vlah et al. 2012.

We start by initializing a log-normal mock catalog of unbiased (\(b_1=1\)) objects at a redshift of \(z=0.55\) in a box of side length \(L_\mathrm{box} = 1380 \ \mathrm{Mpc}/h\).

[10]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')

cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=1.0, seed=42)

We choose the \(x\) axis of the box to be our line-of-sight, and add a column to our catalog holding the \(x\) component of the velocity field, properly normalized by the conformal Hubble factor \(\mathcal{H}\), where \(\mathcal{H} = a H\). We perform this normalization such that the resulting velocity field has units of \(\mathrm{Mpc}/h\). Conveniently, the VelocityOffset column holds the quantity we desire, so we select the first column for the \(x\) component.

[11]:
# line-of-sight is x-axis
LOS = [1, 0, 0]


# this is the velocity / (a*H)
cat['Vx'] = cat['VelocityOffset'][:,0]  # units are Mpc/h

Next, we convert our catalog to mesh objects in order to paint the momentum and overdensity fields. To paint the momentum, we simply specify the value keyword as the Vx column, holding the line-of-sight velocity.

[12]:
# mesh to paint momentum by specifying "value" keyword
momentum_mesh = cat.to_mesh(compensated=True, window='tsc', position='Position', value='Vx')

# mesh to paint 1+delta by using default value=1.0
mesh = cat.to_mesh(compensated=True, window='tsc', position='Position')

Now, we can use these mesh objects to compute the desired auto and cross spectra of the overdensity and momentum fields using the FFTPower algorithm. First, we compute the usual density auto spectrum using

[13]:
# the auto power spectrum of the overdensity field
r00 = FFTPower(mesh, mode='1d', los=[1,0,0])
[ 000044.22 ]   0: 10-30 20:42  CatalogMesh     INFO     painted 7886161 out of 7886161 objects to mesh
[ 000044.22 ]   0: 10-30 20:42  CatalogMesh     INFO     mean particles per cell is 0.470052
[ 000044.23 ]   0: 10-30 20:42  CatalogMesh     INFO     sum is 7.88616e+06
[ 000044.23 ]   0: 10-30 20:42  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000044.60 ]   0: 10-30 20:42  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done

Next, we compute the cross power spectrum of the momentum and overdensity fields. The corresponding power spectrum has \(\mu\) dependence (see section 3.2 of Vlah et al. 2012), so we compute the \(\ell=1\) multipole to extract this angular dependence. Note that we also specify the \(x\) axis as the line-of-sight to use when determining the \(\mu\) coordinate.

[14]:
# the cross spectrum of the momentum and density
r01 = FFTPower(momentum_mesh, mode='1d', poles=[1], second=mesh, los=LOS)
[ 000051.03 ]   0: 10-30 20:43  CatalogMesh     INFO     painted 7886161 out of 7886161 objects to mesh
[ 000051.03 ]   0: 10-30 20:43  CatalogMesh     INFO     mean particles per cell is 0.470052
[ 000051.03 ]   0: 10-30 20:43  CatalogMesh     INFO     sum is 13.9668
[ 000051.03 ]   0: 10-30 20:43  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000051.40 ]   0: 10-30 20:43  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done
[ 000056.84 ]   0: 10-30 20:43  CatalogMesh     INFO     painted 7886161 out of 7886161 objects to mesh
[ 000056.85 ]   0: 10-30 20:43  CatalogMesh     INFO     mean particles per cell is 0.470052
[ 000056.85 ]   0: 10-30 20:43  CatalogMesh     INFO     sum is 7.88616e+06
[ 000056.85 ]   0: 10-30 20:43  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000057.22 ]   0: 10-30 20:43  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done

And finally, we compute the auto power spectrum of the momentum field. We are interested in the \(\mu^2\) dependence of the power spectrum (see section 3.2 of Vlah et al. 2012), so we compute the \(\ell=2\) multipole to extract this angular dependence.

[15]:
# the auto spectrum of the momentum field
r11 = FFTPower(momentum_mesh, mode='1d', poles=[2], los=LOS)
[ 000064.09 ]   0: 10-30 20:43  CatalogMesh     INFO     painted 7886161 out of 7886161 objects to mesh
[ 000064.09 ]   0: 10-30 20:43  CatalogMesh     INFO     mean particles per cell is 0.470052
[ 000064.09 ]   0: 10-30 20:43  CatalogMesh     INFO     sum is 13.9668
[ 000064.09 ]   0: 10-30 20:43  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000064.55 ]   0: 10-30 20:43  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done

Following Vlah et al. 2012, we can use these measured momentum correlators to compute their contributions to the redshift-space power spectrum, \(P(k,\mu)\). Specifically, we can compute the first three terms as

\[P(k,\mu) = P_{00}(k) - 2 i k \mu P_{01}(k,\mu) + k^2 \mu^2 P_{11}(k,\mu).\]

In linear theory, these terms are given by (see Kaiser 1987):

\[P_\mathrm{lin}(k,\mu) = P_\mathrm{lin}(k) + 2 f \mu^2 P_\mathrm{lin}(k) + f^2 \mu^4 P_\mathrm{lin}(k).\]

Below, we will compute the contributions to \(P(k,\mu)\) from our measured power spectra and compare to their linear counterparts.

[16]:
# shot noise subtracted density auto spectrum
P00 = r00.power['power'].real - r00.attrs['shotnoise']

The density-momentum cross spectrum has \(\mu\) dependence, so the dipole (\(\ell=1\)) is equal to our desired \(P_{01}\). Note that the dipole is imaginary, so we select the imaginary component of the measured spectrum. Finally, we multiply by the pre-factor of \(2k\) from the equation above.

Because \(P_{01}\) is purely imaginary, this yields a real result, as required. Also note that the final contribution to \(P(k,\mu)\) has \(\mu^2\) dependence, as expected.

[17]:
# density-momentum contribution
P01 = 2 * r01.poles['k'] * r01.poles['power_1'].imag

We are interested in the \(\mu^2\) term of the \(P_{11}(k,\mu)\) correlator, which can be computed by rescaling the measured \(\ell=2\) multipole of the momentum auto spectrum by a factor of \(3/2\). We then multiply by the necessary \(k^2\) factor from the above equation to obtain the final contribution of the third term to \(P(k,\mu)\).

Note that the final contribution to \(P(k,\mu)\) has \(\mu^4\) dependence, as expected.
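For reference, the \(3/2\) rescaling follows from the standard multipole normalization,

\[P_\ell(k) = \frac{2\ell+1}{2} \int_{-1}^{1} d\mu \, P(k,\mu) \, \mathcal{L}_\ell(\mu).\]

For a spectrum with pure \(\mu^2\) dependence, \(P_{11}(k,\mu) = A(k)\,\mu^2\), this gives \(P_2(k) = \tfrac{2}{3} A(k)\), so the \(\mu^2\) coefficient is \(A(k) = \tfrac{3}{2} P_2(k)\). The analogous projection of the \(\mu\)-proportional cross spectrum onto \(\ell=1\) gives a factor of unity, which is why no rescaling was needed for \(P_{01}\) above.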

[18]:
# momentum auto contribution
P11 = 1.5 * r11.poles['k']**2 * r11.poles['power_2'].real

The linear contributions to \(P(k,\mu)\) are given by:

[19]:
# set up the wavenumbers
k = numpy.logspace(-2, 0, 1000)

# the growth rate at this redshift
f = cosmo.scale_independent_growth_rate(redshift)

# linear first term
P00_lin = Plin(k)

# linear second term
P01_lin = 2*f*P00_lin

# linear third term
P11_lin = f**2 * P00_lin

Finally, we can plot the results and compare!

[20]:
# P00
plt.loglog(r00.power['k'], P00, label=r'$P_{00}$')
plt.loglog(k, P00_lin, c='k', label='linear prediction')

# P01
plt.loglog(r01.power['k'], P01, label=r"$P_{01}$")
plt.loglog(k, P01_lin, c='k')

# P11
plt.loglog(r11.power['k'], P11, label=r"$P_{11}$")
plt.loglog(k, P11_lin, c='k')

# format the axes
plt.legend(loc='lower left', ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P_{xy}$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[20]:
(0.01, 0.6)
_images/cookbook_painting_39_1.png
Painting Multiple Species of Particles

In this section, we generate two “fake” catalogs of data, designed to represent dark matter particles and hydro particles from a cosmological simulation. We assign the two particles different masses, as is often the case in simulations, and paint the catalogs to the same mesh using the MultipleSpeciesCatalog object. Finally, we compute the combined power spectrum and compare the power spectra of the individual catalogs.

We begin by generating mock catalogs to represent dark matter and hydro particles of a simulation. We use our log-normal mock generator at a redshift of \(z=0.55\) and a box size of \(L_\mathrm{box} = 1380 \ \mathrm{Mpc}/h\).

[21]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')

# the "hydro" particles
hydro = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=2.0, seed=42)

# the "dark matter" particles
dm = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=1.0, seed=84)

Often in simulations, different species of particles will have different masses. Here, we arbitrarily assign the dark matter particles to be 3x as heavy as the hydro particles. We add a new column to each catalog to represent this Mass column.

[22]:
hydro['Mass'] = 0.25
dm['Mass'] = 0.75

m_hydro = hydro['Mass']
m_dm = dm['Mass']

We can combine both species of particles into a single catalog using the MultipleSpeciesCatalog object. This object takes a list of strings giving the names of each species and the individual catalogs.

[23]:
combined = MultipleSpeciesCatalog(['hydro', 'dm'], hydro, dm)
print(combined)
MultipleSpeciesCatalog(species=['hydro', 'dm'])

In this combined catalog, the columns of an individual species can be accessed by prefixing the column name with the species name and a forward slash, i.e., “dm/” or “hydro/”.

[24]:
print(combined.columns)
['hydro/Mass', 'hydro/Position', 'hydro/Selection', 'hydro/Value', 'hydro/Velocity', 'hydro/VelocityOffset', 'hydro/Weight', 'dm/Mass', 'dm/Position', 'dm/Selection', 'dm/Value', 'dm/Velocity', 'dm/VelocityOffset', 'dm/Weight']

For example, the mass of the dark matter is set to 0.75 for all dark matter particles.

[25]:
print(combined['dm/Mass'].compute())
[ 0.75  0.75  0.75 ...,  0.75  0.75  0.75]

And the mass of the hydro particles is set to 0.25.

[26]:
print(combined['hydro/Mass'].compute())
[ 0.25  0.25  0.25 ...,  0.25  0.25  0.25]

Now, we can convert our catalogs to mesh objects, and specify the Mass column to be used as the weight value when painting to the mesh. This is done via the weight keyword of the to_mesh() function. By specifying weight=Mass, each hydro particle will contribute a value of 0.25 to the mesh and each dark matter particle will contribute a value of 0.75. The resulting density field is normalized as \(1+\delta\), using the sum of the densities for each species.

So, the painted field is:

\[F(\mathbf{x}) = 1 + \delta_\mathrm{tot}(\mathbf{x}),\]

where \(\delta_\mathrm{tot}\) is given by

\[\delta_\mathrm{tot}(\mathbf{x}) = \frac{1}{\bar{n}} \left[ w_\mathrm{dm}(\mathbf{x}) n_\mathrm{dm}(\mathbf{x}) + w_\mathrm{hydro}(\mathbf{x}) n_\mathrm{hydro}(\mathbf{x})\right] - 1,\]

and the mean density is determined from the sum of each species as

\[\bar{n} = \langle w_\mathrm{dm}(\mathbf{x}) \rangle \bar{n}_\mathrm{dm} + \langle w_\mathrm{hydro}(\mathbf{x}) \rangle \bar{n}_\mathrm{hydro}.\]

Here, the mean number density of each species is assumed to be constant throughout the box, given by the total number of objects of that species divided by the box volume.

Note

nbodykit supports spatially varying weights, even though in our case, the weights are constant, such that \(w_\mathrm{dm}(\mathbf{x}) = 0.75\) and \(w_\mathrm{hydro}(\mathbf{x}) = 0.25\).

[27]:
# the combined mesh, weighted by mass
combined_mesh = combined.to_mesh(Nmesh=256, BoxSize=1380.0, compensated=True, window='tsc', weight='Mass')

# the hydro-only mesh
hydro_mesh = hydro.to_mesh(Nmesh=256, BoxSize=1380.0, compensated=True, window='tsc', weight='Mass')

# the dm-only mesh
dm_mesh = dm.to_mesh(Nmesh=256, BoxSize=1380.0, compensated=True, window='tsc', weight='Mass')

First, compute the 1D power \(P(k)\) of the combined mesh of particles.

[28]:
r_combined = FFTPower(combined_mesh, mode='1d')
[ 000095.27 ]   0: 10-30 20:43  MultipleSpeciesCatalogMesh INFO     painting the 'hydro' species
[ 000100.98 ]   0: 10-30 20:43  CatalogMesh     INFO     painted 7887677 out of 7887677 objects to mesh
[ 000100.98 ]   0: 10-30 20:43  CatalogMesh     INFO     mean particles per cell is 0.117536
[ 000100.98 ]   0: 10-30 20:43  CatalogMesh     INFO     sum is 1.97192e+06
[ 000100.98 ]   0: 10-30 20:43  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000101.00 ]   0: 10-30 20:43  MultipleSpeciesCatalogMesh INFO     painting the 'dm' species
[ 000108.29 ]   0: 10-30 20:43  CatalogMesh     INFO     painted 7881785 out of 7881785 objects to mesh
[ 000108.29 ]   0: 10-30 20:43  CatalogMesh     INFO     mean particles per cell is 0.352343
[ 000108.29 ]   0: 10-30 20:43  CatalogMesh     INFO     sum is 7.88326e+06
[ 000108.29 ]   0: 10-30 20:43  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000109.11 ]   0: 10-30 20:43  MultipleSpeciesCatalogMesh INFO     field: (MultipleSpeciesCatalog(species=['hydro', 'dm']) as CatalogMesh) painting done

We also compute the dark-matter-only and hydro-only power spectra.

[29]:
# dark matter only power
r_dm = FFTPower(dm_mesh, mode='1d')

# hydro only power
r_hydro = FFTPower(hydro_mesh, mode='1d')
[ 000118.44 ]   0: 10-30 20:44  CatalogMesh     INFO     painted 7881785 out of 7881785 objects to mesh
[ 000118.44 ]   0: 10-30 20:44  CatalogMesh     INFO     mean particles per cell is 0.352343
[ 000118.44 ]   0: 10-30 20:44  CatalogMesh     INFO     sum is 5.91133e+06
[ 000118.44 ]   0: 10-30 20:44  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000119.13 ]   0: 10-30 20:44  CatalogMesh     INFO     field: (LogNormalCatalog(seed=84, bias=1) as CatalogMesh) painting done
[ 000128.23 ]   0: 10-30 20:44  CatalogMesh     INFO     painted 7887677 out of 7887677 objects to mesh
[ 000128.23 ]   0: 10-30 20:44  CatalogMesh     INFO     mean particles per cell is 0.117536
[ 000128.23 ]   0: 10-30 20:44  CatalogMesh     INFO     sum is 1.97192e+06
[ 000128.23 ]   0: 10-30 20:44  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000128.69 ]   0: 10-30 20:44  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done

We can express the total overdensity field as a weighted sum of the individual fields:

\[\delta_\mathrm{tot} = \frac{\alpha_\mathrm{dm}\delta_\mathrm{dm} + \alpha_\mathrm{hydro}\delta_\mathrm{hydro}}{\alpha_\mathrm{dm} + \alpha_\mathrm{hydro}},\]

where \(\alpha_i = \langle w_i \rangle \bar{n}_i\). Using this equation, we can compute the power spectrum of the combined mesh from the individual power spectra as

\[P_\mathrm{tot}(k) = \left(\frac{\alpha_\mathrm{dm}}{\alpha_\mathrm{dm} + \alpha_\mathrm{hydro}}\right)^2 P_\mathrm{dm} + \left(\frac{\alpha_\mathrm{hydro}}{\alpha_\mathrm{dm} + \alpha_\mathrm{hydro}}\right)^2 P_\mathrm{hydro}.\]

Note that we have assumed the two fields to be uncorrelated, which is true in this case since the catalogs were generated with independent seeds, so there is no cross power spectrum between the fields.
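As a quick cross-check, here is a minimal sketch of computing the \(\alpha_i\) coefficients explicitly; since the weights are constant and both species share the same number density here, the coefficients reduce to the mass fractions used in the next cell:

# explicit alpha_i = <w_i> * nbar_i (the weights are constant here)
V = 1380.0**3                          # box volume in (Mpc/h)^3
alpha_hydro = 0.25 * hydro.csize / V
alpha_dm = 0.75 * dm.csize / V
norm = alpha_hydro + alpha_dm

# the squared coefficients multiplying P_hydro and P_dm
print("hydro coefficient = ", (alpha_hydro / norm)**2)
print("dm coefficient = ", (alpha_dm / norm)**2)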

[30]:
# the individual power spectra
P_dm = r_dm.power['power'].real - r_dm.attrs['shotnoise']
P_hydro = r_hydro.power['power'].real - r_hydro.attrs['shotnoise']

# the predicted power spectrum
norm = m_hydro.sum() + m_dm.sum()
Ptot_predicted = (m_hydro.sum()/norm)**2 * P_hydro + (m_dm.sum()/norm)**2 * P_dm

Now, we compare the individual power spectra, the measured combined power spectrum, and our predicted combined power spectrum.

[31]:
# the measured combined power
Ptot = r_combined.power['power'].real
plt.loglog(r_combined.power['k'], Ptot - r_combined.attrs['shotnoise'], label='DM + hydro (measured)')

# predicted combined power
plt.loglog(r_combined.power['k'], Ptot_predicted, label='DM + hydro (predicted)')

# plot the individual spectra
plt.loglog(r_dm.power['k'], P_dm , label='DM only')
plt.loglog(r_hydro.power['k'], P_hydro, label='hydro only')

# format the axes
plt.legend(loc='lower left', ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[31]:
(0.01, 0.6)
_images/cookbook_painting_62_1.png

We see that the measured power spectrum for the combined DM + hydro mesh (blue) agrees well with the predicted value computed from the individual power spectra (orange). As expected, the hydro and dark matter only power spectra have similar shapes, but the hydro only power is more strongly biased (\(b_1=2\) vs \(b_1=1\)).

The Projected Power Spectrum of Data in a Simulation Box

In this notebook, we explore the functionality of the ProjectedFFTPower algorithm, which can compute the power spectrum of density fields projected in 1D and 2D, commonly referred to as \(P_\mathrm{1D}(k)\) and \(P_\mathrm{2D}(k)\). The algorithm is suitable for use on data sets in periodic simulation boxes, as the power spectrum is computed via a single FFT of the projected density mesh.

[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style

import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing a Log-normal Mock

We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-3} \ h^{3} \mathrm{Mpc}^{-3}\).

[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
BoxSize = 1380.
b1 = 2.0

cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=BoxSize, Nmesh=256, bias=b1, seed=42)

We update the Position column to add redshift-space distortions along the z axis of the box using the VelocityOffset column.

[5]:
# add RSD
line_of_sight = [0,0,1]
cat['Position'] += cat['VelocityOffset'] * line_of_sight
Computing the Projected Power

In this section, we compute and plot the power spectra of the projected 1D and 2D density fields, and compare to the full 3D power spectrum.

We must first convert our CatalogSource object to a MeshSource, by setting up the mesh and specifying which interpolation kernel we wish to use. Here, we use “TSC” interpolation, and specify via compensated=True that we wish to correct for the effects of the interpolation window in Fourier space.

[6]:
# convert to a MeshSource, using TSC interpolation on 256^3 mesh
mesh = cat.to_mesh(window='tsc', Nmesh=256, compensated=True)

We compute the 2D projected power by specifying that we wish to project the density field onto the (x,y) plane by passing axes=[0,1] to ProjectedFFTPower.

[7]:
# the projected 2D power
r_2d = ProjectedFFTPower(mesh, dk=0.005, kmin=0.01, axes=[0,1])
[ 000018.50 ]   0: 10-30 20:51  CatalogMesh     INFO     painted 7887677 out of 7887677 objects to mesh
[ 000018.50 ]   0: 10-30 20:51  CatalogMesh     INFO     mean particles per cell is 0.470142
[ 000018.50 ]   0: 10-30 20:51  CatalogMesh     INFO     sum is 7.88768e+06
[ 000018.51 ]   0: 10-30 20:51  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000018.89 ]   0: 10-30 20:51  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done

We compute the 1D projected power by specifying that we wish to project the density field onto the x axis by passing axes=[0] to ProjectedFFTPower.

[8]:
# the projected 1D power
r_1d = ProjectedFFTPower(mesh, dk=0.005, kmin=0.01, axes=[0])
[ 000025.35 ]   0: 10-30 20:52  CatalogMesh     INFO     painted 7887677 out of 7887677 objects to mesh
[ 000025.35 ]   0: 10-30 20:52  CatalogMesh     INFO     mean particles per cell is 0.470142
[ 000025.35 ]   0: 10-30 20:52  CatalogMesh     INFO     sum is 7.88768e+06
[ 000025.35 ]   0: 10-30 20:52  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000025.71 ]   0: 10-30 20:52  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done

We also use the FFTPower algorithm to compute the power spectrum of the full 3D density field for comparison purposes.

[9]:
# the 3D power P(k)
r_3d = FFTPower(mesh, mode='1d', dk=0.005, kmin=0.01)
[ 000031.43 ]   0: 10-30 20:52  CatalogMesh     INFO     painted 7887677 out of 7887677 objects to mesh
[ 000031.43 ]   0: 10-30 20:52  CatalogMesh     INFO     mean particles per cell is 0.470142
[ 000031.43 ]   0: 10-30 20:52  CatalogMesh     INFO     sum is 7.88768e+06
[ 000031.43 ]   0: 10-30 20:52  CatalogMesh     INFO     normalized the convention to 1 + delta
[ 000031.78 ]   0: 10-30 20:52  CatalogMesh     INFO     field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done

The power spectrum results are stored as a BinnedStatistic object in the power attribute.

[10]:
# the result is stored at "power" attribute
P1D = r_1d.power
P2D = r_2d.power
P3D = r_3d.power

Here, we plot each of the measured power spectra, scaled appropriately by the box size \(L_\mathrm{box}\) so that each power spectrum has the same units.

[11]:
# plot the 1D, 2D, and 3D power spectra
plt.loglog(P1D['k'], P1D['power'].real * BoxSize**2, label=r'$L_\mathrm{box}^2 P_\mathrm{1D}$')
plt.loglog(P2D['k'], P2D['power'].real * BoxSize, label=r"$L_\mathrm{box} P_\mathrm{2D}$")
plt.loglog(P3D['k'], P3D['power'].real, label=r"$P_\mathrm{3D}$")

# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[11]:
(0.01, 0.6)
_images/cookbook_projected-fftpower_20_1.png

We can see that \(P_\mathrm{1D}\) and \(P_\mathrm{2D}\) are considerably noisier than the 3D power spectrum, because projecting the field over one or two axes drastically reduces the number of Fourier modes available to average over in each wavenumber bin.


The (static) recipes below are provided as Jupyter notebooks and are available for download by clicking the “Source” link in the navigation bar at the top of the page.

Data Recipes

Painting Recipes

Algorithm Recipes

Contributing

If you have an application of nbodykit that is concise and interesting, please consider adding it to our cookbook. We also welcome feedback and improvements for these recipes. Users can submit issues or open a pull request on the nbodykit cookbook repo on GitHub.

Cookbook recipes should be in the form of Jupyter notebooks. See the existing recipes for examples. The recipes are designed to illustrate interesting uses of nbodykit for other users to learn from.

We appreciate any and all contributions!

Dealing with Discrete Data

The main interface for dealing with data in the form of catalogs of discrete objects is provided by subclasses of the nbodykit.base.catalog.CatalogSource object. In this section, we provide an overview of this class and note important things to know.

What is a CatalogSource?

Most often the user starts with a catalog of discrete objects, with a set of fields describing each object, such as the position coordinates, velocity, mass, etc. Given this input data, the user wishes to use nbodykit to perform a task, i.e., computing the power spectrum or grouping together objects with a friends-of-friends algorithm. To achieve these goals, nbodykit provides the nbodykit.base.catalog.CatalogSource base class.

The CatalogSource object behaves much like a numpy structured array, where the fields of the array are referred to as “columns”. These columns store the information about the objects in the catalog; common columns are “Position”, “Velocity”, “Mass”, etc. A list of the column names that are valid for a given catalog can be accessed via the CatalogSource.columns attribute.

Use Cases

The CatalogSource is an abstract base class – it cannot be directly initialized. Instead, nbodykit includes several specialized catalog subclasses of CatalogSource in the nbodykit.source.catalog module. In general, these subclasses fall into two categories:

  1. Reading data from disk (see Reading Catalogs from Disk)

  2. Generating mock data at run time (see Generating Catalogs of Mock Data)

Requirements

A well-defined size

The only requirement to initialize a CatalogSource is that the object has a well-defined size. Information about the length of a CatalogSource is stored in two attributes:

  • CatalogSource.size : the local size of the catalog, equal to the number of objects in the catalog on the local rank

  • CatalogSource.csize : the collective, global size of the catalog, equal to the sum of size across all MPI ranks

So, the user can think of a CatalogSource object as storing information for a total of csize objects, which is divided amongst the available MPI ranks such that each process only stores information about size objects.
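For example, a minimal sketch (run under MPI, e.g. with mpirun, to see the two attributes differ across ranks):

from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

print("local size = ", cat.size)    # number of objects on this MPI rank
print("global size = ", cat.csize)  # total number of objects across all ranks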

The Position column

All CatalogSource objects must include the Position column, which should be a (N,3) array giving the Cartesian position of each of the N objects in the catalog.

Often, the user will have the Cartesian coordinates stored as separate columns or have the object coordinates in terms of right ascension, declination, and redshift. See Common Data Operations for more details about how to construct the Position column for these cases.
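For example, here is a minimal sketch of both cases using helpers from the nbodykit.transform module (cat is a catalog with hypothetical columns x, y, z or ra, dec, z_red; check the transform module documentation for the exact keyword names and defaults):

from nbodykit.lab import cosmology, transform

# combine separate Cartesian coordinate columns into a (N,3) Position column
cat['Position'] = transform.StackColumns(cat['x'], cat['y'], cat['z'])

# or convert sky coordinates (RA, Dec in degrees) and redshift to Cartesian positions
cosmo = cosmology.Planck15
cat['Position'] = transform.SkyToCartesian(cat['ra'], cat['dec'], cat['z_red'], cosmo=cosmo)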

Default Columns

All CatalogSource objects include several default columns. These columns are used broadly throughout nbodykit and can be summarized as follows:

  • Weight: the weight to use for each particle when interpolating a CatalogSource on to a mesh. The mesh field is a weighted average of Value, with the weights given by Weight. Default value: 1.0

  • Value: when interpolating a CatalogSource on to a mesh, the value of this array is used as the field value that each particle contributes to a given mesh cell. The mesh field is a weighted average of Value, with the weights given by Weight. For example, the Value column could represent Velocity, in which case the field painted to the mesh will be momentum (mass-weighted velocity). Default value: 1.0

  • Selection: a boolean column that selects a subset slice of the CatalogSource. When converting a CatalogSource to a mesh object, only the objects where the Selection column is True will be painted to the mesh. Default value: True
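For example, a minimal sketch showing these default values and using Selection to restrict which objects are painted (assuming a UniformCatalog, as elsewhere in this document):

from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

print(cat.compute(cat['Weight'][:3]))     # defaults to 1.0
print(cat.compute(cat['Value'][:3]))      # defaults to 1.0
print(cat.compute(cat['Selection'][:3]))  # defaults to True

# e.g., only objects in the lower half of the box along z will be painted
cat['Selection'] = cat['Position'][:,2] < 0.5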

Storing Meta-data

For all CatalogSource objects, the input parameters and additional meta-data are stored in the attrs dictionary attribute.
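For example, a minimal sketch (the exact keys stored in attrs depend on the catalog type):

from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
print(cat.attrs)   # input parameters, such as the box size and random seed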

API

For more information about specific catalog objects, please see the API section.

Reading Catalogs from Disk

Supported Data Formats

nbodykit provides support for initializing CatalogSource objects by reading tabular data stored on disk in a variety of formats:

In this section, we provide short examples illustrating how to read data stored in each of these formats. If your data format is not currently supported, you can either read the data in yourself and adapt it into a catalog (see Adapting in memory data to a Catalog), or write your own catalog class (see Reading a Custom Data Format).

Plaintext Data

Reading data stored as columns in plaintext files is supported via the CSVCatalog class. This class partitions the CSV file into chunks, and data is only read from the relevant chunks of the file, using the pandas.read_csv() function. The class accepts any of the configuration keywords accepted by this function. The partitioning step provides a significant speed-up when reading from the end of the file, since the entirety of the data does not need to be read first.

Caveats

  • By default, the class reads space-separated columns, but this can be changed by setting delim_whitespace=False and changing the delimiter keyword

  • A pandas index column is not supported – all columns should represent data columns to read.

  • Commented lines in the file are not supported – please remove all comments from the file before loading into nbodykit.

  • There should not be a header line in the file – column names should be passed to CSVCatalog via the names argument.

As an example, below we generate 5 columns for 100 fake objects and write to a plaintext file:

[2]:
import numpy
from nbodykit.source.catalog import CSVCatalog

# generate some fake ASCII data
data = numpy.random.random(size=(100,5))

# save to a plaintext file
numpy.savetxt('csv-example.txt', data, fmt='%.7e')

# name each of the 5 input columns
names =['x', 'y', 'z', 'w', 'v']

# read the data
f = CSVCatalog('csv-example.txt', names)

# combine x, y, z to Position, and add boxsize
f['Position'] = f['x'][:, None] * [1, 0, 0] + f['y'][:, None] * [0, 1, 0] + f['z'][:, None] * [0, 0, 1]
f.attrs['BoxSize'] = 1.0

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example.txt, dataset=*, ncolumns=5, shape=(100,)>, ... 1 files))
columns =  ['Position', 'Selection', 'Value', 'Weight', 'v', 'w', 'x', 'y', 'z']
total size =  100
Binary Data

The BinaryCatalog object reads binary data that is stored on disk in a column-major format. The class can read any numpy data type and can handle arbitrary byte offsets between columns.

Caveats

  • Columns must be stored in consecutive order in the binary file (column-major format).

For example, below we save Position and Velocity columns to a binary file and load them into a BinaryCatalog:

[3]:
from nbodykit.source.catalog import BinaryCatalog

# generate some fake data and save to a binary file
with open('binary-example.dat', 'wb') as ff:
    pos = numpy.random.random(size=(1024, 3)) # fake Position column
    vel = numpy.random.random(size=(1024, 3)) # fake Velocity column
    pos.tofile(ff); vel.tofile(ff); ff.seek(0)

# create the binary catalog
f = BinaryCatalog(ff.name, [('Position', ('f8', 3)), ('Velocity', ('f8', 3))], size=1024)

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BinaryCatalog(size=1024, FileStack(BinaryFile(path=/tmp/tmpedljiijj/binary-example.dat, dataset=*, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size =  1024
HDF Data

The HDFCatalog object uses the h5py module to read HDF5 files. The class supports reading columns stored in h5py.Dataset objects and in h5py.Group objects, assuming that all arrays are of the same length since catalog objects must have a fixed size. Columns stored in different datasets or groups can be accessed via their full path in the HDF5 file.

Caveats

  • HDFCatalog attempts to load all possible datasets or groups from the HDF5 file. This can present problems if the data has different lengths. Use the exclude keyword to explicitly exclude data that has the wrong size.
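For example, a hedged sketch of using the exclude keyword (the file name and the 'BadData' dataset are hypothetical):

from nbodykit.source.catalog import HDFCatalog

# skip a dataset whose length does not match the other columns
f = HDFCatalog('my-data.hdf5', exclude=['BadData'])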

In the example below, we load fake data from both the dataset “Data1” and from the group “Data2” in an example HDF5 file. “Data1” is a single structured numpy array with Position and Velocity columns, while “Data2” is a group storing the Position and Velocity columns separately. nbodykit is able to load both types of data from HDF5 files, and the corresponding column names are the full paths of the data in the file.

[4]:
import h5py
from nbodykit.source.catalog import HDFCatalog

# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)

# write to a HDF5 file
with h5py.File('hdf-example.hdf5' , 'w') as ff:
    ff.create_dataset('Data1', data=dset)
    grp = ff.create_group('Data2')
    grp.create_dataset('Position', data=dset['Position']) # column as dataset
    grp.create_dataset('Mass', data=dset['Mass']) # column as dataset

# intitialize the catalog
f = HDFCatalog('hdf-example.hdf5')

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
HDFCatalog(size=1024, FileStack(HDFFile(path=/tmp/tmpedljiijj/hdf-example.hdf5, dataset=/, ncolumns=4, shape=(1024,)>, ... 1 files))
columns =  ['Data1/Mass', 'Data1/Position', 'Data2/Mass', 'Data2/Position', 'Selection', 'Value', 'Weight']
total size =  1024
Bigfile Data

The bigfile package is a massively parallel IO library for large, hierarchical datasets, and nbodykit supports reading data stored in this format using BigFileCatalog.

Caveats

  • Similar to the HDFCatalog class, datasets of the wrong size stored in a bigfile format should be explicitly excluded using the exclude keyword.

Below, we load Position and Velocity columns, stored in the bigfile format:

[5]:
import bigfile
from nbodykit.source.catalog import BigFileCatalog

# generate some fake data
data = numpy.empty(512, dtype=[('Position', ('f8', 3)), ('Velocity', ('f8',3))])
data['Position'] = numpy.random.random(size=(512, 3))
data['Velocity'] = numpy.random.random(size=(512,3))

# save fake data to a BigFile
with bigfile.BigFile('bigfile-example', create=True) as tmpff:
    with tmpff.create("Position", dtype=('f4', 3), size=512) as bb:
        bb.write(0, data['Position'])
    with tmpff.create("Velocity", dtype=('f4', 3), size=512) as bb:
        bb.write(0, data['Velocity'])
    with tmpff.create("Header") as bb:
        bb.attrs['Size'] = 512.

# initialize the catalog
f = BigFileCatalog('bigfile-example', header='Header')

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BigFileCatalog(size=512, FileStack(BigFile(path=/tmp/tmpedljiijj/bigfile-example, dataset=./, ncolumns=2, shape=(512,)>, ... 1 files))
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size =  512
FITS Data

The FITS data format is supported via the FITSCatalog object. nbodykit relies on the fitsio package to perform the read operation.

Caveats

  • The FITS file must contain a readable binary table of data.

  • Specific extensions to read can be passed via the ext keyword. By default, data is read from the first HDU that has readable data.

For example, below we load Position and Velocity data from a FITS file:

[6]:
import fitsio
from nbodykit.source.catalog import FITSCatalog

# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)

# write to a FITS file using fitsio
fitsio.write('fits-example.fits', dset, extname='Data')

# initialize the catalog
f = FITSCatalog('fits-example.fits', ext='Data')

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
FITSCatalog(size=1024, FileStack(FITSFile(path=/tmp/tmpedljiijj/fits-example.fits, dataset=Data, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024

Reading Multiple Data Files at Once

CatalogSource objects support reading multiple files at once, providing a continuous view of each individual catalog stacked together. Each file read must contain the same data types; otherwise, the data cannot be combined into a single catalog.

This becomes particularly useful when the user has data split into multiple files in a single directory, as is often the case when processing large amounts of data. For example, output binary snapshots from N-body simulations, often totaling 10GB - 100GB in size, can be read into a single BinaryCatalog with nbodykit.

When specifying multiple files to load, the user can use either an explicit list of file names or use an asterisk glob pattern to match files. As an example, below, we read data from two plaintext files into a single CSVCatalog:

[7]:
# generate data
data = numpy.random.random(size=(100,5))

# save first 40 rows of data to file
numpy.savetxt('csv-example-1.txt', data[:40], fmt='%.7e')

# save the remaining 60 rows to another file
numpy.savetxt('csv-example-2.txt', data[40:], fmt='%.7e')
Using a glob pattern
[8]:
# the names of the columns in both files
names =['a', 'b', 'c', 'd', 'e']

# read with a glob pattern
f = CSVCatalog('csv-example-*', names)

print(f)

# combined catalog size is 40+60=100
print("total size = ", f.csize)

CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size =  100
Using a list of file names
[9]:
# the names of the columns in both files
names =['a', 'b', 'c', 'd', 'e']

# read with a list of the file names
f = CSVCatalog(['csv-example-1.txt', 'csv-example-2.txt'], names)

print(f)

# combined catalog size is 40+60=100
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size =  100
Adapting in memory data to a Catalog

A lightweight way of reading in data that nbodykit does not understand is to read the data with existing tools and then adapt it into a catalog via nbodykit.lab.ArrayCatalog.

Here is an example that reads in a file with numpy’s load and then makes it into a catalog.

[10]:
from nbodykit.source.catalog import ArrayCatalog

# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)

# save to a npy file
numpy.save("npy-example.npy", data)

data = numpy.load("npy-example.npy")

# initialize the catalog
f = ArrayCatalog(data)

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)


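# alternatively, initialize the catalog directly from a dictionary of arrays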
f = ArrayCatalog({'Position' : data['Position'], 'Mass' : data['Mass'] })

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
ArrayCatalog(size=1024)
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024
ArrayCatalog(size=1024)
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024

Reading a Custom Data Format

Users can implement their own subclasses of CatalogSource for reading custom data formats with a few easy steps. The core functionality of the CatalogSource classes described in this section use the nbodykit.io module for reading data from disk. This module implements the nbodykit.io.base.FileType base class, which is an abstract class that behaves like a file-like object. For the built-in file formats discussed in this section, we have implemented the following subclasses of FileType in the nbodykit.io module: CSVFile, BinaryFile, BigFile, HDFFile, and FITSFile.

To make a valid subclass of FileType, users must:

  1. Implement the read() function that reads a range of the data from disk.

  2. Set the size in the __init__() function, specifying the total size of the data on disk.

  3. Set the dtype in the __init__() function, specifying the type of data stored on disk.

Once we have the custom subclass implemented, the nbodykit.source.catalog.file.FileCatalogFactory() function can be used to automatically create a custom CatalogSource object from the subclass.

As a toy example, we will illustrate how this is done for data saved using the numpy .npy format. First, we will implement our subclass of the FileType class:

[11]:
from nbodykit.io.base import FileType

class NPYFile(FileType):
    """
    A file-like object to read numpy ``.npy`` files
    """
    def __init__(self, path):
        self.path = path
        self.attrs = {}
        # load the data and set size and dtype
        self._data = numpy.load(self.path)
        self.size = len(self._data) # total size
        self.dtype = self._data.dtype # data dtype

    def read(self, columns, start, stop, step=1):
        """
        Read the specified column(s) over the given range
        """
        return self._data[start:stop:step]

And now generate the subclass of CatalogSource:

[12]:
from nbodykit.source.catalog.file import FileCatalogFactory

NPYCatalog = FileCatalogFactory('NPYCatalog', NPYFile)

And finally, we generate some fake data, save it to a .npy file, and then load it with our new NPYCatalog class:

[13]:
# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)

# save to a npy file
numpy.save("npy-example.npy", data)

# and now load the data
f = NPYCatalog("npy-example.npy")

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
NPYCatalog(size=1024, FileStack(NPYFile(path=/tmp/tmpedljiijj/npy-example.npy, dataset=None, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024

This toy example illustrates how custom data formats can be incorporated into nbodykit, but users should take care to optimize their storage solutions for more complex applications. In particular, data storage formats that are stored in column-major format and allow data slices from arbitrary locations to be read should be favored. This enables large speed-ups when reading data in parallel. In contrast, our simple toy example class NPYFile reads the entirety of the data before returning a certain slice in the read() function. In general, this should be avoided if at all possible.
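As an illustration of this point, here is a hedged sketch of a variant (a hypothetical MMapNPYFile class) that memory-maps the .npy file so that read() only touches the requested rows:

class MMapNPYFile(FileType):
    """
    A variant of the toy NPYFile that memory-maps the data,
    so that read() only loads the requested slice from disk.
    """
    def __init__(self, path):
        self.path = path
        self.attrs = {}
        # memory-map the file; no data is read until it is sliced
        self._data = numpy.load(self.path, mmap_mode='r')
        self.size = len(self._data)      # total size
        self.dtype = self._data.dtype    # data dtype

    def read(self, columns, start, stop, step=1):
        # only the requested range is read from disk here
        return numpy.asarray(self._data[start:stop:step])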

On Demand IO via dask.array

nbodykit uses the dask package to store the columns in CatalogSource objects. The dask package implements a dask.array.Array object that mimics the interface of the more familiar numpy array. In this section, we describe what exactly a dask array is and how it is used in nbodykit.

What is a dask array?

In nbodykit, the dask array object is a data container that behaves nearly identical to a numpy array, except for one key difference. When performing manipulations on a numpy array, the operations are performed immediately. This is not the case for dask arrays. Instead, dask arrays store these operations in a task graph and only evaluate the operations when the user specifies to via a call to a compute() function. When using nbodykit, often the first task in this graph is loading data from disk. Thus, dask provides nbodykit with on-demand IO functionality, allowing the user to control when data is read from disk.

It is useful to describe a bit more about the nuts and bolts of the dask array to illustrate its full power. The dask array object cuts up the full array into many smaller arrays and performs calculations on these smaller “chunks”. This allows array computations to be performed on large data that does not fit into memory (but can be stored on disk). Particularly useful on laptops and other systems with limited memory, it extends the maximum size of useable datasets from the size of memory to the size of the disk storage. For further details, please see the introduction to the dask array in the dask documentation.

By Example

The dask array functionality is best illustrated by example. Here, we initialize a UniformCatalog that generates objects with uniformly distributed position and velocity columns.

[1]:
from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
[2]:
print("catalog = ", cat)
catalog =  UniformCatalog(size=96, seed=42)
[3]:
print("Position = ", cat['Position'])
Position =  dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.45470105 0.83263203 0.06905134] last: [0.62474599 0.15388738 0.84302209]

We see that the Position column can be accessed by indexing the catalog with the column name and that the returned object is not a numpy array but a dask array. The dask array has the same shape (96,3) and dtype (‘f8’) as the underlying numpy array but also includes the chunksize attribute. This attribute specifies the size of the internal chunks that dask uses to examine arrays in smaller pieces. In this case, the data size is small enough that only a single chunk is needed.

The dask.array module

The dask.array module provides much of the same functionality as the numpy module, but with functions optimized to perform operations on dask arrays.

For example, we can easily compute the minimum and maximum position coordinates using the dask.array.min() and dask.array.max() functions.

[4]:
import dask.array as da

pos = cat['Position']
minpos = da.min(pos, axis=0)
maxpos = da.max(pos, axis=0)

print("minimum position coordinates = ", minpos)
print("maximum position coordinates = ", maxpos)
minimum position coordinates =  dask.array<amin-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
maximum position coordinates =  dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>

Here, we see that the result of our calls to dask.array.min() and dask.array.max() are also stored as dask arrays. The task has not yet been performed but instead added to the internal dask task graph.

For a full list of the available functionality, please see the dask array documentation. A large subset of the most commonly used functions in numpy have implementations in the dask.array module. In addition to these functions, dask arrays support the usual array arithmetic operations. For example, to rescale the position coordinate array, use

[5]:
BoxSize = 2500.0
pos *= BoxSize

rescaled_minpos = da.min(pos, axis=0)
rescaled_maxpos = da.max(pos, axis=0)
Evaluating a dask array

The CatalogSource.compute() function computes a dask array and returns the result of the internal series of tasks, either a numpy array or float. For example, we can compute the minimum and maximum of the position coordinates using:

[6]:
minpos, maxpos = cat.compute(minpos, maxpos)
print("minimum position coordinates = ", minpos)
print("maximum position coordinates = ", maxpos)
minimum position coordinates =  [0.00402579 0.00015685 0.00271747]
maximum position coordinates =  [0.9927406  0.99610592 0.99925086]

And similarly, we see the result of the rescaling operation earlier:

[7]:
minpos, maxpos = cat.compute(rescaled_minpos, rescaled_maxpos)
print("minimum re-scaled position coordinates = ", minpos)
print("maximum re-scaled position coordinates = ", maxpos)
minimum re-scaled position coordinates =  [10.06446279  0.39212416  6.79367111]
maximum re-scaled position coordinates =  [2481.85149744 2490.26480949 2498.12715085]

Caching with Dask

nbodykit automatically caches tasks computed when evaluating dask arrays. The global cache is controlled via the nbodykit.GlobalCache class. Often the most expensive task when evaluating a dask array is loading the data from disk. By using dask’s caching features, CatalogSource objects are able to cache intermediate results, such that repeated calls to CatalogSource.compute() do not repeat expensive IO operations.

The global cache has a fixed size, set by default to a reasonable (not too large) value. The default size is controlled by the global_cache_size keyword, which can be changed via the set_options function. Users can resize the global cache using:

from nbodykit import GlobalCache
GlobalCache.resize(2e9) # set cache size to 2 GB
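A minimal sketch of changing the default instead via the set_options function and the global_cache_size keyword mentioned above:

from nbodykit import set_options
set_options(global_cache_size=2e9)  # set the default global cache size to 2 GB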

Note

When accessing columns of a CatalogSource, the returned dask array also has a compute() function. When using this function to evaluate dask arrays, internal caching will also be used. So users have the option of using CatalogSource.compute() or the compute() attached to each dask array.
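For example, both routes below should return the same numpy array (a minimal sketch, reusing the catalog from the example above):

pos1 = cat.compute(cat['Position'])   # evaluate via the catalog
pos2 = cat['Position'].compute()      # evaluate via the dask array itself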

Examining Larger-than-Memory Data

CatalogSource objects automatically take advantage of the chunking features of the dask array, greatly reducing the difficulties of analyzing larger-than-memory data. When combined with the ability of the CatalogSource object to provide a continuous view of multiple files at once, we can analyze large amounts of data from a single catalog with ease.

A common use case is examining a directory of large binary outputs from a N-body simulation on a laptop. Often the user wishes to select a smaller subsample of the catalog or perform some trivial data inspection to verify the accuracy of the data. These tasks become straightforward with nbodykit, using the functionality provided by the CatalogSource object and the dask package. Regardless of the size of the data that the user is loading, the nbodykit CatalogSource interface remains the same.
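A hedged sketch of this workflow (the snapshot path is hypothetical; substitute the appropriate catalog class and column layout for your simulation format):

from nbodykit.lab import BigFileCatalog
import numpy

# a continuous view of all snapshot chunks in a (hypothetical) directory
cat = BigFileCatalog('snapshots/PART_000/', header='Header')

# keep a random ~1% subsample for quick inspection on a laptop
select = numpy.random.uniform(size=cat.size) < 0.01
subcat = cat[select]

# trivial data inspection; only the needed chunks are read when computing
print("subsample size = ", subcat.csize)
print("min position = ", subcat.compute(subcat['Position'].min(axis=0)))
print("max position = ", subcat.compute(subcat['Position'].max(axis=0)))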

Common Data Operations

Here, we detail some of the most common operations when dealing with data in the form of a CatalogSource. The native format for data columns in a CatalogSource object is the dask array. Be sure to read the previous section for an introduction to dask arrays before proceeding.

The dask array format allows users to easily manipulate columns in their input data and feed any transformed data into one of the nbodykit algorithms. This provides a fast and easy way to transform the data while hiding the implementation details needed to compute these transformations internally. In this section, we’ll provide examples of some of these data transformations to get users acclimated to dask arrays quickly.

To help illustrate these operations, we’ll initialize the nbodykit “lab” and load a catalog of uniformly distributed objects.

[2]:
from nbodykit.lab import *
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

Accessing Data Columns

Specific columns can be accessed by indexing the catalog object using the column name, and a dask.array.Array object is returned (see What is a dask array? for more details on dask arrays).

[3]:
position = cat['Position']
velocity = cat['Velocity']

print(position)
print(velocity)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.45470105 0.83263203 0.06905134] last: [0.62474599 0.15388738 0.84302209]
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.0006346  0.00675438 0.00704942] last: [0.00375581 0.00046149 0.00819726]

While in the format of the dask array, data columns can easily be manipulated by the user. For example, here we normalize the position coordinates to the range 0 to 1 by dividing by the box size:

[4]:
# normalize the position
normed_position = position / cat.attrs['BoxSize']

print(normed_position)
dask.array<truediv, shape=(96, 3), dtype=float64, chunksize=(96, 3)>

Note that the normalized position array is also a dask array and that the actual normalization operation is yet to occur. This makes these kinds of data transformations very fast for the user.

Computing Data Columns

Columns can be converted from dask.array.Array objects to numpy arrays using the compute() function (see Evaluating a dask array for further details on computing dask arrays).

[5]:
position, velocity = cat.compute(cat['Position'], cat['Velocity'])

print(type(position))
print(type(velocity))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

We can also compute the max of the normalized position coordinates from the previous section:

[6]:
maxpos = normed_position.max(axis=0)
print(maxpos)

print(cat.compute(maxpos))
dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
[0.9927406  0.99610592 0.99925086]

Adding New Columns

New columns can be easily added to a CatalogSource object by directly setting them:

[7]:
# no "Mass" column originally
print("contains 'Mass'? :", 'Mass' in cat)

# add a random array as the "Mass" column
cat['Mass'] = numpy.random.random(size=len(cat))

# "Mass" exists!
print("contains 'Mass'? :", 'Mass' in cat)

# can also add scalar values -- converted to correct length
cat['Type'] = b"central"

print(cat['Mass'])
print(cat['Type'])
contains 'Mass'? : False
contains 'Mass'? : True
dask.array<array, shape=(96,), dtype=float64, chunksize=(96,)> first: 0.3745401188473625 last: 0.49379559636439074
dask.array<array, shape=(96,), dtype=|S7, chunksize=(96,)> first: b'central' last: b'central'

Here, we have added two new columns to the catalog, Mass and Type. Internally, nbodykit stores the new columns as dask arrays and will automatically convert them to the correct type if they are not already.

Caveats

  • New columns must either be a scalar value or an array with the same length as the catalog. Scalar values will automatically be broadcast to the correct length.

  • Setting a column of the wrong length will raise an exception.

Overwriting Columns

The same syntax used for adding new columns can also be used to overwrite columns that already exist in a CatalogSource. This procedure works as one would expect – the most up-to-date values of columns are always used in operations.

In the example below we overwrite both the Position and Velocity columns, and each time the columns are accessed, the most up-to-date values are used, as expected.

[8]:
# some fake data
data = numpy.ones(5, dtype=[
      ('Position', ('f4', 3)),
      ('Velocity', ('f4', 3))]
      )

# initialize a catalog directly from the structured array
src = ArrayCatalog(data)

# overwrite the Velocity column
src['Velocity'] = src['Position'] + src['Velocity'] # 1 + 1 = 2

# overwrite the Position column
src['Position'] = src['Position'] + src['Velocity'] # 1 + 2 = 3

print("Velocity = ", src.compute(src['Velocity'])) # all equal to 2
print("Position = ", src.compute(src['Position'])) # all equal to 3
Velocity =  [[2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]
 [2. 2. 2.]]
Position =  [[3. 3. 3.]
 [3. 3. 3.]
 [3. 3. 3.]
 [3. 3. 3.]
 [3. 3. 3.]]

Adding Redshift-space Distortions

A useful operation in large-scale structure is the mapping of positions in simulations from real space to redshift space, referred to as redshift space distortions (RSD). This operation can be easily performed by combining the Position and Velocity columns to overwrite the Position column. As first found by Kaiser 1987, the mapping from real to redshift space is:

\[s = r + \frac{\mathbf{v} \cdot \hat{\mathbf{n}}}{a H},\]

where \(r\) is the line-of-sight position in real space, \(s\) is the line-of-sight position in redshift space, \(\mathbf{v}\) is the velocity vector, \(\hat{\mathbf{n}}\) is the line-of-sight unit vector, \(a\) is the scale factor, and \(H\) is the Hubble parameter at \(a\).

As an example, below we add RSD along the z axis of a simulation box:

[9]:
# apply RSD along the z axis
line_of_sight = [0,0,1]

# redshift and cosmology
redshift =  0.55; cosmo = cosmology.Cosmology(h=0.7).match(Omega0_m=0.31)

# the RSD normalization factor
rsd_factor = (1+redshift) / (100 * cosmo.efunc(redshift))

# update Position, applying RSD
src['Position'] = src['Position'] + rsd_factor * src['Velocity'] * line_of_sight

The RSD normalization factor above is the inverse of the conformal Hubble parameter \(\mathcal{H} = a H(a)\), expressed in units of \(\mathrm{Mpc}/h\) per km/s. This calculation requires a cosmology, which can be specified via the Cosmology class. We use the efunc() function, which returns \(E(z)\), where the Hubble parameter is defined as \(H(z) = 100h\ E(z)\) in units of km/s/Mpc. Note that the operation above assumes the Position column is in units of \(\mathrm{Mpc}/h\).

For catalogs in nbodykit that generate mock data, such as the log-normal catalogs or HOD catalogs, there is an additional column, VelocityOffset, available to facilitate RSD operations. This column has units of \(\mathrm{Mpc}/h\) and already includes the rsd_factor above, which allows users to add RSD simply by using:

src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight

Selecting a Subset

A subset of a CatalogSource object can be selected using slice notation. There are two ways to select a subset:

  1. use a boolean array, which specifies which rows of the catalog to select

  2. use a slice object specifying which rows of the catalog to select

For example,

[10]:
# boolean selection array
select = cat['Mass'] < 0.5
print("number of True entries = ", cat.compute(select.sum()))

# select only entries where select = True
subcat = cat[select]

print("size of subcat = ", subcat.size)

# select the first ten rows
subcat = cat[:10]
print("size of subcat = ", subcat.size)

# select first and last row
subcat = cat[[0, -1]]
print("size of subcat = ", subcat.size)
number of True entries =  50
size of subcat =  50
size of subcat =  10
size of subcat =  2

Caveats

  • When indexing with a boolean array, the array must have the same length as the size attribute, or an exception will be raised.

  • Selecting a single object by indexing with an integer is not supported. If the user wishes to select a single row, a list of length one can be used to select the specific row.
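
To make the second caveat concrete, here is a minimal sketch (the exact exception type may differ):

# integer indexing is not supported and raises an exception
try:
    row = cat[0]
except Exception as e:
    print("integer indexing failed: ", type(e).__name__)

# use a length-one list to select a single row instead
single = cat[[0]]
print("size of single-row catalog = ", single.size)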

Selecting a Subset of Columns from a CatalogSource

A subset of columns can be selected from a CatalogSource object by indexing the catalog with a list of the names of the desired columns. For example,

[11]:
print("columns in catalog = ", cat.columns)

# select Position + Mass
subcat = cat[['Position', 'Mass']]

# the selected columns + default columns
print("columns in subset = ", subcat.columns)
columns in catalog =  ['Mass', 'Position', 'Selection', 'Type', 'Value', 'Velocity', 'Weight']
columns in subset =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']

Caveats

  • When selecting a subset of columns, note that in addition to the desired columns, the sub-catalog will also contain the default columns (Weight, Value, and Selection).

The nbodykit.transform module

The nbodykit.transform module includes several commonly used functions for convenience. We describe a few of the most common use cases in the sub-sections below.

Note

The transform module is available to users when from nbodykit.lab import * is executed.

Concatenating CatalogSource Objects

When CatalogSource objects have the same columns, they can be concatenated together into a single object using the nbodykit.transform.ConcatenateSources() function. For example,

[12]:
cat1 = UniformCatalog(nbar=50, BoxSize=1.0, seed=42)
cat2 = UniformCatalog(nbar=150, BoxSize=1.0, seed=42)

combined = transform.ConcatenateSources(cat1, cat2)

print("total size = %d + %d = %d" %(cat1.size, cat2.size, combined.size))
total size = 47 + 145 = 192
Stacking Columns Together

Another common use case is when users need to combine separate data columns together as the columns of a new array. For example, often the Cartesian position coordinates x, y, and z are stored as separate columns, and the Position column must be added to a catalog from these individual columns. We provide the nbodykit.transform.StackColumns() function for this exact purpose. For example,

[13]:
# fake position data
data = numpy.random.random(size=(5,3))

# save to a plaintext file
numpy.savetxt('csv-example.dat', data, fmt='%.7e')

# the cartesian coordinates
names =['x', 'y', 'z']

# read the data
f = CSVCatalog('csv-example.dat', names)

# make the "Position" column
f['Position'] =  transform.StackColumns(f['x'], f['y'], f['z'])

print(f['Position'])
print(f.compute(f['Position']))
dask.array<transpose, shape=(5, 3), dtype=float64, chunksize=(5, 1)> first: [0.52273283 0.42754102 0.02541913] last: [0.22879817 0.07697991 0.28975145]
[[0.52273283 0.42754102 0.02541913]
 [0.10789143 0.03142919 0.63641041]
 [0.31435598 0.50857069 0.90756647]
 [0.24929223 0.41038292 0.75555114]
 [0.22879817 0.07697991 0.28975145]]
Converting from Sky to Cartesian Coordinates

We provide the function nbodykit.transform.SkyToCartesian() for converting sky coordinates, in the form of right ascension, declination, and redshift, to Cartesian coordinates. The conversion from redshift to comoving distance requires a cosmology instance, which can be specified via the Cosmology class.

Below, we initialize a catalog holding random right ascension, declination, and redshift coordinates, and then add the Cartesian position as the Position column.

The random number generator provided by a RandomCatalog always generates the correct number of items for the catalog. Use ? in IPython to view its docstring.

[14]:
src = RandomCatalog(100, seed=42)

# add random (ra, dec, z) coordinates
src['z'] = src.rng.normal(loc=0.5, scale=0.1)
src['ra'] = src.rng.uniform(low=0, high=360)
src['dec'] = src.rng.uniform(low=-90, high=90.)

# initialize a set of cosmology parameters
cosmo = cosmology.Cosmology(h=0.7)

# add the position
src['Position'] = transform.SkyToCartesian(src['ra'], src['dec'], src['z'], degrees=True, cosmo=cosmo)

Caveats

  • Whether the right ascension and declination arrays are in degrees (as opposed to radians) should be specified via the degrees keyword.

  • The units of the returned Position column are \(\mathrm{Mpc}/h\).
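
As a rough consistency check, the norm of the returned Position should match the comoving distance to each object's redshift; the sketch below assumes that the Cosmology object's comoving_distance() method returns distances in \(\mathrm{Mpc}/h\):

# radial comoving distance from the Cartesian coordinates
pos = src['Position']
r = src.compute(((pos**2).sum(axis=-1))**0.5)

# compare to the comoving distance computed directly from the redshifts
z = src.compute(src['z'])
print(numpy.allclose(r, cosmo.comoving_distance(z)))  # expected to print True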

Using the dask.array module

For more general column transformations, users should take advantage of the dask.array module, which implements most functions in the numpy package in a manner optimized for dask arrays. The module can be accessed from the nbodykit.transform module as nbodykit.transform.da.

Important

For a full list of functions available in the dask.array module, please see the dask array documentation. We strongly recommend that new users read through this documentation and familiarize themselves with the functionality provided by the dask.array module.

As a simple illustration, below we convert an array holding right ascension values from degrees to radians, compute the sine of the array, and find the min and max values using functions available in the dask.array module.

[15]:
ra = transform.da.deg2rad(src['ra']) # from degrees to radians
sin_ra = transform.da.sin(ra) # compute the sine

print("min(sin(ra)) = ", src.compute(sin_ra.min()))
print("max(sin(ra)) = ", src.compute(sin_ra.max()))
min(sin(ra)) =  -0.999907640154132
max(sin(ra)) =  0.9988053754673173

Generating Catalogs of Mock Data

nbodykit includes several methods for generating mock catalogs, with varying levels of sophistication. These CatalogSource objects allow users to create catalogs of objects at run time; the available options are described in the sub-sections below.

Randomly Distributed Objects

nbodykit includes two subclasses of CatalogSource that generate particles randomly in a box: RandomCatalog and UniformCatalog. While these catalogs do not produce realistic cosmological distributions of objects, they are especially useful for generating catalogs quickly and for testing purposes.

RandomCatalog

The RandomCatalog class includes a random number generator with the functionality of numpy.random.RandomState that generates random numbers in parallel and in a manner that is independent of the number of MPI ranks being used. This property is especially useful for running reproducible tests where the number of CPUs might vary. The random number generator is stored as the rng attribute. Users can use this random number generator to add columns to the catalog, using the syntax described in Adding New Columns.

For example,

[1]:
from nbodykit.lab import RandomCatalog
import numpy

# initialize a catalog with only the default columns
cat = RandomCatalog(csize=100) # collective size of 100
print("columns = ", cat.columns) # only the default columns present

# add mass uniformly distributed in log10
cat['Mass'] = 10**(cat.rng.uniform(12, 15))

# add normally distributed velocity
cat['Velocity'] = cat.rng.normal(loc=0, scale=500)

print(cat.columns)
columns =  ['Selection', 'Value', 'Weight']
['Mass', 'Selection', 'Value', 'Velocity', 'Weight']

Caveats

  • For a list of the full functionality of the rng attribute, please see the API documentation for MPIRandomState.

  • When adding columns, the new column must have the same length as the local size of the catalog, as specified by the size attribute. Most functions of the rng attribute accept the size keyword to generate an array of the correct size.

UniformCatalog

The UniformCatalog is a subclass of RandomCatalog that includes Position and Velocity columns that are uniformly distributed. The positions of the particles are uniformly distributed between zero and the size of the box (as specified by the user), with an input number density. The velocities are also uniformly distributed but on a scale that is 1% of the size of the box.

For example,

[2]:
from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
print("columns = ", cat.columns)

# min must be greater than 0
print("position minimum = ", cat.compute(cat['Position'].min()))

# max must be less than 1.0
print("position maximum = ", cat.compute(cat['Position'].max()))

# min must be greater than 0
print("velocity minimum = ", cat.compute(cat['Velocity'].min()))

# max must be less than 0.01
print("velocity maximum = ", cat.compute(cat['Velocity'].max()))
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
position minimum =  0.00015684966390538957
position maximum =  0.9992508603400948
velocity minimum =  2.4238347118806793e-05
velocity maximum =  0.00999869916263692

Note that because UniformCatalog is a subclass of RandomCatalog users can also use the rng attribute to add new columns to a UniformCatalog object.

Log-normal Mocks

The LogNormalCatalog offers a more realistic approximation of cosmological large-scale structure. The class generates a set of objects by Poisson sampling a log-normal density field, using the Zel’dovich approximation to model non-linear evolution. Given a linear power spectrum function, redshift, and linear bias supplied by the user, this class performs the following steps:

Generate Gaussian initial conditions

First, a Gaussian overdensity field \(\delta_L(\vk)\) is generated in Fourier space with a power spectrum given by a function specified by the user. We also generate linear velocity fields from the overdensity field in Fourier space, using

\[\vv(\vk) = i f a H \delta_L(\vk) \frac{\vk}{k^2}.\]

where \(f\) is the logarithmic growth rate, \(a\) is the scale factor, and \(H\) is the Hubble parameter at \(a\). Note that bold variables reflect vector quantities.

Finally, we Fourier transform \(\delta_L(\vk)\) and \(\vv(\vk)\) to configuration space. These fields serve as the “initial conditions”. They will be evolved forward in time and Poisson sampled to create the final catalog.

Perform the log-normal transformation

Next, we perform a log-normal transformation on the density field \(\delta\). As first discussed in Coles and Jones 1991, the distribution of galaxies on intermediate to large scales can be well-approximated by a log-normal distribution. An additional useful property of a log-normal field is that it follows the natural constraint \(\delta(\vx) \ge -1\) of density contrasts by definition. This property does not generally hold true for Gaussian realizations of the overdensity field.

The new, transformed density field is given by

\[\delta(\vx) = e^{-\sigma^2 + b_L \delta_L(\vx)} - 1,\]

where the normalization factor \(\sigma^2\) ensures that the mean of \(\delta(\vx)\) vanishes and \(b_L\) is the Lagrangian bias factor, which is related to the final, linear bias as \(b_L = b_1 - 1\). Here, \(b_1\) is the value input by the user as the bias keyword to the LogNormalCatalog class.
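
The transformation itself is simple to illustrate with plain numpy. The toy sketch below (not nbodykit's internal implementation) applies the transform to a random Gaussian field and enforces the zero-mean normalization by dividing by the sample mean, which plays the role of the \(e^{-\sigma^2}\) factor:

import numpy

# a toy Gaussian overdensity field and a linear bias b1 = 2
delta_L = numpy.random.normal(scale=0.5, size=(16, 16, 16))
b_L = 2.0 - 1.0  # Lagrangian bias, b_L = b1 - 1

# log-normal transform, normalized so that the mean of delta vanishes
field = numpy.exp(b_L * delta_L)
delta = field / field.mean() - 1

print(delta.min() >= -1)           # True: bounded below by -1
print(abs(delta.mean()) < 1e-12)   # True: zero mean by construction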

Poisson sample the density field

We then generate discrete positions of objects by Poisson sampling the overdensity field in each cell of the mesh. We assign each object the velocity of the mesh cell that it is located in, and objects are placed randomly inside their cells. The desired number density of objects in the box is specified by the user as the nbar parameter to the LogNormalCatalog class.

Apply the Zel’dovich approximation

Finally, we evolve the overdensity field according to the Zel’dovich approximation, which is 1st order Lagrangian perturbation theory. To do this, we move the positions of each object according to the linear velocity field,

\[\vr(\vx) = \vr_0(\vx) + \frac{\vv(\vx)}{f a H},\]

where \(\vr(\vx)\) is the final position of the objects, \(\vr_0(\vx)\) is the initial position, and \(\vv(\vx)\) is the velocity assigned in the previous step.

After this step, we have a catalog of discrete objects, with a Position column in units of \(\mathrm{Mpc}/h\), a Velocity column in units of km/s, and a VelocityOffset column in units of \(\mathrm{Mpc}/h\). The VelocityOffset column is provided as a convenience for adding redshift-space distortions (see Adding Redshift-space Distortions), such that RSD can be added using:

line_of_sight = [0,0,1]
src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight
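
Putting these steps together, a LogNormalCatalog can be constructed directly from a linear power spectrum object; the parameter values below (number density, box size, mesh size, and bias) are purely illustrative:

from nbodykit.lab import cosmology, LogNormalCatalog

# linear power spectrum at z = 0.55 using the Eisenstein-Hu transfer function
Plin = cosmology.LinearPower(cosmology.Planck15, redshift=0.55, transfer='EisensteinHu')

# Poisson-sample the log-normal field with bias b1 = 2
cat = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=1380., Nmesh=128, bias=2.0, seed=42)

print(cat.columns)  # includes Position, Velocity, and VelocityOffset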

Note

For examples using log-normal mocks, see the cookbook/lognormal-mocks.ipynb recipe in The Cookbook.

Halo Occupation Distribution Mocks

nbodykit includes functionality to generate mock galaxy catalogs using the Halo Occupation Distribution (HOD) technique via the HODCatalog class. The HOD technique populates a catalog of halos with galaxies based on a functional form for the probability that a halo of mass \(M\) hosts \(N\) objects, \(P(N|M)\). The functional form of the HOD used by HODCatalog is the form used in Zheng et al 2007. The average number of galaxies in a halo of mass \(M\) is

\[\langle N_\mathrm{gal}(M) \rangle = N_\mathrm{cen}(M) \left [ 1 + N_\mathrm{sat}(M) \right],\]

where the occupation functions for centrals and satellites are given by

\[N_\mathrm{cen}(M) = \frac{1}{2} \left(1 + \mathrm{erf} \left[ \frac{\log_{10}M - \log_{10}M_\mathrm{min}}{\sigma_{\log_{10}M}} \right] \right),\]

\[N_\mathrm{sat}(M) = \left( \frac{M - M_0}{M_1} \right)^\alpha.\]

This HOD parametrization has 5 parameters, which can be summarized as:

Parameter                      Name          Default   Description
\(\log_{10}M_\mathrm{min}\)    logMmin       13.031    Minimum mass required for a halo to host a central galaxy
\(\sigma_{\log_{10}M}\)        sigma_logM    0.38      Rate of transition from \(N_\mathrm{cen}=0\) to \(N_\mathrm{cen}=1\)
\(\alpha\)                     alpha         0.76      Power law slope of the relation between halo mass and \(N_\mathrm{sat}\)
\(\log_{10}M_0\)               logM0         13.27     Low-mass cutoff in \(N_\mathrm{sat}\)
\(\log_{10}M_1\)               logM1         14.08     Characteristic halo mass where \(N_\mathrm{sat}\) begins to assume a power law form

The default values of the HOD parameters are taken from Reid et al. 2014.
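
To make the occupation functions concrete, the short sketch below evaluates \(N_\mathrm{cen}(M)\) and \(\langle N_\mathrm{gal}(M) \rangle\) with the default parameter values from the table above. This is only an illustration of the formulas, not the code path used by HODCatalog, and clipping \(N_\mathrm{sat}\) to zero below \(M_0\) is an assumption of the sketch:

import numpy
from scipy.special import erf

# default HOD parameters from the table above
logMmin, sigma_logM, alpha, logM0, logM1 = 13.031, 0.38, 0.76, 13.27, 14.08

def N_cen(M):
    # mean number of central galaxies in a halo of mass M
    return 0.5 * (1 + erf((numpy.log10(M) - logMmin) / sigma_logM))

def N_sat(M):
    # mean number of satellites; halos below M0 host no satellites (assumed cutoff)
    M0, M1 = 10.0**logM0, 10.0**logM1
    return (numpy.clip(M - M0, 0.0, None) / M1)**alpha

M = 10.0**numpy.array([13.0, 13.5, 14.0, 14.5])  # halo masses in Msun/h
print(N_cen(M))                   # transitions from ~0 to 1 around Mmin
print(N_cen(M) * (1 + N_sat(M)))  # mean total number of galaxies per halo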

This form of the HOD clustering description assumes that the galaxy-halo connection depends only on halo mass. Thus, given a catalog of halo objects with associated mass values, users can quickly generate realistic galaxy catalogs using this class.

Interfacing with halotools via a HaloCatalog

Internally, the HODCatalog class uses the halotools package to perform the halo population step. For further details, see the documentation for halotools.empirical_models.Zheng07Cens and halotools.empirical_models.Zheng07Sats, as well as this tutorial on the Zheng 07 HOD model.

The catalog of halos input to the HODCatalog class must be of type halotools.sim_manager.UserSuppliedHaloCatalog, the tabular data format preferred by halotools. nbodykit includes the HaloCatalog class in order to interface nicely with halotools. In particular, this catalog object includes a to_halotools() function to create a UserSuppliedHaloCatalog from the data columns in the HaloCatalog object.

Given a CatalogSource object, the HaloCatalog object interprets the objects as halos, using a specified redshift, cosmology, and mass definition, to add several analytic columns to the catalog, including Radius() and Concentration().

For example, below we generate uniform particles in a box and then interpret them as halos by specifying a redshift and cosmology:

[3]:
from nbodykit.lab import HaloCatalog, cosmology

# uniform objects in a box
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# add a Mass column to the objects
cat['Mass'] = 10**(cat.rng.uniform(12, 15))

# initialize the halos
halos = HaloCatalog(cat, cosmo=cosmology.Planck15, redshift=0., mdef='vir', position='Position', velocity='Velocity', mass='Mass')

print(halos.columns)
['Concentration', 'Mass', 'Position', 'Radius', 'Selection', 'Value', 'Velocity', 'VelocityOffset', 'Weight']

And using the to_halotools() function, we can create the halotools.sim_manager.UserSuppliedHaloCatalog object needed to initialize the HODCatalog object.

[4]:
halocat = halos.to_halotools()

print(halocat)

print(halocat.halo_table[:10])
<halotools.sim_manager.user_supplied_halo_catalog.UserSuppliedHaloCatalog object at 0x7f78f96c0198>
       halo_x              halo_y       ... halo_upid halo_local_id
------------------- ------------------- ... --------- -------------
0.45470105222137214  0.8326320323149597 ...      -1.0             0
0.31944725103976734 0.48518719297596336 ...      -1.0             1
0.21242627399222258 0.16674683796980805 ...      -1.0             2
 0.3185452432219371  0.3490676632113582 ...      -1.0             3
 0.5066846142238558 0.23705948996927073 ...      -1.0             4
0.47273713613639634  0.8569679300356892 ...      -1.0             5
0.45934116150501314  0.9523800798479475 ...      -1.0             6
 0.9244771851038425  0.8843779857073643 ...      -1.0             7
 0.4553655200041342  0.8086086697963959 ...      -1.0             8
 0.5231105620516908 0.10008694708748456 ...      -1.0             9

Caveats

  • The units of the halo position, velocity, and mass input to HaloCatalog are assumed to be \(\mathrm{Mpc}/h\), km/s, and \(M_\odot/h\), respectively. These units are necessary to interface with halotools.

  • The mass definition input to HaloCatalog can be “vir” to use virial masses, or an overdensity factor with respect to the critical or mean density, i.e. “200c”, “500c”, or “200m”, “500m”.

  • If using the built-in friends-of-friends (FOF) finder class, FOF, to identify halos, the user can use the to_halos() function to directly produce a HaloCatalog from the result of running the FOF algorithm (see the sketch following this list).

  • By default, the halo concentration values stored in the Concentration column of a HaloCatalog object are generated using the input mass definition and the analytic formulas from Dutton and Maccio 2014. Users can overwrite this column with their own values if they wish to use custom concentration values when generating HOD catalogs.
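
For the FOF workflow mentioned above, a minimal sketch looks like the following; the linking length, minimum group size, and particle mass are illustrative values, and the keyword names reflect my reading of the FOF and to_halos() interfaces, so they should be checked against the API reference:

from nbodykit.lab import FOF, cosmology

# run the FOF finder on a particle catalog (linking length in units of the
# mean inter-particle separation; both values here are illustrative)
fof = FOF(cat, linking_length=0.2, nmin=20)

# convert the FOF groups directly into a HaloCatalog, assigning an
# illustrative particle mass so that halo masses can be computed
halos = fof.to_halos(particle_mass=1e12, cosmo=cosmology.Planck15, redshift=0.)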

The HODCatalog Class

Users can initialize the HOD catalog directly from the UserSuppliedHaloCatalog object and the desired HOD parameters. The HODCatalog object will include all of the columns from the UserSuppliedHaloCatalog object, with the usual columns Position, Velocity, and VelocityOffset for the generated galaxies. The additional columns are:

  1. conc_NFWmodel: the concentration of the halo

  2. gal_type: the galaxy type, 0 for centrals and 1 for satellites

  3. halo_id: the global ID of the halo that this galaxy belongs to, between 0 and csize

  4. halo_local_id: the local ID of the halo that this galaxy belongs to, between 0 and size

  5. halo_mvir: the halo mass, in units of \(M_\odot/h\)

  6. halo_nfw_conc: alias of conc_NFWmodel

  7. halo_num_centrals: the number of centrals that this halo hosts, either 0 or 1

  8. halo_num_satellites: the number of satellites that this halo hosts

  9. halo_rvir: the halo radius, in units of \(\mathrm{Mpc}/h\)

  10. halo_upid: equal to -1; should be ignored by the user

  11. halo_vx, halo_vy, halo_vz: the three components of the halo velocity, in units of km/s

  12. halo_x, halo_y, halo_z: the three components of the halo position, in units of \(\mathrm{Mpc}/h\)

  13. host_centric_distance: the distance from this galaxy to the center of the halo, in units of \(\mathrm{Mpc}/h\)

  14. vx, vy, vz: the three components of the galaxy velocity, equal to Velocity, in units of km/s

  15. x,y,z: the three components of the galaxy position, equal to Position, in units of \(\mathrm{Mpc}/h\)

Below we populate galaxies with Zheng07Model, and compute the number of centrals and satellites in the catalog, using the gal_type column.

[5]:
from nbodykit.hod import Zheng07Model
hod = halos.populate(Zheng07Model, alpha=0.5, sigma_logM=0.40, seed=42)

print("total number of HOD galaxies = ", hod.csize)
print(hod.columns)

print("number of centrals = ", hod.compute((hod['gal_type']==0).sum()))
print("number of satellites = ", hod.compute((hod['gal_type']==1).sum()))
total number of HOD galaxies =  291
['Position', 'Selection', 'Value', 'Velocity', 'VelocityOffset', 'Weight', 'conc_NFWmodel', 'gal_type', 'halo_hostid', 'halo_id', 'halo_mvir', 'halo_num_centrals', 'halo_num_satellites', 'halo_rvir', 'halo_upid', 'halo_vx', 'halo_vy', 'halo_vz', 'halo_x', 'halo_y', 'halo_z', 'host_centric_distance', 'vx', 'vy', 'vz', 'x', 'y', 'z']
number of centrals =  94
number of satellites =  197

Caveats

  • The HOD population step requires halo concentration. If the user wishes to use custom concentration values, the UserSuppliedHaloCatalog table should contain a halo_nfw_conc column. Otherwise, the analytic prescriptions from Dutton and Maccio 2014 are used.

Repopulating a HOD Catalog

We can also quickly repopulate a HOD catalog in place, generating a new set of galaxies for the same set of halos, either changing the random seed or the HOD parameters. For example,

[6]:
# repopulate, just changing the random seed
hod.repopulate(seed=84)
print("total number of HOD galaxies = ", hod.csize)

print("number of centrals = ", hod.compute((hod['gal_type']==0).sum()))
print("number of satellites = ", hod.compute((hod['gal_type']==1).sum()))

# re-populate with new parameters
hod.repopulate(logM0=13.2, logM1=14.5)
print("total number of HOD galaxies = ", hod.csize)

print("number of centrals = ", hod.compute((hod['gal_type']==0).sum()))
print("number of satellites = ", hod.compute((hod['gal_type']==1).sum()))
total number of HOD galaxies =  282
number of centrals =  95
number of satellites =  187
total number of HOD galaxies =  141
number of centrals =  93
number of satellites =  48

Note

For examples using HOD mocks, see the cookbook/hod-mocks.ipynb recipe in The Cookbook.

Using a Custom HOD Model

Users can implement catalogs that use custom HOD modeling by subclassing the HODBase class. This base class is abstract, and subclasses must implement the __makemodel__() function. This function returns a HodModelFactory object, which is the halotools object responsible for supporting custom HOD models. For more information on designing your own HOD model using halotools, see this series of halotools tutorials.
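
As a rough sketch of what such a subclass might look like, the example below simply re-composes the standard Zheng 07 components from halotools. The import location of HODBase (assumed to be nbodykit.hod, alongside Zheng07Model) and the use of the halotools components with their default constructors are assumptions here; a real model would typically also thread through the catalog's cosmology, redshift, and mass definition:

from nbodykit.hod import HODBase  # assumed import location, alongside Zheng07Model
from halotools.empirical_models import (HodModelFactory, Zheng07Cens, Zheng07Sats,
                                        TrivialPhaseSpace, NFWPhaseSpace)

class MyHODModel(HODBase):
    """A sketch of a custom HOD model that re-composes the standard Zheng 07
    components; any other halotools occupation or profile component could be
    swapped in here."""

    def __makemodel__(self):
        # build and return the halotools composite model
        return HodModelFactory(centrals_occupation=Zheng07Cens(),
                               centrals_profile=TrivialPhaseSpace(),
                               satellites_occupation=Zheng07Sats(),
                               satellites_profile=NFWPhaseSpace())

# the custom model would then be used just like Zheng07Model, e.g.
# hod = halos.populate(MyHODModel, seed=42)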


Dealing with Data on a Mesh

What is a MeshSource?

Often in large-scale structure data analysis, we wish to manipulate representations of continuous quantities on a discrete grid. The canonical example is the analysis of the cosmological density field, interpolated on to a 3D mesh from a discrete set of galaxies. To support such calculations, nbodykit provides the nbodykit.base.mesh.MeshSource object.

Fundamentally, the MeshSource object stores a (possibly weighted) density field on a three-dimensional mesh, with the Nmesh parameter determining the number of grid cells per side (such that there are \(\mathrm{Nmesh}^3\) mesh cells). nbodykit adds the functionality to analyze these fields in both configuration space (often referred to as real space) and Fourier space through an interface to the RealField and ComplexField objects implemented by the pmesh package. These objects are paired classes, related through the operation of a 3D fast Fourier transform (FFT). The FFT operation implemented in pmesh relies on the pfft-python package, which is a Python binding of PFFT, a massively parallel FFT library.

Use Cases

The MeshSource is an abstract base class – it cannot be directly initialized. Instead, nbodykit includes several specialized subclasses of MeshSource in the nbodykit.source.mesh module. In general, these subclasses fall into three categories:

  1. Generating mesh data from a CatalogSource (see Converting a CatalogSource to a Mesh)

  2. Reading mesh data from disk (see Saving and Loading a Mesh)

  3. Generating mock fields directly on a mesh (see Gaussian Realizations)

Painting the Mesh

The MeshSource.paint() function produces the values of the field on the mesh, returning either a RealField or ComplexField. This function treats the mesh equally in either configuration space or Fourier space, internally performing the appropriate FFTs. By specifying the mode keyword to the paint() function, users can access either the field data in configuration space or the complex modes of the field in Fourier space.

The “painting” nomenclature derives from the most common use case. The process of interpolating a set of discrete objects on to the mesh evokes the imagery of “painting” the mesh. More generally, the paint() function is responsible for filling in the mesh with data, which could also involve reading data from disk or generating mock fields directly on the mesh.

For further details and examples of painting a catalog of discrete objects to a mesh, see Painting Catalogs to a Mesh.

Fields: RealField and ComplexField

The MeshSource class provides an interface to the pmesh.pm.RealField and pmesh.pm.ComplexField objects. These classes behave like numpy arrays and include functions to perform parallel forward and inverse FFTs. These field objects are initialized from a pmesh.pm.ParticleMesh, which sets the number of mesh cells and stores FFT-related grid quantities.

[1]:
from pmesh.pm import ParticleMesh, RealField, ComplexField

# an 8^3 mesh
pm = ParticleMesh(Nmesh=[8,8,8])

# initialize a RealField
rfield = RealField(pm)

# shape
print("shape = ", rfield.shape)

# set entire mesh to unity
rfield[...] = 1.0

# print the mean of the underlying array
print("mean = ", rfield.value.mean())
shape =  (8, 8, 8)
mean =  1.0

All MeshSource objects implement either the MeshSource.to_real_field() function or the MeshSource.to_complex_field() function. These functions are responsible for returning either a RealField or a ComplexField. The MeshSource.paint function calls these functions, providing the core functionality of the MeshSource class.

The c2r() and r2c() functions

Users can transform between RealField and ComplexField objects using the r2c() function for forward FFTs and the c2r() function for inverse FFTs. These operations take advantage of the fact that the field objects in configuration space store real-valued quantities to perform real-to-complex FFTs. This type of FFT uses the symmetry of real-valued quantities to store only half of the complex modes along the z axis.

[2]:
# perform the forward FFT
cfield = rfield.r2c()

# stores Nmesh/2+1 in z axis b/c of conjugate symmetry
print("shape = ", cfield.shape)

# k=0 mode is the mean value of configuration space field
print("mean of configuration space field from k=0 = ", cfield[0,0,0])

# perform the inverse FFT
rfield2 = cfield.c2r()

# print the mean of the underlying array
print("mean of real field = ", rfield2.value.mean())
shape =  (8, 8, 5)
mean of configuration space field from k=0 =  (1+0j)
mean of real field =  1.0

Storing Meta-data

For all MeshSource objects, the input parameters and additional meta-data are stored in the attrs dictionary attribute.

API

For more information about specific mesh sources, please see the API section.

Creating a Mesh

In this section, we outline how users interact with data on a mesh in nbodykit. The main ways to create a mesh include:

Converting a CatalogSource to a Mesh

Users can create mesh objects from CatalogSource objects by specifying the desired number of cells per mesh side via the Nmesh parameter and using the to_mesh() function. Below, we convert a UniformCatalog to a MeshSource using a \(16^3\) mesh in a box of side length \(1\) \(\mathrm{Mpc}/h\).

[2]:
from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

mesh = cat.to_mesh(Nmesh=16)

print("mesh = ", mesh)
mesh =  (UniformCatalog(size=96, seed=42) as CatalogMesh)

Important

The to_mesh() operation does not perform any interpolation operations, which nbodykit refers to as “painting” the mesh. This function merely initializes a new object that sets up the mesh with the configuration provided by the user. For more details and examples of painting a catalog of discrete objects to a mesh, see Painting Catalogs to a Mesh.

The Window Kernel

When interpolating discrete particles on to a regular mesh, we must choose which kind of interpolation kernel to use. The kernel determines which cells an object will contribute to on the mesh. In the simplest case, known as Nearest Grid Point interpolation, an object only contributes to the one cell that is closest to its position. In general, higher order interpolation schemes, which spread out objects over more cells, lead to more accurate results. See Section 2.3 of Sefusatti et al. 2015 for an introduction to different interpolation schemes.

nbodykit supports several different interpolation kernels, which can be specified using the window keyword of the to_mesh() function. The default value is cic, representing the second-order interpolation scheme known as Cloud In Cell. The third-order interpolation scheme, known as Triangular Shaped Cloud, can be specified by setting window to tsc. CIC and TSC are the most commonly used interpolation windows in the field of large-scale structure today.

Support for wavelet-based kernels is also provided. The Daubechies wavelet with various sizes can be used by specifying db6, db12, or db20. The closely related Symlet wavelets, the nearly symmetric variants of the Daubechies wavelets, can be used by specifying sym6, sym12, or sym20; they tend to perform better than the non-symmetric versions. For more information on using wavelet kernels for painting a mesh, see Cui et al. 2008.

Note that the non-traditional interpolation windows can be considerably slower than the cic or tsc methods. For this reason, nbodykit uses the cic interpolation window by default. See pmesh.window.methods for the full list of supported window kernels.

Note

Please see this cookbook recipe for a notebook exploring the accuracy of different interpolation windows.

Interlacing

nbodykit provides support for the interlacing technique, which can reduce the effects of aliasing when Fourier transforming the density field on the mesh. This technique involves interpolating objects on to two separate meshes, separated by half of a cell size. When combining the complex fields in Fourier space from these two meshes, the effects of aliasing are significantly reduced on the combined field. For a more detailed discussion behind the mathematics of this technique, see Section 3.1 of Sefusatti et al. 2015.

By default this technique is turned off, but it can be turned on by the user by passing interlaced=True to the to_mesh() function.

Note

Please see this cookbook recipe for a notebook exploring the effects of interlacing on density fields in Fourier space.

Compensation: Deconvolving the Window Kernel

Interpolating discrete objects on to the mesh produces a density field defined on the mesh that is convolved with the interpolation kernel. In Fourier space, the complex field is then the product of the true density field and the Fourier transform of the window kernel as given by the Convolution Theorem. For the TSC and CIC window kernels, there are well-known correction factors that can be applied to the density field in Fourier space. If we apply these correction factors, we refer to the field as “compensated”, and the use of these correction factors is controlled via the compensated keyword of the to_mesh() function.

If compensated is set to True, the correction factors that will be applied are:

Window   Interlacing   Compensation Function       Reference
cic      False         CompensateCICAliasing()     eq 20 of Jing et al. 2005
tsc      False         CompensateTSCAliasing()     eq 20 of Jing et al. 2005
cic      True          CompensateCIC()             eq 18 of Jing et al. 2005 (\(p=2\))
tsc      True          CompensateTSC()             eq 18 of Jing et al. 2005 (\(p=3\))

Note

If window is not equal to tsc or cic, no compensation correction is currently implemented by default in nbodykit, and if compensated is set to True, an exception will be raised. Users can implement custom compensation functions via the syntax for Applying Functions to the Mesh.
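
Putting these options together, a typical call might look like the sketch below (the mesh size is illustrative):

# a TSC window, with interlacing and compensation enabled
mesh = cat.to_mesh(Nmesh=64, window='tsc', interlaced=True, compensated=True)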

Additional Mesh Configuration Options

The to_mesh() function supports additional keywords for customizing the painting process. These keywords are:

  • weight:

    The weight keyword can be used to select a column to use as weights when painting objects to a mesh. By default, it is set to the Weight column, a default column equal to unity for all objects. In this default configuration, the density field on the mesh is normalized as \(1+\delta\). See cookbook/painting.ipynb#Painting-Multiple-Species-of-Particles for an example of using the weight keyword to represent particle masses.

  • value:

    The value keyword can be used to select a column to use as the field value when painting a mesh. The mesh field is a weighted average of value, with the weights given by weight. By default, it is set to the Value column, a default column equal to unity for all objects. In this default configuration, the density field on the mesh is normalized as \(1+\delta\). See cookbook/painting.ipynb#Painting-the-Line-of-sight-Momentum-Field for an example of using the value keyword to paint the momentum field.

  • selection:

    The selection keyword specifies a boolean column that selects a subset of the CatalogSource object to paint to the mesh. By default, the selection keyword is set to the Selection column, a default column in all CatalogSource objects that is set to True for all objects.

  • position:

    By default, nbodykit assumes that the Position column is the name of the column holding the Cartesian coordinates of the objects in the catalog. Thus, the to_mesh() function uses this column to paint a catalog to a mesh. The user can change this behavior by specifying the name of the desired column using the position keyword of the to_mesh() function.

Gaussian Realizations

A Gaussian realization of a density field can be initialized directly on a mesh using the LinearMesh class. This class generates the Fourier modes of a density field with a variance set by an input power spectrum function. It allows the user to create density fields with a known power spectrum, which is often a useful tool in large-scale structure analysis.

Users can take advantage of the built-in linear power spectrum class, LinearPower, or use their own function to specify the desired power spectrum. The function should take a single argument k, the wavenumber. Several transfer functions can be used with the LinearPower class, including transfer functions computed with the CLASS CMB Boltzmann code, as well as the analytic fitting formulas from Eisenstein and Hu 1998 with and without Baryon Acoustic Oscillations.

In addition to the power spectrum function, users need to specify a mesh size via the Nmesh parameter and a box size via the BoxSize parameter. For example, to create a density field on a mesh using the 2015 Planck cosmological parameters and the Eisenstein-Hu linear power spectrum at redshift \(z=0\), use

[3]:
from nbodykit.lab import LinearMesh, cosmology
from matplotlib import pyplot as plt

cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift=0, transfer='EisensteinHu')

# initialize the mesh
mesh = LinearMesh(Plin, Nmesh=128, BoxSize=1380, seed=42)

# preview the density field
plt.imshow(mesh.preview(axes=[0,1]))
[3]:
<matplotlib.image.AxesImage at 0x7f0b1af1b6d8>
_images/mesh_creating_4_1.png

From In-memory Data

From a RealField or ComplexField

If a pmesh.pm.RealField or pmesh.pm.ComplexField object is already stored in memory, they can be converted easily into a mesh object using the FieldMesh class. For example,

[4]:
from nbodykit.lab import FieldMesh
from pmesh.pm import RealField, ComplexField, ParticleMesh

# an 8^3 mesh
pm = ParticleMesh(Nmesh=[8,8,8])

# initialize a RealField
rfield = RealField(pm)

# set entire mesh to unity
rfield[...] = 1.0

# initialize from the RealField
real_mesh = FieldMesh(rfield)

# can also initialize from a ComplexField
cfield = rfield.r2c()
complex_mesh = FieldMesh(cfield)
From a Numpy Array

Given a 3D numpy array stored in memory that represents data on a mesh, users can initialize a mesh object using the ArrayMesh class. The array for the full mesh must be stored in memory on a single rank and not split in parallel across multiple ranks. After initializing the ArrayMesh object, the mesh data will be automatically spread out across the available ranks.

A common use case for this class is when a single rank handles the input/output of the mesh data in the form of numpy arrays. Then, a single rank can read in the array data from disk, and the mesh object can be initialized using the ArrayMesh class.

For example,

[5]:
from nbodykit.lab import ArrayMesh
import numpy

# generate random data on a 128^3 mesh
data = numpy.random.random(size=(128,128,128))

# initialize the mesh
mesh = ArrayMesh(data, BoxSize=1.0)

# preview the density mesh
plt.imshow(mesh.preview(axes=[0,1]))
[5]:
<matplotlib.image.AxesImage at 0x7f0b1aebf0b8>
_images/mesh_creating_8_1.png

Painting Catalogs to a Mesh

The MeshSource.paint() function produces the values of the field on the mesh, returning either a RealField or ComplexField. In this section, we focus on the process of interpolating a set of discrete objects in a CatalogSource on to a mesh and how users can customize this procedure.

The Painted Field

The compute() function paints mass-weighted (or equivalently, number-weighted) fields to a mesh. So, when painting a CatalogSource to a mesh, the field \(F(\vx)\) that is painted is:

\[F(\vx) = \left[ 1 + \delta'(\vx) \right] V(\vx),\]

where \(V(\vx)\) represents the field value painted to the mesh and \(\delta'(\vx)\) is the (weighted) overdensity field, given by:

\[\delta'(\vx) = \frac{n'(\vx)}{\bar{n}'} - 1,\]

where \(\bar{n}'\) is the weighted mean number density of objects. Here, quantities denoted with a prime (\('\)) indicate weighted quantities. The unweighted number density field \(n(\vx)\) is related to its weighted counterpart via \(n'(\vx) = W(\vx) n(\vx)\), where \(W(\vx)\) are the weights.

Users can control the behavior of the value \(V(\vx)\) and the weights \(W(\vx)\) when converting a CatalogSource object to a mesh via the to_mesh() function. Specifically, the weight and value keywords allow users to indicate the name of the column in the CatalogSource to use for \(W(\vx)\) and \(V(\vx)\). See Additional Mesh Configuration Options for more details on these keywords.
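
For instance, the following sketch paints a mass-weighted field by pointing the weight keyword at a randomly generated Mass column; the column values here are illustrative:

from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
cat['Mass'] = 10**(cat.rng.uniform(12, 15))  # illustrative masses

# W(x) is taken from the Mass column; V(x) remains the default Value column
mesh = cat.to_mesh(Nmesh=64, weight='Mass')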

Operations

The painted field is an instance of pmesh.pm.RealField. Methods are provided for Fourier transforms (to a pmesh.pm.ComplexField object), and transfer functions on both the Real and Complex fields:

field = mesh.compute()

def tf(k, v):
    # an example transfer function applied to the complex field in Fourier space
    return 1j * k[2] / k.normp(zeromode=1) ** 0.5 * v

gradz = field.r2c().apply(tf).c2r()

The underlying numerical values of the field can be accessed via indexing. A RealField is distributed across the entire MPI communicator of the mesh object, and in general each single rank in the MPI communicator only sees a region of the field.

  • numpy methods (e.g. field[...].std()) operate on the local field values only; they therefore give the correct global answer only when a single rank is used.

  • collective methods reduce the result over the entire MPI communicator and are correct on any number of ranks. For example, to compute the standard deviation of the field in a script that runs on several MPI ranks, use ((field ** 2).cmean() - field.cmean() ** 2) ** 0.5 instead of field[...].std().

The positions of the grid points on which the field value resides can be obtained from

field = mesh.compute()

grid = field.pm.generate_uniform_particle_grid(shift=0)

A low-resolution projected preview of the field can be obtained (the example below projects along the x-y plane):

field = mesh.compute()

from matplotlib import pyplot as plt

plt.imshow(field.preview(Nmesh=64, axes=[0, 1]).T,
           origin='lower',
           extent=(0, field.BoxSize[0], 0, field.BoxSize[1]))

Shot-noise

The shot-noise level of a weighted field is given by

\[SN = L^3 \frac{\sum W^2}{(\sum W)^2}\]

where \(L^3\) is the total volume of the box and \(W\) is the weight of each individual object. In the limit where \(W=1\) everywhere, the shot noise reduces to \(1 / \bar{n}\).
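
As a small worked example of this formula, using the mass-weighted catalog from the sketch above (and assuming the box size is available from the attrs dictionary):

import numpy

# weights of the objects
W = cat.compute(cat['Mass'])

# box volume L^3 (BoxSize may be stored as a scalar or a 3-vector in attrs)
BoxSize = numpy.empty(3)
BoxSize[:] = cat.attrs['BoxSize']
volume = BoxSize.prod()

# shot noise of the weighted field; reduces to volume / len(cat) when W = 1
SN = volume * (W**2).sum() / W.sum()**2
print("shot noise = ", SN)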

Default Behavior

The default behavior is \(W(\vx) = 1\) and \(V(\vx) = 1\), in which case the painted field is given by:

\[F^\mathrm{default}(\vx) = 1 + \delta(\vx).\]

In the CatalogSource.to_mesh() function, the default values for the value and weight keywords are the Value and Weight columns, respectively. These are default columns that are in all CatalogSource objects that are set to unity by default.

More Examples

The Painting Recipes section of the cookbook contains several more examples that change the default behavior to paint customized fields to the mesh.

For example, users can set \(V(\vx)\) to a column holding a component of the velocity field, in which case the painted field \(F(\vx)\) would represent the momentum (mass-weighted velocity) field. See the Painting the Line-of-sight Momentum Field recipe for further details.

Another common example is setting the weights \(W(\vx)\) to a column representing mass and painting multiple species of particles to the same mesh using the MultipleSpeciesCatalog object. See the Painting Multiple Species of Particles recipe for more details.

Common Mesh Operations

In this section, we detail some common operations for manipulating data defined on a mesh via a MeshSource object.

Previewing the Mesh

The MeshSource.preview() function allows users to preview a low-resolution mesh by resampling the mesh and gathering the mesh to all ranks. It can also optionally project the mesh across multiple axes, which enables visualizations of the projected density field for quick data inspection by the user.

For example, below we initialize a LinearMesh object on a \(128^3\) mesh and preview the mesh on a \(64^3\) mesh after projecting the field along two axes:

[2]:
from nbodykit.lab import LinearMesh, cosmology
from matplotlib import pyplot as plt

cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift=0, transfer='EisensteinHu')

mesh = LinearMesh(Plin, Nmesh=128, BoxSize=1380, seed=42)

density = mesh.preview(Nmesh=64, axes=(0,1))
plt.imshow(density)
[2]:
<matplotlib.image.AxesImage at 0x7fb154c74828>
_images/mesh_common-operations_2_2.png

Note

The previewed mesh result is broadcast to all ranks, so each rank allocates memory for \(\mathrm{Nmesh}^3\) cells.

Saving and Loading a Mesh

The MeshSource.save() function paints the mesh via a call to MeshSource.paint() and saves the mesh using a bigfile format. The output mesh saved to file can be in either configuration space or Fourier space, by specifying mode as either real or complex.

Below, we save our LinearMesh to a bigfile file:

[3]:
# save the RealField
mesh.save('linear-mesh-real.bigfile', mode='real', dataset='Field')

# save the ComplexField
mesh.save('linear-mesh-complex.bigfile', mode='complex', dataset='Field')

The saved mesh can be loaded from disk using the BigFileMesh class:

[4]:
from nbodykit.lab import BigFileMesh
import numpy

# load the mesh in the form of a RealField
real_mesh = BigFileMesh('linear-mesh-real.bigfile', 'Field')

# return the RealField via paint
rfield = real_mesh.paint(mode='real')

# load the mesh in the form of a ComplexField
complex_mesh = BigFileMesh('linear-mesh-complex.bigfile', 'Field')

# FFT to get the ComplexField as a RealField
rfield2 = complex_mesh.paint(mode='real')

# the two RealFields must be the same!
numpy.allclose(rfield.value, rfield2.value)
[4]:
True

Here, we load our meshes in configuration space and Fourier space and then paint both with mode=real and verify that the results are the same.

Applying Functions to the Mesh

nbodykit supports performing transformations to the mesh data by applying arbitrary functions in either configuration space or Fourier space. Users can use the MeshSource.apply() function to apply these transformations. The function applied to the mesh should take two arguments, x and v:

  1. The x argument provides a list of length three holding the coordinate arrays that define the mesh. These arrays broadcast to the full shape of the mesh, i.e., they have shapes \((N_x,1,1)\), \((1,N_y,1)\), and \((1,1,N_z)\) if the mesh has shape \((N_x, N_y, N_z)\).

  2. The v argument is the array holding the value of the mesh field at the coordinate arrays in x

The units of the x coordinate arrays depend upon the values of the kind and mode keywords passed to the apply() function. The various cases are:

mode      kind          range of the "x" argument
real      relative      \([-L/2, L/2)\)
real      index         \([0, N)\)
complex   wavenumber    \([-\pi N/L, \pi N / L)\)
complex   circular      \([-\pi, \pi)\)
complex   index         \([0, N)\)

Here, \(L\) is the size of the box and \(N\) is the number of cells per mesh side.

One common use of the MeshSource.apply() functionality is applying a compensation function to the mesh to correct for the interpolation window. The table of built-in compensation functions in the Compensation: Deconvolving the Window Kernel section provides examples of the syntax needed to apply functions to the mesh.

In the example below, we apply a filter function in Fourier space that divides the mesh by the squared norm of the wavenumber k on the mesh, and then print out the first few mesh cells of the filtered mesh to verify the function was applied properly.

[5]:
def filter(k, v):
    kk = sum(ki ** 2 for ki in k) # k^2 on the mesh
    kk[kk == 0] = 1
    return v / kk # divide the mesh by k^2

# apply the filter and get a new mesh
filtered_mesh = mesh.apply(filter, mode='complex', kind='wavenumber')

# get the filtered RealField object
filtered_rfield = filtered_mesh.paint(mode='real')

print("head of filtered Realfield = ",  filtered_rfield[:10,0,0])
print("head of original RealField = ",  rfield[:10,0,0])
head of filtered Realfield =  [1094.8562 1060.4702 1058.0209 1059.883  1067.5662 1104.8212 1150.1516
 1193.5796 1244.2278 1295.758 ]
head of original RealField =  [0.14927006 0.85538614 1.7375357  2.270503   1.6951047  1.9259002
 1.4472127  0.8612448  0.8651851  1.6219351 ]

Resampling a Mesh

Users can resample a mesh by specifying the Nmesh keyword to the MeshSource.paint() function. For example, below we resample a LinearMesh object, changing the mesh resolution from Nmesh=64 to Nmesh=32.

[6]:
from nbodykit.lab import LinearMesh, cosmology

# linear mesh
Plin = cosmology.LinearPower(cosmology.Planck15, redshift=0.55, transfer='EisensteinHu')
source = LinearMesh(Plin, Nmesh=64, BoxSize=512, seed=42)

# paint, re-sampling to Nmesh=32
real = source.paint(mode='real', Nmesh=32)

print("original Nmesh = ", source.attrs['Nmesh'])
print("resampled Nmesh = ", real.Nmesh)
print("shape of resampled density field = ", real.cshape)
original Nmesh =  [64 64 64]
resampled Nmesh =  [32 32 32]
shape of resampled density field =  [32 32 32]

Available Algorithms

nbodykit includes several state-of-the-art implementations of canonical algorithms in the field of large-scale structure. It includes functionality for computing a wide variety of clustering statistics on data, as well as algorithms for finding groups of objects.

We describe these algorithms in detail below, and users can also find further examples in The Cookbook section of the documentation.

Power Spectrum Algorithms

Simulation Box Power Spectrum/Multipoles (FFTPower)

The FFTPower class computes the 1d power spectrum \(P(k)\), 2d power spectrum \(P(k,\mu)\), and/or multipoles \(P_\ell(k)\) for data in a simulation box, using a Fast Fourier Transform (FFT). Here, we provide a brief overview of the algorithm itself as well as the key things to know for the user to get up and running quickly.

Note

To jump right into the FFTPower algorithm, see this cookbook recipe for a detailed walk-through of the FFTPower algorithm.

The Algorithm

The steps involved in computing the power spectrum via FFTPower are as follows:

  1. Generate data on a mesh

    Data must be painted on to a discrete mesh to compute the power spectrum. There are several ways to generate data on a mesh (see Creating a Mesh), but the most common is painting a discrete catalog of objects on to a mesh (see Converting a CatalogSource to a Mesh and Painting Catalogs to a Mesh). The FFTPower class accepts input data in either the form of a MeshSource or a CatalogSource. In the latter case, the catalog is automatically converted to a mesh using the default parameters of the to_mesh() function.

    When converting from a catalog to a mesh, users can customize the painting procedure via the options of the to_mesh() function. These options have important effects on the resulting power spectrum of the field in Fourier space. See Converting a CatalogSource to a Mesh for more details.

  2. FFT the mesh to Fourier space

    Once the density field is painted to the mesh, the Fourier transform of the field \(\delta(\vx)\) is performed in parallel to obtain the complex modes of the overdensity field, \(\delta(\vk)\). The field is stored using the ComplexField object.

  3. Generate the 3D power spectrum on the mesh

    The 3D power spectrum field is computed on the mesh, using

    \[P(\mathbf{k}) = \delta(\mathbf{k}) \cdot \delta^\star(\mathbf{k}),\]

    where \(\delta^\star (\mathbf{k})\) is the complex conjugate of \(\delta(\mathbf{k})\).

  4. Perform the binning in the specified basis

    Finally, the 3D power defined on the mesh \(P(\mathbf{k})\) is binned using the basis specified by the user. The available options for binning are:

    • 1D binning as a function of wavenumber \(k\)

    • 2D binning as a function of wavenumber \(k\) and cosine of the angle to the line-of-sight \(\mu\)

    • Multipole binning as a function of \(k\) and multipole number \(\ell\)

The Functionality

Users can compute various quantities using the FFTPower. We’ll discuss the available functionality briefly in the sub-sections below.

Auto power spectra and cross power spectra

Both auto and cross spectra are supported. Users can compute cross power spectra by passing a second mesh object to the FFTPower class using the second keyword. The first mesh object should always be specified as the first argument.
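
For example, a cross power spectrum between two catalogs might look like the following sketch (the catalogs and mesh size here are illustrative):

from nbodykit.lab import UniformCatalog, FFTPower

cat1 = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
cat2 = UniformCatalog(nbar=100, BoxSize=1.0, seed=84)

# cross power spectrum of cat1 and cat2
r_cross = FFTPower(cat1, mode='1d', Nmesh=32, second=cat2)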

1D Power Spectrum, \(P(k)\)

The 1D power spectrum \(P(k)\) can be computed by specifying the mode argument as “1d”. The wavenumber binning will be linear, and can be customized by specifying the dk and kmin attributes. By default, the edge of the last wavenumber bin is the Nyquist frequency, given by \(k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\). If dk is not specified, then the fundamental mode of the box is used: \(2\pi/L_\mathrm{box}\).
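
For instance, continuing the sketch above, custom linear bins can be requested via the dk and kmin keywords (the values below are illustrative):

# 1D power spectrum with custom linear k bins
r1d = FFTPower(cat1, mode='1d', Nmesh=32, dk=10.0, kmin=5.0)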

2D Power Spectrum, \(P(k,\mu)\)

The 2D power spectrum \(P(k,\mu)\) can be computed by specifying the mode argument as “2d”. The number of \(\mu\) bins is specified via the Nmu keyword. The bins range from \(\mu=0\) to \(\mu=1\).

Multipoles of \(P(k,\mu)\)

The FFTPower class can also compute the multipoles of the 2D power spectrum, defined as

\[P_\ell(k) = (2\ell + 1) \int_0^1 d\mu P(k,\mu) \mathcal{L}_\ell(\mu),\]

where \(\mathcal{L}_\ell\) is the Legendre polynomial of order \(\ell\). Users can specify which multipoles they wish to compute by passing a list of the desired \(\ell\) values as the poles keyword to the FFTPower class.

For example, we can compute both \(P(k,\mu)\) and \(P_\ell(k)\) for a uniform catalog of objects using:

[2]:
from nbodykit.lab import UniformCatalog, FFTPower

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

r = FFTPower(cat, mode='2d', Nmesh=32, Nmu=5, poles=[0,2,4])
The Results

The power spectrum results are stored in two attributes of the initialized FFTPower object: power and poles. These attributes are BinnedStatistic objects, which behave like structured numpy arrays and store the measured results on a coordinate grid defined by the bins. See Analyzing your Results for a full tutorial on using the BinnedStatistic class.

The power attribute stores the following variables:

  • k :

    the mean value for each k bin

  • mu : (if mode=2d)

    the mean value for each mu bin

  • power :

    complex array storing the real and imaginary components of the power

  • modes :

    the number of Fourier modes averaged together in each bin

The poles attribute stores the following variables:

  • k :

    the mean value for each k bin

  • power_L :

    complex array storing the real and imaginary components for the \(\ell=L\) multipole

  • modes :

    the number of Fourier modes averaged together in each bin

Note that measured power results for bins where modes is zero (no data points to average over) are set to NaN.

In our example, the power and poles attributes are:

[3]:
# the 2D power spectrum results
print("power = ", r.power)
print("variables = ", r.power.variables)
for name in r.power.variables:
    var = r.power[name]
    print("'%s' has shape %s and dtype %s" %(name, var.shape, var.dtype))
power =  <BinnedStatistic: dims: (k: 16, mu: 5), variables: ('k', 'mu', 'power', 'modes')>
variables =  ['k', 'mu', 'power', 'modes']
'k' has shape (16, 5) and dtype float64
'mu' has shape (16, 5) and dtype float64
'power' has shape (16, 5) and dtype complex128
'modes' has shape (16, 5) and dtype int64
[4]:
# the multipole results
print("poles = ", r.poles)
print("variables = ", r.poles.variables)
for name in r.poles.variables:
    var = r.poles[name]
    print("'%s' has shape %s and dtype %s" %(name, var.shape, var.dtype))
poles =  <BinnedStatistic: dims: (k: 16), variables: 5 total>
variables =  ['k', 'power_0', 'power_2', 'power_4', 'modes']
'k' has shape (16,) and dtype float64
'power_0' has shape (16,) and dtype complex128
'power_2' has shape (16,) and dtype complex128
'power_4' has shape (16,) and dtype complex128
'modes' has shape (16,) and dtype int64

These attributes also store meta-data computed during the power calculation in the attrs dictionary. Most importantly, the shotnoise key gives the Poisson shot noise, \(P_\mathrm{shot} = V / N\), where V is the volume of the simulation box and N is the number of objects. The keys N1 and N2 store the number of objects.

In our example, the meta-data is:

[5]:
for k in r.power.attrs:
  print("%s = %s" %(k, str(r.power.attrs[k])))
Nmesh = [32 32 32]
BoxSize = [1. 1. 1.]
Lx = 1.0
Ly = 1.0
Lz = 1.0
volume = 1.0
mode = 2d
los = [0, 0, 1]
Nmu = 5
poles = [0, 2, 4]
dk = 6.283185307179586
kmin = 0.0
N1 = 96
N2 = 96
shotnoise = 0.010416666666666666

Note

The shot noise is not subtracted from any measured results. Users can access the Poisson shot noise value in the meta-data attrs dictionary.
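For example, a shot-noise-subtracted estimate can be formed by hand from the r object computed above:

# subtract the Poisson shot noise from the measured P(k,mu) by hand
Pkmu = r.power['power'].real - r.power.attrs['shotnoise']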

Saving and Loading

Results can easily be saved and loaded from disk in a reproducible manner using the FFTPower.save() and FFTPower.load() functions. The save function stores the state of the algorithm, including the meta-data in the FFTPower.attrs dictionary, in a JSON plaintext format.

[6]:
# save to file
r.save("fftpower-example.json")

# load from file
r2 = FFTPower.load("fftpower-example.json")

print("power = ", r2.power)
print("poles = ", r2.poles)
print("attrs = ", r2.attrs)
power =  <BinnedStatistic: dims: (k: 16, mu: 5), variables: ('k', 'mu', 'power', 'modes')>
poles =  <BinnedStatistic: dims: (k: 16), variables: 5 total>
attrs =  {'Nmesh': array([32, 32, 32]), 'BoxSize': array([1., 1., 1.]), 'Lx': 1.0, 'Ly': 1.0, 'Lz': 1.0, 'volume': 1.0, 'mode': '2d', 'los': [0, 0, 1], 'Nmu': 5, 'poles': [0, 2, 4], 'dk': 6.283185307179586, 'kmin': 0.0, 'N1': 96, 'N2': 96, 'shotnoise': 0.010416666666666666}
Common Pitfalls

The default configuration of nbodykit should lead to reasonable results when using the FFTPower algorithm. When performing custom, more complex analyses, some of the more common pitfalls are:

  • When the results of FFTPower do not seem to make sense, the most common culprit is usually the configuration of the mesh, and whether or not the mesh is “compensated”. In the language of nbodykit, “compensated” refers to whether the effects of the interpolation window used to paint the density field have been de-convolved in Fourier space. See the Converting a CatalogSource to a Mesh section for detailed notes on this procedure, and the sketch after this list for a brief example.

  • Be wary of normalization issues when painting weighted density fields. See Painting Catalogs to a Mesh for further details on painting meshes and Applying Functions to the Mesh for notes on applying arbitrary functions to the mesh while painting. See this cookbook recipe for examples of more advanced painting uses.
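As a sketch of the first point, the snippet below converts the catalog from the earlier example to a compensated, interlaced mesh before measuring the power. Note that the keyword controlling the interpolation window is named window in the nbodykit versions documented here (it is called resampler in later releases):

# paint with a TSC window, de-convolve the window (compensated=True),
# and reduce aliasing with interlacing
mesh = cat.to_mesh(Nmesh=64, window='tsc', compensated=True, interlaced=True)

# measure the power spectrum of the compensated density field
r = FFTPower(mesh, mode='2d', Nmu=5, poles=[0, 2, 4])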

Power Spectrum Multipoles of Survey Data (ConvolvedFFTPower)

The ConvolvedFFTPower class computes the power spectrum multipoles \(P_\ell(k)\) for data from an observational survey that includes non-trivial selection effects. The input data is expected to be in the form of angular coordinates (right ascension and declination) and redshift. The measured power spectrum multipoles represent the true multipoles convolved with the window function, where the window function refers to the Fourier transform of the survey volume. In this section, we provide an overview of the FFT-based algorithm used to compute the multipoles and highlight the important details users need to get up and running quickly.

Note

To jump right into the ConvolvedFFTPower algorithm, see this cookbook recipe for a detailed walk-through of the algorithm.

Some Background

We begin by defining the estimator for the multipole power spectrum, often referred to as the “Yamamoto estimator” in the literature (Yamamoto et al 2006):

(1)\[P_\ell(k) = \frac{2\ell+1}{A} \int \frac{\mathrm{d}\Omega_k}{4\pi} \left[ \int \mathrm{d}\mathbf{r}_1 \ F(\mathbf{r}_1) e^{i \mathbf{k} \cdot \mathbf{r}_1} \int \mathrm{d}\mathbf{r}_2 \ F(\mathbf{r}_2) e^{-i \mathbf{k} \cdot \mathbf{r}_2} \mathcal{L}_\ell(\hat{\mathbf{k}} \cdot \hat{\mathbf{r}}_2) - P_\ell^{\rm noise}(\mathbf{k}) \right],\]

where \(\Omega_k\) represents the solid angle in Fourier space, and \(\mathcal{L}_\ell\) is the Legendre polynomial of order \(\ell\). The weighted density field \(F(\mathbf{r})\) is defined as

(2)\[F(\mathbf{r}) = w_\mathrm{FKP}(\mathbf{r}) \left[ n_g'(\mathbf{r}) - \alpha' \ n'_s(\mathbf{r})\right],\]

where \(n_g'\) and \(n_s'\) are the number densities for the galaxy catalog and the synthetic catalog of random objects, respectively, and \(\alpha'\) is the ratio of the number of real galaxies to random objects. Often, the catalog of random objects has a much higher number density than the galaxy catalog, and the factor of \(\alpha'\) re-normalizes it to the proper number density. The normalization \(A\) is given by

(3)\[A = \int \mathrm{d}\mathbf{r}\ [n'_g(\mathbf{r}) \, w_\mathrm{FKP}(\mathbf{r})]^2.\]

The shot noise \(P_\ell^{\rm noise}\) in equation (1) is

(4)\[P_\ell^{\rm noise}(\mathbf{k}) = (1 + \alpha') \int \mathrm{d}\mathbf{r} \ n'_g(\mathbf{r}) \, w_\mathrm{FKP}^2(\mathbf{r}) \, \mathcal{L}_\ell (\hat{\mathbf{k}} \cdot \hat{\mathbf{r}}).\]

The FKP weights, first derived in Feldman, Kaiser, and Peacock 1994, minimize the variance of the estimator at a desired power spectrum value. Denoted as \(w_\mathrm{FKP}\), these weights are given by

(5)\[w_\mathrm{FKP}(\mathbf{r}) = \frac{1}{1 + n_g'(\mathbf{r}) P_0},\]

where \(P_0\) is the power spectrum amplitude, in units of \(h^{-3} \mathrm{Mpc}^3\), at which the estimator is optimized. For typical galaxy survey analyses, a value of order \(P_0 = 10^{4} \ h^{-3} \mathrm{Mpc}^3\) is usually assumed.

In our notation, quantities marked with a prime (\('\)) include completeness weights. These weights, denoted as \(w_c\), help account for systematic variations in the number density fields. Typically the weights for the random catalog are unity, but for full generality we allow for the possibility of non-unity weights for \(n_s\) as well. For example, \(\alpha' = N'_\mathrm{gal} / N'_\mathrm{ran}\), where \(N'_\mathrm{gal} = \sum_i^{N_\mathrm{gal}} w_c^i\) and \(N'_\mathrm{ran} = \sum_i^{N_\mathrm{ran}} w_c^i\).

The Algorithm

We use the FFT-based algorithm of Hand et al 2017 to compute the power spectrum multipoles. This work improves upon the previous FFT-based estimators of \(P_\ell(k)\) presented in Bianchi et al 2015 and Scoccimarro 2015 by requiring only \(2\ell+1\) FFTs to compute a multipole of order \(\ell\). The algorithm uses the spherical harmonic addition theorem to express equation (1) as

(6)\[P_\ell(k) = \frac{2\ell+1}{A} \int \frac{\mathrm{d}\Omega_k}{4\pi} F_0(\mathbf{k}) F_\ell(-\mathbf{k}),\]

where

(7)\[ \begin{align}\begin{aligned}F_\ell(\mathbf{k}) &= \int \mathrm{d}\mathbf{r} \ F(\mathbf{r}) e^{i \mathbf{k} \cdot \mathbf{r}} \mathcal{L}_\ell(\hat{\mathbf{k}} \cdot \hat{\mathbf{r}}),\\ &= \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{\ell} Y_{\ell m}(\hat{\mathbf{k}}) \int \mathrm{d}\mathbf{r} \ F(\mathbf{r}) Y_{\ell m}^*(\hat{\mathbf{r}}) e^{i \mathbf{k} \cdot \mathbf{r}}.\end{aligned}\end{align} \]

The sum over \(m\) in equation (7) contains \(2 \ell + 1\) terms, each of which can be computed using an FFT. Thus, the multipole moments can be expressed as a sum of Fourier transforms of the weighted density field, with weights given by the appropriate spherical harmonic. We evaluate equation (7) using the real-to-complex FFT functionality of the pmesh package and use the real form of the spherical harmonics \(Y_{\ell m}\).

We use the symbolic manipulation functionality available in the SymPy Python package to compute the spherical harmonic expressions in terms of Cartesian vectors. This allows the user to specify the desired multipoles at runtime, enabling the code to be used to compute multipoles of arbitrary \(\ell\).

Getting Started

Here, we outline the necessary steps for users to get started using the ConvolvedFFTPower algorithm to compute the power spectrum multipoles from an input data catalog.

The FKPCatalog Class

The ConvolvedFFTPower algorithm requires a galaxy catalog and a synthetic catalog of random objects without any clustering signal, which we refer to as the “data” and “randoms” catalogs, respectively. The CatalogSource object responsible for handling these two types of catalogs is the FKPCatalog class.

The FKPCatalog determines the size of the Cartesian box that the “data” and “randoms” are placed in, which is then also used during the FFT operation. By default, the box size is determined automatically from the maximum extent of the “randoms” positions. In this automatic case, the size of the box can be artificially extended and padded with zeros via the BoxPad keyword. Users can also specify a desired box size by passing in the BoxSize keyword.

The FKPCatalog object can be converted to a mesh object, FKPCatalogMesh, via the to_mesh() function. This mesh object knows how to paint the FKP weighted density field, given by equation (2), to the mesh using the “data” and “randoms” catalogs. With the FKP density field painted to the mesh, the ConvolvedFFTPower algorithm uses equations (6) and (7) to compute the multipoles specified by the user.
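A minimal sketch of this workflow is shown below, assuming data and randoms are existing CatalogSource objects that already carry the required Position, NZ, and weight columns (the mesh size and binning values are illustrative):

from nbodykit.lab import FKPCatalog, ConvolvedFFTPower

# combine the "data" and "randoms" into a single FKP catalog,
# padding the automatically-determined box by 2% on each side
fkp = FKPCatalog(data, randoms, BoxPad=0.02)

# paint the FKP density field, equation (2), to a 256^3 mesh
mesh = fkp.to_mesh(Nmesh=256, nbar='NZ', comp_weight='Weight', fkp_weight='FKPWeight')

# compute the monopole, quadrupole, and hexadecapole
r = ConvolvedFFTPower(mesh, poles=[0, 2, 4], dk=0.005, kmin=0.0)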

From Sky to Cartesian Coordinates

The Position column, holding the Cartesian coordinates, is required for both the “data” and “randoms” catalogs. We provide the function nbodykit.transform.SkyToCartesian() for converting sky coordinates, in the form of right ascension, declination, and redshift, to Cartesian coordinates. The conversion from redshift to comoving distance requires a cosmology instance, which can be specified via the Cosmology class.

For more details, see Converting from Sky to Cartesian Coordinates.
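For example, assuming the catalogs carry RA, DEC, and Z columns (the column names are illustrative), the Position column can be added as:

from nbodykit.lab import cosmology, transform

cosmo = cosmology.Planck15

# convert (ra, dec, redshift) to comoving Cartesian coordinates, in Mpc/h
data['Position'] = transform.SkyToCartesian(data['RA'], data['DEC'], data['Z'], cosmo=cosmo)
randoms['Position'] = transform.SkyToCartesian(randoms['RA'], randoms['DEC'], randoms['Z'], cosmo=cosmo)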

Specifying \(n'_g(z)\)

The number density of the “data” catalog as a function of redshift, in units of \(h^{3} \mathrm{Mpc}^{-3}\), is required to properly normalize the power spectrum using equation (3) and to compute the shot noise via equation (4). The “data” and “randoms” catalog should contain a column that gives this quantity, evaluated at the redshift of each object in the catalogs.

When converting from a FKPCatalog to a FKPCatalogMesh, the name of the \(n'_g(z)\) column should be passed as the nbar keyword to the to_mesh() function. The \(n'_g(z)\) column should have the same name in the “data” and “randoms” catalogs.

By default, the name of the nbar column is set to NZ. If this column is missing from the “data” or “randoms” catalog, an exception will be raised.

Note that the RedshiftHistogram algorithm can compute a weighted \(n(z)\) for an input catalog and may be useful if the user needs to compute \(n_g(z)\).

Important

By construction, the objects in the “randoms” catalog should follow the same redshift distribution as the “data” catalog. Often, the “randoms” will have an overall number density that is 10, 50, or 100 times the number density of the “data” catalog. Even if the “randoms” catalog has a higher number density, the nbar column in both the “data” and “randoms” catalogs should hold number density values on the same scale, corresponding to the value of \(n_g(z)\) at the redshift of the objects in the catalogs.
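If \(n_g(z)\) is not already available, one possible approach, sketched below, is to estimate it from the "randoms" with RedshiftHistogram and rescale by the data-to-randoms ratio; the sky fraction FSKY and the redshift column name Z are assumed values here:

from scipy.interpolate import InterpolatedUnivariateSpline
from nbodykit.lab import cosmology, RedshiftHistogram

cosmo = cosmology.Planck15
FSKY = 0.15  # assumed fraction of the sky covered by the survey

# the n(z) of the "randoms" catalog
zhist = RedshiftHistogram(randoms, FSKY, cosmo, redshift='Z')

# re-scale from the randoms' number density to the data's number density
alpha = 1.0 * data.csize / randoms.csize
nofz = InterpolatedUnivariateSpline(zhist.bin_centers, alpha * zhist.nbar)

# add the NZ column, evaluated at each object's redshift
data['NZ'] = nofz(data['Z'])
randoms['NZ'] = nofz(randoms['Z'])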

Using Completeness Weights

The ConvolvedFFTPower algorithm supports the use of completeness weights to account for systematic variations in the number density of the “data” and “randoms” objects. In our notation in the earlier background section, quantities marked with a prime (\('\)) include completeness weights. The weighted and unweighted number densities of the “data” and “randoms” fields are then related by:

\[ \begin{align}\begin{aligned}n'_g(\mathbf{r}) &= w_{c,g}(\mathbf{r}) \, n_g(\mathbf{r}),\\n'_s(\mathbf{r}) &= w_{c,s}(\mathbf{r}) \, n_s(\mathbf{r}).\end{aligned}\end{align} \]

Typically the weights for the random catalog are unity, but for full generality, we allow for the possibility of non-unity weights for \(n_s\) as well. For example, \(\alpha' = N'_\mathrm{gal} / N'_\mathrm{ran}\), where \(N'_\mathrm{gal} = \sum_i^{N_\mathrm{gal}} w_{c,g}^i\) and \(N'_\mathrm{ran} = \sum_i^{N_\mathrm{ran}} w_{c,s}^i\).

When converting from a FKPCatalog to a FKPCatalogMesh, the name of the completeness weight column should be passed as the comp_weight keyword to the to_mesh() function.

By default, the name of the comp_weight column is set to the default column Weight, which has a value of unity for all objects. If specifying a different name, the column should have the same name in both the “data” and “randoms” catalogs.

Using FKP Weights

Users can also specify the name of a column in the “data” and “randoms” catalogs that represents a FKP weight for each object, as given by equation (5). The FKP weights do not weight the individual number density fields as the completeness weights do, but rather they weight the combined field, \(n'_g(\vr) - \alpha' n'_s(\vr)\) (see equation (2)).

When converting from a FKPCatalog to a FKPCatalogMesh, the name of the FKP weight column should be passed as the fkp_weight keyword to the to_mesh() function.

By default, the name of the fkp_weight column is set to FKPWeight, which has a value of unity for all objects. If specifying a different name, the column should have the same name in both the “data” and “randoms” catalogs.

The ConvolvedFFTPower algorithm can also generate and use FKP weights automatically from the input nbar column if the user sets the use_fkp_weights keyword of the algorithm to True. In this case, the user must also specify the P0_FKP keyword, which gives the desired \(P_0\) value to use in equation (5).
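A short sketch of this option (the \(P_0\) value here is illustrative):

# let the algorithm generate w_FKP = 1 / (1 + n(z) P0) internally from the NZ column
r = ConvolvedFFTPower(mesh, poles=[0, 2, 4], dk=0.005,
                      use_fkp_weights=True, P0_FKP=1e4)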

The Results
The Multipoles

The multipole results are stored as the poles attribute of the initialized ConvolvedFFTPower object. This attribute is a BinnedStatistic object, which behaves like a structured numpy array and stores the measured results on a coordinate grid defined by the wavenumber bins specified by the user. See Analyzing your Results for a full tutorial on using the BinnedStatistic class.

The poles attribute stores the following variables:

  • k :

    the mean value for each k bin

  • power_L :

    complex array storing the real and imaginary components for the \(\ell=L\) multipole

  • modes :

    the number of Fourier modes averaged together in each bin

Note that measured power results for bins where modes is zero (no data points to average over) are set to NaN.

Note

The shot noise is not subtracted from any measured results. Users can access the monopole shot noise value, computed according to equation (4), in the meta-data attrs dictionary. The shot noise for multipoles with \(\ell>0\) is assumed to be zero.
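For example, a shot-noise-subtracted monopole can be formed by hand from a ConvolvedFFTPower result r:

# subtract the shot noise from the ell=0 multipole only
P0 = r.poles['power_0'].real - r.poles.attrs['shotnoise']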

The Meta-data

Several important meta-data calculations are also performed during the algorithm’s execution. This meta-data is stored in both the ConvolvedFFTPower.attrs attribute and the ConvolvedFFTPower.poles.attrs attribute.

  1. data.N, randoms.N :

    the unweighted number of data and randoms objects

  2. data.W, randoms.W :

    the weighted number of data and randoms objects, using the column specified as the completeness weights. This is given by:

    \[ \begin{align}\begin{aligned}W_\mathrm{data} &= \sum_\mathrm{data} w_c\\W_\mathrm{ran} &= \sum_\mathrm{randoms} w_c\end{aligned}\end{align} \]
  3. alpha :

    the ratio of data.W to randoms.W

  4. data.norm, randoms.norm :

    the normalization \(A\) of the power spectrum, computed from either the “data” or “randoms” catalog (they should be similar). They are given by:

    (8)\[ \begin{align}\begin{aligned}A_\mathrm{data} &= \sum_\mathrm{data} n'_g w_c w_\mathrm{FKP}^2\\A_\mathrm{ran} &= \alpha \sum_\mathrm{randoms} n'_g w_c w_\mathrm{FKP}^2\end{aligned}\end{align} \]
  5. data.shotnoise, randoms.shotnoise :

    the contributions to the monopole shot noise from the “data” and “random” catalogs. These values are given by:

    \[ \begin{align}\begin{aligned}P^\mathrm{shot}_\mathrm{data} &= A_\mathrm{ran}^{-1} \sum_\mathrm{data} (w_c w_\mathrm{FKP})^2\\P^\mathrm{shot}_\mathrm{ran} &= \alpha^2 A_\mathrm{ran}^{-1} \sum_\mathrm{randoms} (w_c w_\mathrm{FKP})^2\end{aligned}\end{align} \]
  6. shotnoise :

    the total shot noise for the monopole power spectrum, which should be subtracted from the monopole measurement in ConvolvedFFTPower.poles. This is computed as:

    \[P^\mathrm{shot} = P^\mathrm{shot}_\mathrm{data} + P^\mathrm{shot}_\mathrm{ran}\]
  7. BoxSize :

    the size of the Cartesian box used to grid the “data” and “randoms” objects on the Cartesian mesh.

From \(P_\ell(k)\) to \(P(k,\mu)\)

The ConvolvedFFTPower.to_pkmu() function allows users to rotate the measured multipoles, stored as the ConvolvedFFTPower.poles attribute, into \(P(k,\mu)\) wedges (bins in \(\mu\)). The function returns a BinnedStatistic holding the binned \(P(k,\mu)\) data.
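A short sketch, assuming r is a ConvolvedFFTPower result that measured multipoles up to \(\ell=4\) (the number of \(\mu\) wedges is illustrative):

import numpy

# rotate the measured multipoles into 5 evenly-spaced mu wedges
mu_edges = numpy.linspace(0, 1, 6)
pkmu = r.to_pkmu(mu_edges, max_ell=4)
print(pkmu)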

Saving and Loading

Results can easily be saved and loaded from disk in a reproducible manner using the ConvolvedFFTPower.save() and ConvolvedFFTPower.load() functions. The save function stores the state of the algorithm, including the meta-data in the ConvolvedFFTPower.attrs dictionary, in a JSON plaintext format.
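For example, a minimal sketch using the r object from above:

# save the result to disk and load it back later
r.save("convfftpower-example.json")
r2 = ConvolvedFFTPower.load("convfftpower-example.json")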

Common Pitfalls

Some of the more common issues users run into are:

  • The Position column is missing from the input “data” and “randoms” catalogs. The input position columns for this algorithm are assumed to be in terms of the right ascension, declination, and redshifts, and users must add the Position column holding the Cartesian coordinates explicitly to both the “data” and “randoms” catalogs.

  • Normalization issues may occur if the number density columns in the “data” and “randoms” catalogs are on different scales. Similar issues may arise if the FKP weight column uses number density values on different scales. In all cases, the number density to be used should be that of the data, denoted as \(n'_g(z)\). The algorithm will compute the power spectrum normalization from both the “data” and “randoms” catalogs (as given by equation (8)). If the values do not agree, there is likely an issue with varying number density scales, and the algorithm will raise an exception.


Power Spectrum of a Projected Field (ProjectedFFTPower)

Correlation Function Algorithms


Correlation Functions via FFT (FFTCorr)
Pair Counts for Data in a Simulation Box (SimulationBoxPairCount)
Pair Counts for Survey Data (SurveyDataPairCount)

This includes angular clustering as well.

Correlation Function for Data in a Simulation Box (SimulationBox2PCF)
Correlation Function for Survey Data (SurveyData2PCF)

This includes angular clustering as well.

3-pt Correlation Function for Data in a Simulation Box (SimulationBox3PCF)
3-pt Correlation Function for Survey Data (SurveyData3PCF)

Grouping Methods

Friends-of-Friends Group Finder (FOF)
Cylindrical Groups (CylindricalGroups)
Fiber Assignment and Collisions (FiberCollisions)

Miscellaneous

Density Estimation (KDDensity)
Computing the \(n(z)\) of Data (RedshiftHistogram)

Parallel Computation with nbodykit

nbodykit is fully parallelized and is built on top of the Message Passing Interface (MPI), using the Python bindings provided by the mpi4py library. In this section, we discuss some of the crucial ways in which nbodykit interacts with MPI. For those unfamiliar with MPI, a good place to start is the documentation for mpi4py.

Running nbodykit in parallel

Users can execute Python scripts using nbodykit in parallel using the standardized mpiexec MPI executable. If installing nbodykit through Anaconda (as described here), the mpiexec command will be available for use in your conda environment.

As an example, we can run this example script in parallel. The example script can be downloaded manually from GitHub, or using the wget utility:

# install wget via conda, if not installed
$ conda install wget

# download the example script
$ wget https://raw.githubusercontent.com/bccp/nbodykit/master/nersc/example.py

We can execute the example script with 4 MPI cores in parallel using:

# run with 4 MPI workers
$ mpiexec -n 4 python example.py

This example script generates a set of simulated objects and computes the power spectrum via the FFTPower algorithm, saving the result to a JSON file nbkit_example_power.json.

The MPI calling sequence typically differs for supercomputing environments and depends on the task scheduler being used. For example, when using the Slurm manager (as on NERSC), the equivalent command to execute the example script is:

$ srun -n 4 python example.py

On a managed computing facility, it is also usually necessary to include directives that reserve the computing resources used to run the script. Refer to the ‘Submitting a Job’ section of the facility’s user guide, e.g., for NERSC Cori, NERSC Edison, or NCSA. Be aware that these user guides are not always accurate, so it is best to check with someone who uses the facility regularly.

A Primer on MPI Communication

MPI stands for Message Passing Interface, and unsurprisingly, one of its key elements is communication between processes running in parallel. The MPI standard allows parallel processes, each with its own memory, to exchange messages, so that the independent results computed by individual processes can be combined into a single result.

The MPI communicator object is responsible for managing the communication of data and messages between parallel processes. In nbodykit, we manage the current MPI communicator using the nbodykit.CurrentMPIComm class. By default, all nbodykit objects, i.e., catalog and mesh objects, use the communicator defined by this class to exchange results. For catalog and mesh objects, the communicator is stored as the comm attribute.

Users can access the current communicator by calling the get() function of the CurrentMPIComm object. For example,

from nbodykit import CurrentMPIComm
comm = CurrentMPIComm.get()

The communicator object carries a rank attribute, which provides a unique numbering of the available processes controlled by the communicator, and a size attribute giving the total number of processes within the communicator. Often, the rank attribute is used to reduce the amount of messages printed to the terminal, e.g.

if comm.rank == 0:
    print("I am Groot.")

In this case, we get only one print statement, whereas we would get as many as comm.size messages without the condition. The tutorials in the mpi4py documentation give more examples illustrating the power of MPI communicators.

For more advanced users, the current communicator can be set using the set() function of the CurrentMPIComm object. This can be useful if the default communicator (which includes all processes) is split into sub-communicators. This framework is how we implement the task-based parallelism provided by the TaskManager object (see the Task-based parallelism section below). Setting the current MPI communicator in this manner only affects the creation of objects after the comm has been set.

Data-based parallelism

When writing code with nbodykit, it’s important to keep in mind that memory is not shared across different CPUs when using MPI. This is particularly relevant when interacting with data in nbodykit. Both the catalog and mesh objects are distributed containers, meaning that the data is spread out evenly across the available processes within an MPI communicator. A single process does not have access to the full catalog or 3D mesh but merely the portion stored in its memory. A crucial benefit of the distributed nature of data in nbodykit is that we can quickly load large data sets that would otherwise not fit into the memory of a single process.

When working with several processes, a simple way to gather the full data set onto every process is to combine the numpy.concatenate() function with the allgather() method of the current MPI communicator. For example, given a catalog object, we can gather the full “Position” column to all ranks using

data = numpy.concatenate(catalog.comm.allgather(catalog['Position'].compute()), axis=0)

Important

Beware of such allgather() operations. Each process gets a full copy of the data, and the computer will quickly run out of memory if the catalog is large.
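A more memory-friendly sketch gathers the column to a single root rank instead, using nbodykit.utils.GatherArray:

from nbodykit.utils import GatherArray

# gather the full "Position" column on rank 0 only; other ranks receive None
pos = GatherArray(catalog['Position'].compute(), catalog.comm, root=0)
if catalog.comm.rank == 0:
    print(pos.shape)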

Task-based parallelism

Often, large-scale structure data analysis involves hundreds to thousands of iterations of a single, relatively inexpensive task. Examples include parameter sampling or minimization, and the calculation of clustering statistics from thousands of simulations to estimate covariance properties. nbodykit includes the TaskManager class to allow users to easily iterate over multiple tasks while using nbodykit. Users can specify the desired number of MPI ranks per task, and tasks will run in parallel, ensuring that all MPI ranks are being used.

We attempt to hide most of the MPI complexities from the user by implementing the TaskManager utility as a Python context manager. This allows users to simply paste the workflow of a single task into the context of a TaskManager to iterate through tasks in parallel.

For example, we can compute the power spectrum of a simulated catalog of particles with several different bias values using:

from nbodykit.lab import *

# the bias values to iterate over
biases = [1.0, 2.0, 3.0, 4.0]

# initialize the task manager to run the tasks
with TaskManager(cpus_per_task=2, use_all_cpus=True) as tm:

  # set up the linear power spectrum
  redshift = 0.55
  cosmo = cosmology.Planck15
  Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')

  # iterate through the bias values
  for bias in tm.iterate(biases):

    # initialize the catalog for this bias
    cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=bias)

    # compute the power spectrum
    r = FFTPower(cat, mode="2d")

    # and save
    r.save("power-" + str(bias) + ".json")

Here, we rely on the iterate() function to iterate through our tasks using a for loop. We could have also used the map() function to apply a function to each bias value (it behaves identically to the built-in map()), as sketched below.
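A sketch of the same workflow using map(), placed inside the same TaskManager context as above (the helper function name compute_power is illustrative):

def compute_power(bias):
    # same body as the for loop above
    cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=bias)
    r = FFTPower(cat, mode="2d")
    r.save("power-" + str(bias) + ".json")
    return bias

# apply the function to every bias value, distributing tasks over the workers
finished = tm.map(compute_power, biases)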

The TaskManager utility works by first splitting the available CPUs into subgroups, where the size of each subgroup is determined by the cpus_per_task argument passed to TaskManager. With these subgroups ready to execute the tasks, the manager uses one root process to distribute tasks to the subgroups of workers as they become available and stops when all of the tasks have finished. Internally, the TaskManager class splits the global MPI communicator into smaller groups, so that each subgroup of processes communicates only within itself.

Important

Users should execute all of the nbodykit-related code from within the context of the TaskManager. In particular, users should take care to ensure that all of the catalog and mesh objects are created from within the manager’s context. Otherwise, there will be issues with the MPI communication and the code is likely to stall.

The above code is available in the nbodykit source code on GitHub here. We encourage users to download the script and experiment with different cpus_per_task values and total MPI processes. For example, if we run with 3 total processes and use a single process per task, then two tasks will always be executed simultaneously (one process is reserved to distribute the tasks to the other ranks). When running with this configuration, users will see the following output:

$ mpiexec -n 3 python example-batch.py 1
rank 2: computing for bias = 1.0
rank 1: computing for bias = 2.0
rank 2: computing for bias = 3.0
rank 1: computing for bias = 4.0

Analyzing your Results

Several nbodykit algorithms compute binned clustering statistics and store the results as a BinnedStatistic object (see a list of these algorithms here). In this section, we provide an overview of some of the functionality of this class to help users better analyze their algorithm results.

The BinnedStatistic class is designed to hold data variables at fixed coordinates, i.e., a grid of \((r, \mu)\) or \((k, \mu)\) bins and is modeled after the syntax of the xarray.Dataset object.

Loading and Saving Results

A BinnedStatistic object is serialized to disk using a JSON format. The to_json() and from_json() functions can be used to save and load BinnedStatistic objects, respectively.

To start, we read two BinnedStatistic results from JSON files, one holding a 1D power measurement \(P(k)\) and one holding a 2D power measurement \(P(k,\mu)\).

[2]:
import os
from nbodykit.binned_statistic import BinnedStatistic

data_dir = os.path.join(os.path.abspath('.'), 'data')
power_1d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_1d.json'))
power_2d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_2d.json'))

Coordinate Grid

The clustering statistics are measured for fixed bins, and the BinnedStatistic class has several attributes to access the coordinate grid defined by these bins:

  • shape: the shape of the coordinate grid

  • dims: the names of each dimension of the coordinate grid

  • coords: a dictionary that gives the center bin values for each dimension of the grid

  • edges: a dictionary giving the edges of the bins for each coordinate dimension

[3]:
print("1D shape = ", power_1d.shape)
print("2D shape = ", power_2d.shape)
1D shape =  (64,)
2D shape =  (64, 5)
[4]:
print("1D dims = ", power_1d.dims)
print("2D dims = ", power_2d.dims)
1D dims =  ['k']
2D dims =  ['k', 'mu']
[5]:
print("2D edges = ", power_2d.coords)
2D edges =  {'k': array([0.00613593, 0.01840777, 0.03067962, 0.04295147, 0.05522331,
       0.06749516, 0.079767  , 0.09203884, 0.10431069, 0.11658255,
       0.1288544 , 0.14112625, 0.1533981 , 0.1656699 , 0.17794175,
       0.1902136 , 0.20248545, 0.2147573 , 0.22702915, 0.239301  ,
       0.25157285, 0.2638447 , 0.27611655, 0.2883884 , 0.30066025,
       0.3129321 , 0.32520395, 0.3374758 , 0.3497476 , 0.36201945,
       0.3742913 , 0.38656315, 0.398835  , 0.41110685, 0.4233787 ,
       0.43565055, 0.4479224 , 0.46019425, 0.4724661 , 0.48473795,
       0.4970098 , 0.5092816 , 0.52155345, 0.5338253 , 0.54609715,
       0.558369  , 0.57064085, 0.5829127 , 0.59518455, 0.6074564 ,
       0.61972825, 0.6320001 , 0.64427195, 0.6565438 , 0.6688156 ,
       0.68108745, 0.6933593 , 0.70563115, 0.717903  , 0.73017485,
       0.7424467 , 0.75471855, 0.7669904 , 0.77926225]), 'mu': array([0.1, 0.3, 0.5, 0.7, 0.9])}
[6]:
print("2D edges = ", power_2d.edges)
2D edges =  {'k': array([0.        , 0.01227185, 0.02454369, 0.03681554, 0.04908739,
       0.06135923, 0.07363108, 0.08590292, 0.09817477, 0.1104466 ,
       0.1227185 , 0.1349903 , 0.1472622 , 0.159534  , 0.1718058 ,
       0.1840777 , 0.1963495 , 0.2086214 , 0.2208932 , 0.2331651 ,
       0.2454369 , 0.2577088 , 0.2699806 , 0.2822525 , 0.2945243 ,
       0.3067962 , 0.319068  , 0.3313399 , 0.3436117 , 0.3558835 ,
       0.3681554 , 0.3804272 , 0.3926991 , 0.4049709 , 0.4172428 ,
       0.4295146 , 0.4417865 , 0.4540583 , 0.4663302 , 0.478602  ,
       0.4908739 , 0.5031457 , 0.5154175 , 0.5276894 , 0.5399612 ,
       0.5522331 , 0.5645049 , 0.5767768 , 0.5890486 , 0.6013205 ,
       0.6135923 , 0.6258642 , 0.638136  , 0.6504079 , 0.6626797 ,
       0.6749515 , 0.6872234 , 0.6994952 , 0.7117671 , 0.7240389 ,
       0.7363108 , 0.7485826 , 0.7608545 , 0.7731263 , 0.7853982 ]), 'mu': array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])}

Accessing the Data

The names of data variables stored in a BinnedStatistic are stored in the variables attribute, and the data attribute stores the arrays for each of these names in a structured array. The data for a given variable can be accessed in a dict-like fashion:

[7]:
print("1D variables = ", power_1d.variables)
print("2D variables = ", power_2d.variables)
1D variables =  ['power', 'k', 'mu', 'modes']
2D variables =  ['power', 'k', 'mu', 'modes']
[8]:
# the real component of the 1D power
Pk = power_1d['power'].real
print(type(Pk), Pk.shape, Pk.dtype)

# complex power array
Pkmu = power_2d['power']
print(type(Pkmu), Pkmu.shape, Pkmu.dtype)
<class 'numpy.ndarray'> (64,) float64
<class 'numpy.ndarray'> (64, 5) complex128

In some cases, the variable value for a given bin will be missing or invalid, which is indicated by a numpy.nan value in the data array for the given bin. The BinnedStatistic class carries a mask attribute that defines which elements of the data array have numpy.nan values.
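For example, a small sketch using the mask to keep only the bins that contain data:

# boolean array; True marks bins holding NaN (no modes averaged)
print(power_2d.mask.shape)

# select the power values in bins that contain data
good_power = power_2d.data['power'][~power_2d.mask]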

Meta-data

An OrderedDict of meta-data for a BinnedStatistic class is stored in the attrs attribute. Typically in nbodykit, the attrs dictionary stores information about box size, number of objects, etc:

[9]:
print("attrs = ", power_2d.attrs)
attrs =  {'N1': 4033, 'Lx': 512.0, 'Lz': 512.0, 'N2': 4033, 'Ly': 512.0, 'volume': 134217728.0}

To attach additional meta-data to a BinnedStatistic class, the user can add additional keywords to the attrs dictionary.

Slicing

Slices of the coordinate grid of a BinnedStatistic can be achieved using array-like indexing of the main BinnedStatistic object, which will return a new BinnedStatistic holding the sliced data:

[10]:
# select the first mu bin
print(power_2d[:,0])
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
[11]:
# select the first and last mu bins
print(power_2d[:, [0, -1]])
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
[12]:
# select the first 5 k bins
print(power_1d[:5])
<BinnedStatistic: dims: (k: 5), variables: ('power', 'k', 'mu', 'modes')>

A typical usage of array-like indexing is to loop over the mu dimension of a 2D BinnedStatistic, such as when plotting:

[13]:
from matplotlib import pyplot as plt

# the shot noise is volume / number of objects
shot_noise = power_2d.attrs['volume'] / power_2d.attrs['N1']

# plot each mu bin separately
for i in range(power_2d.shape[1]):
    pk = power_2d[:,i]
    label = r"$\mu = %.1f$" % power_2d.coords['mu'][i]
    plt.loglog(pk['k'], pk['power'].real - shot_noise, label=label)

plt.legend()
plt.xlabel(r"$k$ [$h$/Mpc]", fontsize=14)
plt.ylabel(r"$P(k,\mu)$ $[\mathrm{Mpc}/h]^3$", fontsize=14)
plt.show()
_images/results_analyzing_18_0.png

The coordinate grid can also be sliced using label-based indexing, similar to the syntax of xarray.Dataset.sel(). The method keyword of sel() determines if exact coordinate matching is required (method=None, the default) or if the nearest grid coordinate should be selected automatically (method='nearest').

For example, we can slice power spectrum results based on the k and mu coordinate values:

[14]:
# get all mu bins for the k bin closest to k=0.1
print(power_2d.sel(k=0.1, method='nearest'))
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>
[15]:
# slice from k=0.01-0.1 for mu = 0.5
print(power_2d.sel(k=slice(0.01, 0.1), mu=0.5, method='nearest'))
<BinnedStatistic: dims: (k: 8), variables: ('power', 'k', 'mu', 'modes')>

We also provide a squeeze() function which behaves similar to the numpy.squeeze() function:

[16]:
# get all mu bins for the k bin closest to k=0.1, but keep k dimension
sliced = power_2d.sel(k=[0.1], method='nearest')
print(sliced)
<BinnedStatistic: dims: (k: 1, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
[17]:
# and then squeeze to remove the k dimension
print(sliced.squeeze())
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>

Note that, by default, array-based or label-based indexing will automatically “squeeze” sliced objects that have a dimension of length one, unless a list of indexers is used, as is done above.

Reindexing

It is possible to reindex a specific dimension of the coordinate grid using reindex(). The new bin spacing must be an integral multiple of the original spacing, and the variable values will be averaged together on the new coordinate grid.

[18]:
# re-index into wider k bins
print(power_2d.reindex('k', 0.02))
<BinnedStatistic: dims: (k: 32, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
  ndarray = operation(ndarray, axis=-1*(i+1))
[19]:
# re-index into wider mu bins
print(power_2d.reindex('mu', 0.4))
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
  ndarray = operation(ndarray, axis=-1*(i+1))

Note

Any variable names passed to reindex() via the fields_to_sum keyword will have their values summed, instead of averaged, when reindexing.
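For example, a sketch that keeps the mode counts additive when coarsening the k bins:

# re-index into wider k bins, summing (rather than averaging) the mode counts
coarse = power_2d.reindex('k', 0.02, fields_to_sum=['modes'])
print(coarse)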

Averaging

The average of a specific dimension can be taken using the average() function. A common use case is averaging over the mu dimension of a 2D BinnedStatistic, which is accomplished by:

[20]:
# compute P(k) from P(k,mu)
print(power_2d.average('mu'))
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
  ndarray = operation(ndarray, axis=-1*(i+1))

Saving your Results

API Reference

The full list of nbodykit modules is available here. We summarize the most important aspects of the API below.

The nbodykit lab

To make things easier for users, we import all of the classes and modules needed to do cool science into a single module:

nbodykit.lab

The nbodykit lab, containing all of the necessary ingredients to use nbodykit.

Cosmology (nbodykit.cosmology)

The main cosmology object relies on the functionality of the classylss package, which provides a binding of the CLASS CMB Boltzmann code. The syntax largely follows that used by CLASS. Below, we list the main cosmology class, as well as its attributes and methods:

Cosmology([h, T0_cmb, Omega0_b, Omega0_cdm, …])

A cosmology calculator based on the CLASS binding in classylss.

Attributes

Methods

There are several transfer functions available to the user:

CLASS(cosmo, redshift)

The linear matter transfer function using the CLASS Boltzmann code.

EisensteinHu(cosmo, redshift)

The linear matter transfer function using the Eisenstein & Hu (1998) fitting formula with BAO wiggles.

NoWiggleEisensteinHu(cosmo, redshift)

Linear power spectrum using the Eisenstein & Hu (1998) fitting formula without BAO wiggles.

There are several power spectrum classes:

LinearPower(cosmo, redshift[, transfer])

An object to compute the linear power spectrum and related quantities, using a transfer function from the CLASS code or the analytic Eisenstein & Hu approximation.

HalofitPower(cosmo, redshift)

Nonlinear power spectrum computed using HaloFit via CLASS.

ZeldovichPower(cosmo, redshift[, transfer, nmax])

The matter power spectrum in the Zel’dovich approximation.

And a correlation function class, as well as functions for transforming between power spectra and correlation functions:

CorrelationFunction(power)

Evaluate the correlation function by Fourier transforming a power spectrum object, with automatic re-scaling with redshift and sigma8.

xi_to_pk(r, xi[, ell, extrap])

Return a callable function returning the power spectrum multipole of degree \(\ell\), as computed from the Fourier transform of the input \(r\) and \(\xi_\ell(r)\) arrays.

pk_to_xi(k, Pk[, ell, extrap])

Return a callable function returning the correlation function multipole of degree \(\ell\), as computed from the Fourier transform of the input \(k\) and \(P_\ell(k)\) arrays.

We also have a class for computing LPT background calculations:

PerturbationGrowth(*args, **kwargs)

The built-in cosmologies are:

Name     | Source                          | \(H_0\) | \(\Omega_{m,0}\) | Flat
---------|---------------------------------|---------|------------------|-----
WMAP5    | Komatsu et al. 2009             | 70.2    | 0.277            | Yes
WMAP7    | Komatsu et al. 2011             | 70.4    | 0.272            | Yes
WMAP9    | Hinshaw et al. 2013             | 69.3    | 0.287            | Yes
Planck13 | Planck Collab 2013, Paper XVI   | 67.8    | 0.307            | Yes
Planck15 | Planck Collab 2015, Paper XIII  | 67.7    | 0.307            | Yes

Transforming Catalog Data (nbodykit.transform)

ConcatenateSources(*sources, **kwargs)

Concatenate CatalogSource objects together, optionally including only certain columns in the returned source.

StackColumns(*cols)

Stack the input dask arrays vertically, column by column.

ConstantArray(value, size[, chunks])

Return a dask array of the specified size holding a single value.

SkyToUnitSphere(ra, dec[, degrees, frame])

Convert sky coordinates (ra, dec) to Cartesian coordinates on the unit sphere.

SkyToCartesian(ra, dec, redshift, cosmo[, …])

Convert sky coordinates (ra, dec, redshift) to a Cartesian Position column.

HaloConcentration(mass, cosmo, redshift[, mdef])

Return halo concentration from halo mass, based on the analytic fitting formulas presented in Dutton and Maccio 2014.

HaloRadius(mass, cosmo, redshift[, mdef])

Return proper halo radius from halo mass, based on the specified mass definition.

Data Sources

Discrete Objects

Base class:

CatalogSource(comm)

An abstract base class representing a catalog of discrete particles.

And subclasses:

CSVCatalog(*args, **kwargs)

A CatalogSource that uses CSVFile to read data from disk.

BinaryCatalog(*args, **kwargs)

A CatalogSource that uses BinaryFile to read data from disk.

BigFileCatalog(*args, **kwargs)

A CatalogSource that uses BigFile to read data from disk.

TPMBinaryCatalog(*args, **kwargs)

A CatalogSource that uses TPMBinaryFile to read data from disk.

HDFCatalog(*args, **kwargs)

A CatalogSource that uses HDFFile to read data from disk.

FITSCatalog(*args, **kwargs)

A CatalogSource that uses FITSFile to read data from disk.

Gadget1Catalog(*args, **kwargs)

A CatalogSource that uses Gadget1File to read data from disk.

ArrayCatalog(data[, comm])

A CatalogSource initialized from an in-memory dict, structured numpy.ndarray, or astropy.table.Table.

HaloCatalog(source, cosmo, redshift[, mdef, …])

A CatalogSource of objects that represent halos, which can be populated using analytic models from halotools.

LogNormalCatalog(Plin, nbar, BoxSize, Nmesh)

A CatalogSource containing biased particles that have been Poisson-sampled from a log-normal density field.

UniformCatalog(nbar, BoxSize[, seed, dtype, …])

A CatalogSource that has uniformly-distributed Position and Velocity columns.

RandomCatalog(csize[, seed, comm])

A CatalogSource that can have columns added via a collective random number generator.

FKPCatalog(data, randoms[, BoxSize, BoxPad, …])

An interface for simultaneous modeling of a data CatalogSource and a randoms CatalogSource, in the spirit of Feldman, Kaiser, and Peacock, 1994.

MultipleSpeciesCatalog(names, *species, **kwargs)

A CatalogSource interface for handling multiples species of particles.

DemoHaloCatalog(simname, halo_finder, redshift)

Create a demo catalog of halos using one of the built-in halotools catalogs.

Interpolating Objects to a Mesh

For simple object catalogs:

CatalogMesh(source, Nmesh, BoxSize, Position)

A mesh generated by resampling a Catalog with the given parameters.

And Multiple Species

MultipleSpeciesCatalogMesh(source, Nmesh, …)

A subclass of CatalogMesh designed to paint the density field from a sum of multiple types of particles.

And Survey Catalogs with a window / mask:

FKPCatalogMesh(source, BoxSize, BoxCenter, …)

A subclass of MultipleSpeciesCatalogMesh designed to paint a FKPCatalog to a mesh.

Data Directly on a Mesh

Base class:

MeshSource(comm, Nmesh, BoxSize, dtype)

Base class for a source in the form of data on an input grid.

And subclasses:

BigFileMesh(path, dataset[, comm])

A MeshSource object that reads a mesh from disk using bigfile.

LinearMesh(Plin, BoxSize, Nmesh[, seed, …])

A MeshSource object that generates a RealField density mesh from a linear power spectrum function \(P(k)\).

FieldMesh(field)

A MeshSource initialized from an in-memory Field object, either a pmesh.pm.RealField or pmesh.pm.ComplexField.

ArrayMesh(array, BoxSize[, comm, root])

A MeshSource initialized from an in-memory numpy array.

Algorithms (nbodykit.algorithms)

Clustering Statistics

FFTPower(first, mode[, Nmesh, BoxSize, …])

Algorithm to compute the 1d or 2d power spectrum and/or multipoles in a periodic box, using a Fast Fourier Transform (FFT).

ProjectedFFTPower(first[, Nmesh, BoxSize, …])

The power spectrum of a field in a periodic box, projected over certain axes.

ConvolvedFFTPower(first, poles[, second, …])

Algorithm to compute power spectrum multipoles using FFTs for a data survey with non-trivial geometry.

FFTCorr(first, mode[, Nmesh, BoxSize, …])

Algorithm to compute the 1d or 2d correlation and/or multipoles in a periodic box, using a Fast Fourier Transform (FFT).

SimulationBoxPairCount(mode, first, edges[, …])

Count (weighted) pairs of objects in a simulation box as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using the Corrfunc package.

SurveyDataPairCount(mode, first, edges[, …])

Count (weighted) pairs of objects from a survey data catalog as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using the Corrfunc package.

SimulationBox2PCF(mode, data1, edges[, Nmu, …])

Compute the two-point correlation function for data in a simulation box as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using pair counting.

SurveyData2PCF(mode, data1, randoms1, edges)

Compute the two-point correlation function for observational survey data as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using pair counting.

SimulationBox3PCF(source, poles, edges[, …])

Compute the multipoles of the isotropic, three-point correlation function in configuration space for data in a simulation box.

SurveyData3PCF(source, poles, edges, cosmo)

Compute the multipoles of the isotropic, three-point correlation function in configuration space for observational survey data.

Grouping Methods

FOF(source, linking_length, nmin[, …])

A friends-of-friends halo finder that computes the label for each particle, denoting which halo it belongs to.

CylindricalGroups(source, rankby, rperp, rpar)

Compute groups of objects using a cylindrical grouping method.

FiberCollisions(ra, dec[, collision_radius, …])

Run an angular FOF algorithm to determine fiber collision groups from an input catalog, and then assign fibers such that the maximum number of objects receive a fiber.

Miscellaneous

KDDensity(source[, margin])

Estimate a proxy density based on the distance to the nearest neighbor.

RedshiftHistogram(source, fsky, cosmo[, …])

Compute the mean number density as a function of redshift \(n(z)\) from an input CatalogSource of particles.

Managing Multiple Tasks (TaskManager)

TaskManager(cpus_per_task[, comm, debug, …])

An MPI task manager that distributes tasks over a set of MPI processes, using a specified number of independent workers to compute each task.

TaskManager.iterate(self, tasks)

A generator that iterates through a series of tasks in parallel.

TaskManager.map(self, function, tasks)

Like the built-in map() function, apply a function to all of the values in a list and return the list of results.

Analyzing Results (BinnedStatistic)

BinnedStatistic(dims, edges, data[, …])

Lightweight class to hold statistics binned at fixed coordinates.

BinnedStatistic.from_json(filename[, key, …])

Initialize a BinnedStatistic from a JSON file.

BinnedStatistic.to_json(self, filename)

Write a BinnedStatistic to a JSON file.

BinnedStatistic.copy(self[, cls])

Return a copy of the BinnedStatistic, optionally changing the type to cls.

BinnedStatistic.rename_variable(self, …)

Rename a variable in data from old_name to new_name.

BinnedStatistic.average(self, dim, **kwargs)

Compute the average of each variable over the specified dimension.

BinnedStatistic.reindex(self, dim, spacing)

Reindex the dimension dim by averaging over multiple coordinate bins, optionally weighting by weights.

BinnedStatistic.sel(self[, method])

Return a new BinnedStatistic indexed by coordinate values along the specified dimension(s).

BinnedStatistic.squeeze(self[, dim])

Squeeze the BinnedStatistic along the specified dimension, which removes that dimension from the BinnedStatistic.

The IO Library (nbodykit.io)

Base class:

FileType

An abstract base class representing a file object.

Subclasses available from the nbodykit.io module:

BigFile(path[, exclude, header, dataset])

A file object to handle the reading of columns of data from a bigfile file.

BinaryFile(path, dtype[, offsets, …])

A file object to handle the reading of columns of data from a binary file.

CSVFile(path, names[, blocksize, dtype, …])

A file object to handle the reading of columns of data from a CSV file.

FITSFile(path[, ext])

A file object to handle the reading of FITS data using the fitsio package.

HDFFile(path[, dataset, exclude, header, root])

A file object to handle the reading of columns of data from a h5py HDF5 file.

FileStack(filetype, path, *args, **kwargs)

A file object that offers a continuous view of a stack of subclasses of FileType instances.

TPMBinaryFile(path[, precision])

Read snapshot binary files from Martin White’s TPM simulations.

Gadget1File(path[, columndefs, hdtype, ptype])

Read snapshot binary files from Volker’s Gadget 1/2/3 simulations.

Internal Nuts and Bolts

MPI Utilities

nbodykit.CurrentMPIComm

A class to facilitate getting and setting the current MPI communicator.

nbodykit.CurrentMPIComm.enable(func)

Decorator to attach the current MPI communicator to the input keyword arguments of func, via the comm keyword.

nbodykit.CurrentMPIComm.get()

Get the default current MPI communicator.

nbodykit.CurrentMPIComm.set(comm)

Set the current MPI communicator to the input value.

nbodykit.utils.GatherArray(data, comm[, root])

Gather the input data array from all ranks to the specified root.

nbodykit.utils.ScatterArray(data, comm[, …])

Scatter the input data array across all ranks, assuming data is initially only on root (and None on other ranks).

General Utilities

nbodykit.setup_logging([log_level])

Turn on logging, with the specified level.

nbodykit.set_options(**kwargs)

Set global configuration options.

nbodykit.utils.JSONEncoder(*[, skipkeys, …])

A subclass of json.JSONEncoder that can also handle numpy arrays, complex values, and astropy.units.Quantity objects.

nbodykit.utils.JSONDecoder(*args, **kwargs)

A subclass of json.JSONDecoder that can also handle numpy arrays, complex values, and astropy.units.Quantity objects.

Generating Mock Data

nbodykit.mockmaker.gaussian_complex_fields(pm, …)

Make a Gaussian realization of an overdensity field, \(\delta(x)\).

nbodykit.mockmaker.gaussian_real_fields(pm, …)

Make a Gaussian realization of an overdensity field in real-space, \(\delta(x)\).

nbodykit.mockmaker.poisson_sample_to_points(…)

Poisson sample the linear delta and displacement fields to points.

Contact and Support

Users can submit questions, feedback, bug reports, and feature requests by opening up an issue on the GitHub Issue Tracker.

If submitting a bug report or feature request, be sure to read our contributing guidelines.

Contributing Guidelines

We welcome user contributions to nbodykit, but be sure to read this guide first to make the process as smooth as possible!

Requesting a Feature

You can use GitHub issues to request features you would like to see in nbodykit.

  1. Provide a clear and detailed explanation of the feature you want to add and its use case in large-scale structure data analysis. Keep in mind that the goal is to include features that are useful for a wide set of users.

  2. If you are able, start writing some code and attempt a pull request. Be sure to read the submission guidelines. There are often several features competing for time commitments, and user pull requests are greatly appreciated!

Bug Reporting

If you think you’ve found a bug in nbodykit, follow these steps to submit a report:

  1. First, double check that your bug isn’t already fixed. Follow our instructions for setting up a local development environment and make sure you’ve installed the tip of the master branch.

  2. Search for similar issues on GitHub. Make sure to delete is:open on the issue search to find solved tickets as well. It is possible that this bug has been encountered before.

  3. Please include the versions of Python, nbodykit, and other dependency libraries if they are applicable (numpy, scipy, etc)

  4. Provide us with the logging output of the script from when the bug was encountered, including a traceback if appropriate. If at all possible, provide us with a standalone script that will reproduce the issue. Issues have a much higher chance of being resolved quickly if we can easily reproduce the bug.

  5. Take a stab at fixing the bug yourself! The runtests module used by nbodykit supports on-line debugging via the PDB interface. It also supports drop-in replacements for PDB, such as PDB++. A common debugging route is to add a regression unit test to the nbodykit test suite that fails due to the bug and then run the test in debugging mode:

    $ python run-tests.py nbodykit/path/to/your/test --pdb
    

    Pull requests for bug fixes are always welcome!

We strongly recommend following the above steps and providing as much information as possible when bugs are encountered. This will help us resolve issues faster – and get everyone back to doing more accurate science!

Setting up for Local Development

  1. Fork nbodykit on GitHub:

    $ git clone https://github.com/bccp/nbodykit.git
    $ cd nbodykit
    
  2. Install the dependencies. We recommend using the Anaconda environment manager. To create a new environment for nbodykit development, use:

    # create and activate a new conda environment
    $ conda create -n nbodykit-dev python=3
    $ source activate nbodykit-dev
    
    # install requirements
    $ conda install -c bccp --file requirements.txt --file requirements-extras.txt
    
  3. Install nbodykit in develop mode using pip:

    # install in develop mode
    $ pip install -e .
    

Opening a Pull Request

  1. Write the code implementing your bug fix or feature, making sure to use detailed commit messages.

  2. Ensure that any new code is properly documented, with docstrings following the NumPy/Scipy documentation style guide.

  3. Write tests of the new code, ensuring that it has full unit test coverage. This is a crucial step as new pull requests will not be merged if they reduce the overall test coverage of nbodykit.

  4. Run the test suite locally. From the main nbodykit directory, run:

    $ python run-tests.py --with-coverage --html-cov
    

    This will also output the test coverage statistics to build/coverage/index.html.

  5. Make sure all of the tests have passed and that the coverage statistics indicate that any new code is fully covered by the test suite.

  6. Be sure to update the changelog to indicate what was added/modified.

  7. Submit your pull request to nbodykit:master. The Travis CI build must be passing before your pull request can be merged. Additionally, the overall coverage of the test suite must not decrease for the pull request to be merged.

Contributing to the Cookbook

If you have an application of nbodykit that is concise and interesting, please consider adding it to our cookbook of recipes. We also welcome feedback and improvements for these recipes. Users can submit issues or open a pull request on the nbodykit cookbook repo on GitHub.

Cookbook recipes should be in the form of Jupyter notebooks. See the existing recipes for examples. The recipes are designed to illustrate interesting uses of nbodykit for other users to learn from.

We appreciate any and all contributions!

Changelog

0.3.11 (Unreleased)

  • #575: Fix bug in painting particles to mesh in CatalogMesh

0.3.10 (2019-02-07)

  • #555, #556: Improve dask interpolation in Catalog.save.

  • #557: Use dask’s gufunc for sky transforms

  • #558: Note the unit of HaloRadius (proper, not comoving. bite me)

  • #561: Allow setting the position column in 2PCF.

  • #563: RedshiftHistogram extrapolates to zero rather than any number.

  • #564: Fix missing compensations with Multi-species meshes.

  • #565: add keyword header to HDFCatalog for reading additional meta-data

  • #566: fix a bug sorting on float32, and add a persist method to Catalog.

  • #567: suppress redundant output in convpower.

0.3.9 (2019-01-07)

  • #544, #549: Use dask.store to save a catalog.

  • #545: Fix shotnoise estimation of weighted to_mesh() calls.

  • #546: Fix IndexError in painting during throttling.

  • #547: Update docrep

  • #548: Fix many deprecation warnings

  • #550: Documentation updates

  • #553: Fix error in TPCF module when there is only 1 bin.

0.3.8 (2018-12-29)

  • #543: Further performance improvements on catalog slicing.

  • #542: The IO module shall make sure buffer is c-contiguous before reshaping

  • #541: Allow setting the Cartesian / spherical transformation reference frame

  • #540: Allow not saving the header in Catalog.save

  • #539: Allow non-uniform redshifts in halo property transformations.

  • #538: Stop gathering catalog to a single rank in HaloCatalog

  • #537: Use numpy.sum for summing of integers.

  • #536: Fix boxsize mismatch comparison in pair counters.

  • #535: Improve working with a dask cluster.

  • #532: Improve speed of slicing of a catalog.

  • #531: Additional throttling during painting.

  • #530: Use setuptools (need to change conda-build-bccp recipe)

  • #529: Add kmax(rmax) to FFTPower, FFTCorr, ConvPower.

  • #528: Add dataset= to Catalog.save, deprecate datasets=[]

0.3.7 (2018-10-17)

  • #519: Rework the class hierarchy of CatalogMesh.

  • #526: Reduce the paint size for systems with less memory per core

  • #527: Aggregate attrs of header and the main datasets.

0.3.6 (2018-09-26)

  • #518: Rework CurrentMPIComm

  • #521: Fix OOM errors with dask >= 0.19.0

0.3.5 (2018-08-23)

  • #509: Fix auto detection of f8 type in Gadget1 file reader

  • #513: Ignore divide errors.

  • #516: Fix several bugs in the three-point function

  • #517: Improve compatibility with numpy 1.15.x’s new indexing convention.

0.3.4 (2018-06-29)

  • #495: Improve scaling of LogNormal catalog

  • #497: add a take method to BinnedStatistic

  • #498: add a compute method to the Catalog interface; CatalogMesh is no longer a Catalog

  • #500: unique binning in FFTPower and FFTCorr

  • #503: support redistributing a catalog spatially

  • #504: fix a hang in Catalog.copy

  • #505: update docrep to 0.2.3

  • #506: compatible with dask 0.18.1.

0.3.3 (2018-05-30)

  • #491: update compatibility with pandas 0.23.0 in cgm.

  • #490: write more useful weights and pairs in the paircount result.

  • #493, #494: update for deprecation in pmesh

0.3.2 (2018-05-14)

  • #475: proper normalization of the Landy-Szalay estimator; add the R1R2 option and the to_xil function

  • #487: Linear-theory counterpart of an N-body simulation (three-fluid model)

  • #486: overdecomposition in FOF

  • #483: switching to a new type in BinnedStatistic.copy()

  • #482: Fix a crash when two datasets passed into corrfunc are of different dtypes.

  • #480: BigFileCatalog now looks for the header relative to the root of the file.

  • #479: GatherArray allows root=Ellipsis (for allgather)

  • #476: Fix MeshSource.apply if MeshSource.action is overridden

  • #471: Decompose survey data to the correct bounds.

0.3.1 (2018-04-10)

  • #468: corrfunc and big-endian floating point numbers

  • #470: Add Hankel transforms for ell > 0

  • #469: Fix a regression when painting meshes modified via apply().

0.3.0 (2017-12-18)

  • #439: added updated pair counter algorithms, SurveyDataPairCount and SimulationBoxPairCount.

  • #439: added correlation function algorithms, SurveyData2PCF and SimulationBox2PCF

  • #441: add a DemoHaloCatalog for tutorials that downloads small halo catalogs using Halotools

  • #441: add the hod module with wrapper classes for Halotools models; HOD catalogs are created by calling the populate() method of a HaloCatalog

  • #445: add a global cache with fixed size for dask calculations

  • #446: fixes a FutureWarning generated by pandas

  • #447: adds PCS sampling windows

0.2.9 (2017-11-15)

  • #442: bug fix: fixes MemoryError when data is larger than memory in paint(); adds paint_chunk_size default option

  • #440: Selection, Value, Weight specified as “default” columns; default columns are not saved to disk

  • #437: bug fix: make sure to copy attributes of catalog when copy() is called

  • #436: FFT-based correlation function algorithm, FFTCorr added

  • #435: binder badge added to README and documentation for cookbook recipes

  • #433: by default, the header file will be found automatically in Bigfile

  • #429, #432: updates to documentation

  • #430: fix bug in FOF due to stricter numpy casting rules in numpy 1.13.3

  • #428: fixes bug in painting normalization when interlacing is used

  • #422: proper list of attributes/methods added to documentation of Cosmology class

  • #425: latex requirement removed from notebook.mplstyle style file

  • #423: support for Gadget 1 file format

0.2.8 (2017-10-06)

  • #398: AngularPairCount algorithm added to compute pair counts for survey data as a function of angular separation

  • #364: fix load balancing for survey pair counting algorithms

  • #415: fix sympy pickling issue

  • #409: fix periodic boundary condition issues with FOF for low number of ranks

  • #420: fix bug introduced in 0.2.7 causing selection of CatalogSources to sometimes hang

  • #420: remove dask selection optimizations, which can cause the code to crash in uncontrollable ways

  • #421: better error messaging when using deprecated __init__ syntax for Cosmology class

  • #406: add global sort and slice operations to CatalogSource objects

0.2.7 (2017-09-25)

  • #384: fix packaging bug causing notebook.mplstyle to be missing from the conda build

  • rename test driver from runtests.py to run-tests.py

  • set_options context manager added to set global configuration variables

  • #392, #403: add optimized slicing via dask when applying a boolean selection index to a CatalogSource

  • #393: CatalogMesh is implemented as a view of a CatalogSource – column set/get operations act on the underlying CatalogSource

  • ConvolvedFFTPower supports cross-correlations of 2 mesh objects originating from the same data/randoms, allowing users to apply different weighting schemes to the two meshes

  • transform.SkyToCartesion deprecated in favor of transform.SkyToCartesian

  • #386: bug fixes related to behavior of Cosmology.clone

0.2.6 (2017-08-29)

  • #379: updated Cosmology class built on classylss, a Python binding of the CLASS Boltzmann code

  • #379: LinearPower object added with CLASS or Eisenstein-Hu transfer

  • #379: ZeldovichPower object added to compute Zel’dovich power spectrum

  • #379: HalofitPower object added to compute nonlinear power

  • #379: CorrelationFunction object added, which FFTs power spectra to compute theoretical correlation functions

  • #379: EHPower and NoWiggleEHPower deprecated in favor of LinearPower object

0.2.5 (2017-08-25)

  • #359: CSVFile and CSVCatalog no longer fail to read the last line of data when the file does not end in a newline

  • #361: add CylindricalGroups algorithm for computing groups of objects using the cylindrical grouping method of arXiv:1611.04165

  • #355: SimulationBoxPairCount and SurveyDataPairCount classes added to perform pair counting of objects in either simulation boxes or from survey data catalogs (using Corrfunc code)

  • #370: large addition of documentation for version 0.2.x; still a work in progress

  • DataSet has been renamed to BinnedStatistic

  • calculation of dk fixed in ProjectedFFTPower

  • paint() supports an Nmesh parameter, for easier re-sampling of mesh objects

  • #368: addition of Value column for painting mesh objects; this represents the value of the field painted, i.e., unity to paint density, or velocity to paint momentum (number-weighted velocity)

  • addition of style module with matplotlib style sheet to make nice plots in our doc tutorials; this makes the docs reproducible by users

  • transform.vstack deprecated in favor of transform.StackColumns

  • transform.concatenate deprecated in favor of transform.ConcatenateSources

  • when painting catalogs to a mesh, users can specify the position column to use via the position keyword

  • #142: MultipleSpeciesCatalog object added to support painting multiple species of particles to the same mesh, i.e., baryons and dark matter particles in hydro simulations

  • CatalogMeshSource renamed to CatalogMesh internally

  • can now delete a column from a CatalogSource

  • can now slice a CatalogSource using a list of column names

  • #373: fix bug in ConstantArray when length is 1

0.2.4 (2017-06-18)

  • #339: transform.StackColumns renamed to vstack

  • #339: transform.concatenate function added, which takes a list of source objects, and returns a new Source that has the concatenation of all data

  • #345: fix compatibility with halotools version 0.5

  • #346: ability to resample a MemoryMesh object

  • #344: bug fixes related to calculation of growth rate in cosmology module

  • #347: ArrayCatalog can now be initialized from a dictionary or structured array

  • #348: add a ProjectedFFTPower algorithm, that computes the FFT Power, but can project over certain axes, i.e., projected axes have their power averaged over

  • #353: FITSCatalog added to the io module, for reading FITS files

  • #352: KDDensity to quickly estimate local density in dense regions.

  • #352: FOF also identifies Peak position and velocity.

0.2.3 (2017-05-19)

  • use of resampler keyword in the paint function for compatibility with pmesh versions >= 0.1.24

  • bug fixes and code cleanup

0.2.2 (2017-04-27)

  • package maintenance updates only

0.2.1 (2017-04-26)

  • base dependencies + extras (halotools, h5py); install all dependencies via pip install nbodykit[extras]

  • meta-data calculations in FKPCatalog now account for Source selection properly

  • support for numpy int/float meta-data in JSON output files

  • Cosmology instances no longer return attributes as Quantity instances, assuming a default set of units

  • renaming of various classes/module related to the nbodykit.Source syntax

    • no more nbodykit.Source in nbodykit.lab

    • nbodykit.source.particle has been renamed to nbodykit.source.catalog

    • source objects are now catalogs – “Catalog” is appended to their class names

    • added individual catalogs for different file types in nbodykit.io, i.e., CSVCatalog, HDFCatalog, etc.

  • the .apply operation is no longer in place for sources; it returns a view with the list of actions extended

  • galaxy type (central vs satellite) stored as integers in HODCatalog