
nbodykit: a massively parallel, large-scale structure toolkit¶
nbodykit is an open source project written in Python that provides a set of state-of-the-art, large-scale structure algorithms useful in the analysis of cosmological datasets from N-body simulations and observational surveys. All algorithms are massively parallel and run using the Message Passing Interface (MPI).
Driven by optimism about the future abundance and availability of large-scale computing resources, the development of nbodykit distinguishes itself from other similar software packages (e.g., nbodyshop, pynbody, yt, xi) by focusing on:
a unified treatment of simulation and observational datasets by insulating algorithms from data containers
support for a wide variety of data formats, as well as large volumes of data
the ability to reduce wall-clock time by scaling to thousands of cores
deployment and availability on large, supercomputing facilities
an interactive user interface that performs as well in a Jupyter notebook as on a supercomputing machine
Learning By Example¶
For users who wish to dive right in, an interactive environment containing our cookbook recipes is available via the BinderHub service. Just click the launch button below to get started!
See The Cookbook for descriptions of the various notebook recipes.
Getting nbodykit¶
To get up and running with your own copy of nbodykit, please follow the installation instructions. nbodykit is currently supported on macOS and Linux architectures. The recommended installation method uses the Anaconda Python distribution.
nbodykit is compatible with Python versions 2.7, 3.5, and 3.6, and the source code is publicly available at https://github.com/bccp/nbodykit.
A 1 minute introduction to nbodykit¶
To start, we initialize the nbodykit “lab”:
from nbodykit.lab import *
There are two core data structures in nbodykit: catalogs and meshes. These represent the two main ways astronomers interact with data in large-scale structure analysis. Catalogs hold information describing a set of discrete objects, storing the data in columns. nbodykit includes functionality for initializing catalogs from a variety of file formats as well as more advanced techniques for generating catalogs of simulated particles.
Below, we create a very simple catalog of uniformly distributed particles in a box of side length \(L = 1 \ h^{-1} \mathrm{Mpc}\):
catalog = UniformCatalog(nbar=100, BoxSize=1.0)
Catalogs have a fixed size and a set of columns describing the particle data. In this case, our catalog has “Position” and “Velocity” columns. Users can easily manipulate the existing column data or add new columns:
BoxSize = 2500.
catalog['Position'] *= BoxSize # re-normalize units of Position
catalog['Mass'] = 10**(numpy.random.uniform(12, 15, size=len(catalog))) # add some random mass values
We can generate a representation of the density field on a mesh using our catalog of objects. Here, we interpolate the particles onto a mesh of size \(64^3\):
mesh = catalog.to_mesh(Nmesh=64, BoxSize=BoxSize)
We can save our mesh to disk to later re-load using nbodykit:
mesh.save('mesh.bigfile')
or preview a low-resolution, 2D projection of the mesh to make sure everything looks as expected:
import matplotlib.pyplot as plt
plt.imshow(mesh.preview(axes=[0,1], Nmesh=32))
Finally, we can feed our density field mesh into one of the nbodykit algorithms. For example, below we use the FFTPower algorithm to compute the power spectrum \(P(k,\mu)\) of the density mesh using a fast Fourier transform via
result = FFTPower(mesh, mode='2d', Nmu=5)
with the measured power stored as the power attribute of the result variable. The algorithm result, meta-data, input parameters, etc. can then be saved to disk as a JSON file:
result.save("power-result.json")
It is important to remember that nbodykit is fully parallelized using MPI. This means that the above code snippets can be executed in a Jupyter notebook with only a single CPU or from a standalone Python script with an arbitrary number of MPI workers. We aim to hide as much of the parallel abstraction from users as possible. When executing in parallel, data is automatically divided amongst the available MPI workers, and each worker computes its own smaller portion of the result before these partial calculations are combined into the final result.
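For example, the snippets above could be collected into a standalone script and launched with an arbitrary number of MPI workers; a minimal sketch (the script name here is hypothetical):
$ mpirun -n 8 python my_nbodykit_script.py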
Getting Started¶
We also provide detailed overviews of the two main data containers in nbodykit, catalogs and meshes, and we walk through the necessary background information for each of the available algorithms in nbodykit. The main areas of the documentation can be broken down into the following sub-sections:
Introduction: an introduction to key nbodykit concepts and things to know
Cosmological Calculations: a guide to the cosmology-related functionality in nbodykit
Discrete Data Catalogs: a guide to dealing with catalogs of discrete data
Data on a Mesh: an overview of data on a discrete mesh
Getting Results: an introduction to the available algorithms, parallel computation, and saving/analyzing results
Getting Help¶
If you run into problems or have questions, see the Contact and Support section for how to best reach us.
Install nbodykit¶
Getting nbodykit¶
nbodykit is currently supported on macOS and Linux architectures. The
recommended installation method is to install nbodykit and its
dependencies as part of the Anaconda
Python distribution. For more advanced users, or those without an
Anaconda distribution, the software can also be installed using the pip
utility or directly from the source. In these latter cases, additional
difficulties may arise when trying to compile and install some of
nbodykit’s dependencies.
The package is available for Python versions 2.7, 3.5, and 3.6.
Installing nbodykit with Anaconda¶
The easiest installation method uses the conda
utility, as part
of the Anaconda package
manager. The distribution can be downloaded and installed for free from
https://www.continuum.io/downloads.
We have pre-built binaries for nbodykit and all of its dependencies available via Anaconda that are compatible with Linux and macOS platforms. To avoid dependency conflicts, we recommend installing nbodykit into a fresh environment. This can be achieved with the following commands:
$ conda create --name nbodykit-env python=3 # or use python=2 for python 2.7*
$ source activate nbodykit-env
$ conda install -c bccp nbodykit
Updating nbodykit and its dependencies¶
To keep nbodykit and its dependencies up to date, use
$ conda update -c bccp --all
Installing From Source for Conda-based Installs¶
Users can also install nbodykit from source from within a conda environment,
given that git
is installed. First, clone the GitHub repository,
$ git clone http://github.com/bccp/nbodykit
$ cd nbodykit
Then, install all of the required dependencies using
$ conda install -c bccp --file requirements.txt
$ conda install -c bccp --file requirements-extras.txt
Now, the desired git branch can be installed easily. By default, the master
branch is active, and the latest development state of nbodykit can be installed
using
$ pip install -e .
Note that the -e
flag installs nbodykit in “develop” mode, allowing the
installed version to reflect the latest changes in the source code. The latest
changes to the master
branch can be incorporated from GitHub via
$ git checkout master
$ git pull origin master
Installing nbodykit with pip¶
Warning
The easiest and recommended method to install nbodykit and its dependencies is using the Anaconda package. See Installing nbodykit with Anaconda for more details.
To build nbodykit from source, you will need to make sure all of the dependencies are properly installed on your system. To start, the following dependencies should be installed first:
$ pip install numpy cython mpi4py
Next, we must compile the remaining dependencies; the procedure depends on the user's machine.
Linux¶
To install nbodykit as well as all of its external dependencies on a Linux machine into the default Python installation directory:
$ pip install nbodykit[extras]
A different installation directory can be specified via the --user or --root <dir> options of the pip install command.
macOS¶
More care is required to properly build the dependencies on macOS machines.
The autotools software is required, which can be installed using the MacPorts package manager:
$ sudo port install autoconf automake libtool
Using recent versions of MacPorts, we also need to tell mpicc
to use gcc
rather than the default clang
compiler, which doesn’t compile fftw
correctly
due to the lack of openmp
support. Additionally, the LDSHARED
environment variable must be explicitly set.
In bash, the installation command is:
$ export OMPI_CC=gcc
$ export LDSHARED="mpicc -bundle -undefined dynamic_lookup -DOMPI_IMPORTS"; pip install nbodykit[extras]
This command will compile and install the dependencies of nbodykit and then install nbodykit. Again, a different installation directory can be specified via the --user or --root <dir> options of the pip install command.
nbodykit on NERSC¶
Note
This section covers using nbodykit on the computing nodes of NERSC. The computing nodes require special care because they do not work with the simple MPI provided by Anaconda.
If instead you wish to use nbodykit on the login nodes of NERSC or the Jupyter Hub services (available at https://jupyter.nersc.gov and https://jupyter-dev.nersc.gov/), you should follow the Anaconda installation instructions to install nbodykit. The login nodes and JupyterHub machines are very similar to standard computers. For more information on the JupyterHub services, see the official NERSC guide.
Development and testing of nbodykit was performed on the NERSC super-computing machines at Lawrence Berkeley National Laboratory. We maintain a daily build of the latest stable version of nbodykit on NERSC systems for Python versions 2.7, 3.5, and 3.6 and provide a tool to automatically load the appropriate environment when running jobs on either the Edison or Cori machines.
To load the latest stable version of nbodykit on NERSC, the following line should be added to the beginning of the user’s job script:
# load python 3.6 with latest stable nbodykit
# can also specify 2.7 or 3.5 here
source /usr/common/contrib/bccp/conda-activate.sh 3.6
If instead the user wishes to install the latest development version of nbodykit, the following lines should be added to the job script:
# first load python 3.6 with latest stable nbodykit
# can also specify 2.7 or 3.5 here
source /usr/common/contrib/bccp/conda-activate.sh 3.6
# overwrite nbodykit with the latest version from the tip of master
bcast-pip git+git://github.com/bccp/nbodykit.git
In the nbodykit source directory, we include an example Python script and job script for users. To run this example on NERSC, first download the necessary files:
# download the example script
$ wget https://raw.githubusercontent.com/bccp/nbodykit/master/nersc/example.py
# download the job script
$ wget https://raw.githubusercontent.com/bccp/nbodykit/master/nersc/example-job.slurm
and then if on the Cori machine, the job can be submitted using
$ sbatch -C haswell example-job.slurm
or, if on the Edison machine, use
$ sbatch example-job.slurm
The example job script simply loads the nbodykit environment and executes the Python script in parallel, in this case, using 16 CPUs.
#!/bin/bash
#SBATCH -p debug
#SBATCH -o nbkit-example
#SBATCH -n 16
# load nbodykit
source /usr/common/contrib/bccp/conda-activate.sh 3.6
# run the main nbodykit example
srun -n 16 python example.py
If successful, this will save a file nbkit_example_power.json
to the
current working directory.
A Brief Introduction¶
In this section, we provide a brief overview of the major functionality of nbodykit, as well as an introduction to some of the technical jargon needed to get up and running quickly. We try to familiarize the user with the various aspects of nbodykit needed to take full advantage of nbodykit’s computing power. This section also serves as a nice outline of the documentation, with links to more detailed descriptions included throughout.
The lab framework¶
A core design goal of nbodykit is maintaining an interactive user experience, allowing the user to quickly experiment and play around with data sets and statistics, while still leveraging the power of parallel processing when necessary. Motivated by the power of Jupyter notebooks, we adopt a “lab” framework for nbodykit, where all of the necessary data containers and algorithms can be imported from a single module:
from nbodykit.lab import *
[insert cool science here]
See the documentation for nbodykit.lab
for a full list of the
imported members in this module.
With all of the necessary tools now in hand, the user can easily load a data set, compute statistics of that data via one of the built-in algorithms, and save the results in just a few lines. The end result is a reproducible scientific result, generated from clear and concise code that flows from step to step.
Setting up logging¶
We use the logging
module throughout nbodykit to provide the user
with output as scripts progress. This is especially helpful when problems
are encountered when using nbodykit in parallel. Users can turn on logging
via the nbodykit.setup_logging()
function, optionally providing
the “debug” argument to increase the
logging level.
We typically begin our nbodykit scripts using:
from nbodykit.lab import *
from nbodykit import setup_logging
setup_logging() # log output to stdout
Parallel computation with MPI¶
The nbodykit package is fully parallelized using the Python bindings of the
Message Passing Interface (MPI) available in mpi4py. While we aim to hide most of the complexities of MPI from
the top-level user interface, it is helpful to know some basic aspects of the
MPI framework for understanding how nbodykit works to compute its results. If
you are unfamiliar with MPI, a good place to start is the documentation for
mpi4py. Briefly,
MPI allows nbodykit to use a specified number of CPUs, which work independently
to achieve a common goal and pass messages back and forth to coordinate their
work.
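As a minimal illustration of the underlying MPI concepts (plain mpi4py, not specific to nbodykit), each worker can query its rank and the total number of workers from the communicator:
from mpi4py import MPI

comm = MPI.COMM_WORLD  # the global communicator
print("this is rank %d of %d" % (comm.rank, comm.size))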
We provide a more in-depth discussion of the key MPI-related features of nbodykit in the Parallel Computation with nbodykit section. This section also includes a guide on how to execute nbodykit scripts in parallel using MPI.
Cosmology and units¶
nbodykit includes a cosmology calculator, Cosmology, as well as several built-in cosmologies, in the nbodykit.cosmology module. This class uses the CLASS CMB Boltzmann code for the majority of its cosmology calculations, via the Python binding of CLASS provided by the classylss package. As such, the syntax used in the Cosmology class largely follows that of the CLASS code.
To best interface with CLASS, and avoid unnecessary confusion, nbodykit assumes a default set of units:
distance: \(h^{-1} \ \mathrm{Mpc}\)
wavenumber: \(h \ \mathrm{Mpc}^{-1}\)
velocity: \(\mathrm{km} \ \mathrm{s}^{-1}\)
temperature: \(\mathrm{K}\)
power: \(h^{-3} \ \mathrm{Mpc}^3\)
density: \(10^{10} (h^{-1} \ M_\odot) (h^{-1} \ \mathrm{Mpc})^{-3}\)
neutrino mass: \(\mathrm{eV}\)
time: \(\mathrm{Gyr}\)
\(H_0\): \((\mathrm{km} \ \mathrm{s^{-1}}) / (h^{-1} \ \mathrm{Mpc})\)
We choose to define quantities with respect to the dimensionless Hubble parameter
\(h\) when appropriate. Users should always take care when loading data to
verify that the units follow the conventions defined here. Also, note that
when simulated data is generated by nbodykit, e.g.,
in HODCatalog
, the units of quantities
such as position and velocity will follow the above conventions.
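For example, a short sketch (using the built-in Planck15 cosmology and its comoving_distance method), illustrating that distances follow the \(h^{-1} \ \mathrm{Mpc}\) convention:
from nbodykit.lab import cosmology

cosmo = cosmology.Planck15
# comoving distance to z=0.5, returned in Mpc/h under the conventions above
print(cosmo.comoving_distance(0.5))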
The nbodykit.cosmology
module also includes functionality
for computing the theoretical linear power spectrum (using CLASS or analytic
transfer functions), correlation functions, and the Zel’dovich power
spectrum. See the Cosmological Calculations section for more details.
Interacting with data in nbodykit¶
The algorithms in nbodykit interface with user data in two main ways: “object catalogs” and “mesh fields”.
Catalogs¶
Catalogs hold columns of data for a set of discrete objects, typically galaxies. The columns typically include the three-dimensional positions of the objects, as well as properties of the object, e.g., mass, luminosity, etc. The catalog container represents the attributes of the objects as columns in the catalog. A catalog object behaves much like a structured NumPy array, with a fixed size and named data type fields, except that the data is provided by the random-access interface.
Catalog objects are subclasses of the CatalogSource
base class and live in the nbodykit.source.catalog
module.
We provide several different subclasses that are capable of loading data
from a variety of file formats on disk. We also provide catalog classes that
can generate a simulated set of particles. Users can find a more in depth
discussion of catalog data in Discrete Data Catalogs. For a full list
of available catalogs, see the API docs.
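As a brief sketch of reading catalog data from disk (the file name and column names here are hypothetical), a plain-text catalog could be loaded with the CSVCatalog subclass:
from nbodykit.lab import CSVCatalog

# a hypothetical whitespace-separated file with six columns per row
names = ['x', 'y', 'z', 'vx', 'vy', 'vz']
cat = CSVCatalog('particles.dat', names)
print(cat.columns)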
Meshes¶
The mesh container is fundamentally different from the catalog object. It stores a discrete representation of a continuous fluid field on a uniform mesh. The array values on the mesh are generated via a process referred to as “painting” in nbodykit. During the painting step, the positions of the discrete objects in a catalog are interpolated onto a uniform mesh. The fluid field on the mesh is often the density field, as sampled by the discrete galaxy positions.
Mesh objects are subclasses of the MeshSource
base class and live in the nbodykit.source.mesh
module.
We provide subclasses that are capable of loading mesh data
from disk or from a Numpy array, as well as classes that can generate simulated
meshes.
Furthermore, any catalog object can be converted to a mesh object via the to_mesh() function. This function returns a CatalogMesh object, which is a view of a CatalogSource as a MeshSource. A CatalogMesh “knows” how to generate the mesh data from the catalog data, using the mesh size and other options specified by the user in the to_mesh() function.
The Data on a Mesh section describes mesh objects in more detail. In particular, more details regarding the creation of mesh objects from catalogs can be found in Creating a Mesh. See the API docs for a full list of available meshes.
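As a short sketch of the painting step described above (re-using a uniform catalog of random particles), the mesh data can be generated explicitly by calling paint():
from nbodykit.lab import UniformCatalog

catalog = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
mesh = catalog.to_mesh(Nmesh=32)

# interpolate the particles onto the mesh; returns a distributed RealField
field = mesh.paint(mode='real')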
A component-based approach¶
The design of nbodykit focuses on a component-based approach. The components are exposed to the Python language as a set of classes and interfaces, and users can combine these components to construct complex applications. This design differs from the more commonly used alternative in cosmology software, which is a monolithic application controlled by a single configuration file (e.g., as in CLASS, CAMB, Gadget). From experience, we have found that a component-based approach offers the user greater freedom and flexibility to build complex applications with nbodykit.
In the figure above, we diagram the important interfaces and components of nbodykit. There are a few items worth highlighting in more detail:
Catalog: as discussed in the previous section, catalog objects derive from the CatalogSource class and hold information about discrete objects. Catalogs also implement a random-read interface that allows the user to access individual columns of data. The random-read nature of the column access makes use of the high throughput of a parallel file system when nbodykit is executed in parallel. However, the backend of the random-read interface does not have to be a file on disk at all. As an example, the ArrayCatalog simply converts a dictionary or a NumPy array object to a CatalogSource.
Mesh: as discussed in the previous section, mesh objects derive from the MeshSource class and store a discrete representation of a continuous quantity on a uniform mesh. These objects provide a “paintable” interface via the paint() function. Calling this function re-samples the fluid field represented by the mesh object to a distributed three-dimensional array (returning either a RealField or a ComplexField, as implemented by the pmesh package). See Dealing with Data on a Mesh for more details.
Serialization: most objects in nbodykit are serializable via a save() function. For a more in-depth discussion of serialization, see Saving your Results. Algorithm classes not only save the result of the algorithm but also the input parameters and meta-data stored in the attrs dictionary. Algorithms typically implement both a save() and a load() function, such that the algorithm result can be de-serialized into an object of the same type. For example, the result of the FFTPower algorithm can be serialized with the save() function and the algorithm re-initialized with the load() function.
The two main data containers, catalogs and meshes, can be serialized using nbodykit's intrinsic format, which relies on bigfile. The relevant functions are save() for catalogs and save() for meshes. These serialized results can later be loaded from disk by nbodykit as a BigFileCatalog or BigFileMesh object.
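A brief sketch of the save/load round trip for an algorithm result, re-using the FFTPower example from the introduction (the file name is arbitrary):
# serialize the result, then re-initialize an FFTPower object from disk
result.save('power-result.json')
loaded = FFTPower.load('power-result.json')
print(loaded.attrs)  # input parameters and meta-data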
Catalogs and dask¶
The data columns of catalog objects are stored as dask
arrays rather
than the similar, more traditional NumPy arrays. Users unfamiliar with the
dask
package should start with the On Demand IO via dask.array section of the docs.
Briefly, there are two main features to keep in mind when dealing with
dask
arrays:
1. Operations on a dask array are not evaluated immediately, as they would be for NumPy arrays, but are instead stored internally in a task graph. Thus, the usual array manipulations on dask arrays return nearly immediately.
2. A dask array can be evaluated, returning a NumPy array, via a call to the compute() function of the dask array. This operation can be time-consuming, as it evaluates all of the operations in the array's task graph.
In most situations, users should manipulate catalog columns as they would NumPy
arrays and allow the nbodykit internals to call the necessary compute()
function to get the final result. When possible, users should opt to use the
functions defined in the dask.array
module instead of the equivalent
function defined in numpy
. The dask.array
module is designed
to provide the same functionality as the numpy
package but for dask
arrays.
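A short sketch of these two points, using a uniform catalog of random particles and the dask.array module:
import dask.array as da
from nbodykit.lab import UniformCatalog

cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

# column access returns a dask array; operations on it are lazy
pos = cat['Position']
dist = da.sqrt((pos**2).sum(axis=-1))

# compute() evaluates the task graph and returns a NumPy array
print(dist.compute()[:5])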
Running your favorite algorithm¶
nbodykit aims to implement a canonical set of algorithms in the field of large-scale structure. The goal is to provide open source, state-of-the-art implementations of the most well-known algorithms used in the analysis of large-scale structure data. We have a wide and growing range of algorithms implemented so far. Briefly, nbodykit includes functionality for:
generating density fields via the painting operation
computing the power spectrum of density fields for both simulations and observational surveys.
calculating two-point and three-point correlation functions
computing groups of objects using a Friends-of-Friends method or a cylindrical radius method
generating HOD catalogs of galaxies from catalogs of dark matter halos
running quasi N-body simulations using the FastPM scheme.
For a full list of the available algorithms, see this section of the docs. We also aim to provide examples of many of the algorithms in The Cookbook.
The algorithms in nbodykit couple to data through the catalog and mesh objects described in the previous sections. Algorithms in nbodykit are implemented as Python classes. When the class is initialized, the algorithm is run, and the returned instance holds the corresponding results as attributes. The specific attributes that hold the results vary from algorithm to algorithm; we direct users to the API docs to determine the specifics for a particular algorithm. Furthermore, the algorithm result can be serialized to disk for archiving. We also ensure that the appropriate meta-data is serialized in order to sufficiently describe the input parameters for reproducibility.
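As one sketch of this pattern (the parameter values here are purely illustrative), the Friends-of-Friends grouping mentioned above follows the same initialize-then-inspect flow:
from nbodykit.lab import UniformCatalog, FOF

cat = UniformCatalog(nbar=300, BoxSize=1.0, seed=42)

# run Friends-of-Friends with a linking length of 0.2 times the mean separation
fof = FOF(cat, linking_length=0.2, nmin=5)

# the group label assigned to each particle is stored as an attribute
labels = fof.labels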
As open source software, we hope community contributions will help to maximize the utility of the nbodykit package for its users. We believe community contributions and review can help increase scientific productivity for all researchers. If your favorite algorithm isn’t yet implemented, we encourage contributions and feature requests from the community (see our contributing guidelines).
The Cookbook¶
We’ve created a cookbook of recipes for users to learn nbodykit by example. These recipes are designed to illustrate interesting and common uses of nbodykit for users to learn from. The goal is to have working examples for most of the algorithms in nbodykit, as well as some of the more common data tasks.
The recipes are provided as Jupyter notebooks. Each notebook is available for download by clicking the “Source” link in the navigation bar at the top of the page.
We welcome contributions of new recipes! See our contributing guidelines.
Questions, feedback, and contributions¶
If you’ve run in to problems with nbodykit, do not hesitate to get in touch with us. See our Contact and Support section for details on how to best contact us.
User contributions are also very welcome! Please see our contributing guidelines if you'd like to help grow the nbodykit project!
Cosmological Calculations¶
The nbodykit.cosmology
module includes functionality for
representing cosmological parameter sets and computing various
theoretical quantities prevalent in large-scale structure that depend
on the background cosmological model. The module relies on the
CLASS CMB Boltzmann code as the underlying
engine for its cosmological calculations via the Python wrapper
provided by the classylss
package. For consistency, the cosmology
syntax used in nbodykit largely follows that of the CLASS code. A useful
set of links for reference material on CLASS is available
here.
The Cosmology class¶
nbodykit provides the Cosmology
class for representing cosmological parameter sets. Given an input set of
parameters, this class uses the CLASS code to evaluate various quantities
relevant to large-scale structure. The class can compute background quantities,
such as distances, as a function of redshift, as well as linear and nonlinear
power spectra and transfer functions.
Note
For a full list of the attributes and functions available, please see the API docs.
The CLASS code supports multiple sets of input parameters, e.g., the user can specify H0 or h. For simplicity, we use the following parameters to initialize a Cosmology object:
h: the dimensionless Hubble parameter
T0_cmb: the temperature of the CMB in Kelvins
Omega0_b: the current baryon density parameter, \(\Omega_{b,0}\)
Omega0_cdm: the current cold dark matter density parameter, \(\Omega_{cdm,0}\)
N_ur: the number of ultra-relativistic (massless neutrino) species
m_ncdm: the masses (in eV) for all massive neutrino species
P_k_max: the maximum wavenumber to compute power spectrum results to, in units of \(h \mathrm{Mpc}^{-1}\)
P_z_max: the maximum redshift to compute power spectrum results to
gauge: the General Relativity gauge, either “synchronous” or “newtonian”
n_s: the tilt of the primordial power spectrum
nonlinear: whether to compute nonlinear power spectrum results via HaloFit
Any valid CLASS parameters can be additionally specified as keywords to the
Cosmology
constructor. If these additional parameters
conflict with the base set of parameters above, an exception will be raised.
For a guide to the valid input CLASS parameters, the “explanatory.ini”
parameter file in the CLASS GitHub is a good place to start.
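A short sketch of initializing a Cosmology object with a few of the parameters listed above (the values are illustrative):
from nbodykit.lab import cosmology

cosmo = cosmology.Cosmology(h=0.67, Omega0_b=0.048, Omega0_cdm=0.26,
                            n_s=0.96, m_ncdm=[0.06])
print(cosmo.Omega0_m)  # a derived attribute of the Cosmology object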
Notes and Caveats
The default configuration assumes a flat cosmology, \(\Omega_{k,0}=0\). Pass Omega0_k as a keyword to specify the desired non-flat curvature.
By default, the value for N_ur is inferred from the number of massive neutrinos using the following logic: if you have respectively 1, 2, or 3 massive neutrinos and use the default T_ncdm value (0.71611 K), which is designed to give m/omega of 93.14 eV, and you wish to have N_eff=3.046 in the early universe, then N_ur is set to 2.0328, 1.0196, or 0.00641, respectively.
For consistency of variable names, the present-day values can be passed with or without the “0” postfix, e.g., Omega0_cdm is translated to Omega_cdm, as CLASS always uses the names without “0” as input parameters.
By default, a cosmological constant (Omega0_lambda) is assumed, with its density value inferred by the curvature condition.
Non-cosmological-constant dark energy can be used by specifying the w0_fld, wa_fld, and/or Omega0_fld values.
To pass in CLASS parameters that are not valid Python argument names, use the dictionary/keyword arguments trick, e.g., Cosmology(..., **{'temperature contributions': 'y'})
Builtin Cosmologies¶
We include several builtin cosmologies for users, accessible from the nbodykit.cosmology
module:
Name | Source | \(H_0\) | \(\Omega_{m,0}\) | Flat
---|---|---|---|---
WMAP5 | Komatsu et al. 2009 | 70.2 | 0.277 | Yes
WMAP7 | Komatsu et al. 2011 | 70.4 | 0.272 | Yes
WMAP9 | Hinshaw et al. 2013 | 69.3 | 0.287 | Yes
Planck13 | Planck Collab 2013, Paper XVI | 67.8 | 0.307 | Yes
Planck15 | Planck Collab 2015, Paper XIII | 67.7 | 0.307 | Yes
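For example, the built-in Planck 2015 parameter set can be accessed directly from the cosmology module:
from nbodykit.lab import cosmology

cosmo = cosmology.Planck15
print(cosmo.h, cosmo.Omega0_m)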
Updating cosmological parameters¶
The Cosmology class is immutable. To update cosmological parameters, we provide the following functions:
clone(): returns a new Cosmology object with only the specified parameters changed; this function accepts any keyword parameters that can be passed to the Cosmology constructor
match(): returns a new Cosmology object that has been adjusted to match a specific derived parameter value. The derived parameters that can be matched include:
sigma8: re-scale the scalar amplitude A_s to match the desired sigma8
Omega0_cb: match the total density of cdm and baryons, \(\Omega_{cdm,0}+\Omega_{b,0}\)
Omega0_m: match the total density of matter-like components, \(\Omega_{cdm,0}+\Omega_{b,0}+\Omega_{ncdm,0}-\Omega_{pncdm,0}+\Omega_{dcdm,0}\)
For example, we can match a specific sigma8
value using:
[2]:
from nbodykit.lab import cosmology
cosmo = cosmology.Cosmology()
print("original sigma8 = %.4f" % cosmo.sigma8)
new_cosmo = cosmo.match(sigma8=0.82)
print("new sigma8 = %.4f" % new_cosmo.sigma8)
original sigma8 = 0.8380
new sigma8 = 0.8200
And we can clone a Cosmology
using
[3]:
cosmo = cosmo.clone(h=0.7, nonlinear=True)
Converting to a dictionary¶
The keyword parameters passed to the constructor of a Cosmology
object are not passed directly to the underlying CLASS engine, but must be edited first to be compatible with CLASS. Users can access the full set of parameters passed directly to CLASS by converting the Cosmology
object to a dict. This can be accomplished simply using:
[4]:
cosmo_dict = dict(cosmo)
print(cosmo_dict)
{'output': 'vTk dTk mPk', 'extra metric transfer functions': 'y', 'n_s': 0.9667, 'gauge': 'synchronous', 'N_ur': 2.0328, 'T_cmb': 2.7255, 'Omega_cdm': 0.26377065934278865, 'Omega_b': 0.0482754208891869, 'N_ncdm': 1, 'P_k_max_h/Mpc': 10.0, 'z_max_pk': 100.0, 'h': 0.7, 'm_ncdm': [0.06], 'non linear': 'halofit'}
Each of the parameters in this dictionary is a valid CLASS parameter. For example, we see that the CLASS parameter "non linear" has been set to "halofit", since we set the nonlinear parameter to True.
Note
To create a new Cosmology
object from a dictionary, users should use
Cosmology.from_dict(dict(c))
. Because of the differences in syntax between the Cosmology
constructor
and CLASS, the following will not work: Cosmology(**dict(c))
.
Interfacing with astropy¶
We provide some compatibility for interfacing with the astropy.cosmology module. The following functions of the Cosmology class allow users to convert between the nbodykit and astropy cosmology implementations:
to_astropy() | Initialize and return a subclass of astropy.cosmology.FLRW from the Cosmology object.
from_astropy() | Initialize and return a Cosmology object from an astropy cosmology object.
We also provide support for some of the syntax used by astropy for attributes in its cosmology classes by aliasing the astropy attribute names to the appropriate nbodykit attributes, so that astropy-style names return the corresponding nbodykit quantities.
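A minimal sketch of converting between the two implementations, assuming the to_astropy() and from_astropy() conversion functions listed above:
from astropy.cosmology import Planck15 as astropy_planck15
from nbodykit.lab import cosmology

# build an nbodykit Cosmology from an astropy cosmology, and convert back
cosmo = cosmology.Cosmology.from_astropy(astropy_planck15)
flrw = cosmo.to_astropy()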
Warning
The conversion between astropy
and nbodykit cosmology objects is not one-to-one, as the classes are
implemented using different underlying engines. For example, the neutrino calculations performed in nbodykit
using CLASS differ from those in astropy
. We have attempted to provide as much compatibility as possible,
but these differences will always remain when converting between the two implementations.
Theoretical Power Spectra¶
Linear power¶
The linear power spectrum can be computed for a given Cosmology
and redshift using the LinearPower
class. This object can compute the linear power spectrum using one of
several transfer functions:
“CLASS”: calls CLASS to evaluate the total density transfer function at the specified redshift
“EisensteinHu”: uses the analytic transfer function from Eisenstein & Hu, 1998
“NoWiggleEisensteinHu”: uses the no-BAO analytic transfer function from Eisenstein & Hu, 1998
The LinearPower
object is a callable function, taking wavenumber in units of \(h \mathrm{Mpc}^{-1}\) as input:
[5]:
import matplotlib.pyplot as plt
import numpy as np
c = cosmology.Planck15
Plin = cosmology.LinearPower(c, redshift=0., transfer='CLASS')
k = np.logspace(-3, 0, 100)
plt.loglog(k, Plin(k), c='k')
plt.xlabel(r"$k$ $[h \mathrm{Mpc}^{-1}]$")
plt.ylabel(r"$P$ $[h^{-3} \mathrm{Mpc}^{3}]$")
plt.show()

The redshift and power spectrum normalization can easily be updated by adjusting the redshift
and sigma8
attributes of the LinearPower
instance. For example,
[6]:
# original power
plt.loglog(k, Plin(k), label=r"$z=0, \sigma_8=%.2f$" % Plin.sigma8)
# update the redshift and sigma8
Plin.redshift = 0.55
Plin.sigma8 = 0.80
plt.loglog(k, Plin(k), label=r"$z=0.55, \sigma_8=0.8$")
# format the axes
plt.legend()
plt.xlabel(r"$k$ $[h \mathrm{Mpc}^{-1}]$")
plt.ylabel(r"$P$ $[h^{-3} \mathrm{Mpc}^{3}]$")
plt.show()

Nonlinear and Zel’dovich power¶
We also provide the HalofitPower
and ZeldovichPower
classes to compute the nonlinear and Zel'dovich power spectra, respectively. The nonlinear power uses the HaloFit prescription that is available in CLASS. These objects behave very similarly to the LinearPower class.
For example,
[7]:
# initialize the power objects
Plin = cosmology.LinearPower(c, redshift=0, transfer='CLASS')
Pnl = cosmology.HalofitPower(c, redshift=0)
Pzel = cosmology.ZeldovichPower(c, redshift=0)
# plot each kind
plt.loglog(k, Plin(k), label='linear')
plt.loglog(k, Pnl(k), label='nonlinear')
plt.loglog(k, Pzel(k), label="Zel'dovich")
# format the axes
plt.legend()
plt.xlabel(r"$k$ $[h \mathrm{Mpc}^{-1}]$")
plt.ylabel(r"$P$ $[h^{-3} \mathrm{Mpc}^{3}]$")
plt.show()

Correlation functions¶
nbodykit also includes functionality for computing theoretical correlation functions by computing the Fourier transform of a power spectrum. We use the mcfit package, which provides a Python version of the FFTLog algorithm. The main class to compute correlation functions is CorrelationFunction
. This class
takes a power spectrum instance (one of LinearPower
, HalofitPower
,
or ZeldovichPower
) as input. Below, we show the correlation function for each of these classes:
[8]:
# initialize the correlation objects
cf_lin = cosmology.CorrelationFunction(Plin)
cf_nl = cosmology.CorrelationFunction(Pnl)
cf_zel = cosmology.CorrelationFunction(Pzel)
# plot each kind
r = np.logspace(-1, np.log10(150), 1000)
plt.plot(r, r**2 * cf_lin(r), label='linear')
plt.plot(r, r**2 * cf_nl(r), label='nonlinear')
plt.plot(r, r**2 * cf_zel(r), label="Zel'dovich")
# format the axes
plt.legend()
plt.xlabel(r"$r$ $[h^{-1} \mathrm{Mpc}]$")
plt.ylabel(r"$r^2 \xi \ [h^{-2} \mathrm{Mpc}^2]$")
plt.show()

We also include utility functions for performing the \(P(k) \leftrightarrow \xi(r)\) transformation:
xi_to_pk() | Return a callable function returning the power spectrum multipole of degree \(\ell\), as computed from the Fourier transform of the input \(r\) and \(\xi_\ell(r)\) arrays.
pk_to_xi() | Return a callable function returning the correlation function multipole of degree \(\ell\), as computed from the Fourier transform of the input \(k\) and \(P_\ell(k)\) arrays.
These functions take discretely evaluated \((r, \xi(r))\) or \((k, P(k))\) arrays and return a function that evaluates the appropriate transformed quantity.
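A minimal sketch of this transformation (assuming the pk_to_xi helper lives in the nbodykit.cosmology.correlation module), converting a tabulated linear power spectrum into a callable correlation function:
import numpy as np
from nbodykit.lab import cosmology
from nbodykit.cosmology.correlation import pk_to_xi

# tabulate the linear power spectrum at z=0
Plin = cosmology.LinearPower(cosmology.Planck15, redshift=0)
k = np.logspace(-4, 1, 1024)

# returns a callable correlation function xi(r)
xi = pk_to_xi(k, Plin(k))
print(xi(10.0))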
The Cookbook¶
Here, we provide a set of recipes detailing a broad selection of the functionality available in nbodykit. Ranging from simple tasks to more complex work flows, we hope that these recipes help users become acclimated with nbodykit as well as illustrate the power of nbodykit for large-scale structure data analysis.
For users who wish to dive right into the examples, an interactive environment containing the cookbook recipes is available to users via the BinderHub service. Just click the launch button below to get started!
Angular Pair Counting¶
In this notebook, we explore the functionality of the AngularPairCount
algorithm, which computes the number of pairs in angular separation bins for survey-like input data, assumed to be angular sky coordinates (i.e., right ascension and declination).
Note
The data used in this notebook is not realistic – rather, we choose the simplicity of generating mock data to help users get up and running quickly. Although the end results are not cosmologically interesting, we use the mock data to help illustrate the various steps necessary to use the AngularPairCount
algorithm properly.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing Mock Data¶
In this notebook, we construct our fake “data” catalog simply by generating 10,000 objects that have uniformly distributed right ascension and declination values within a region of the sky.
[4]:
NDATA = 10000
# randomly distributed RA/DEC on the sky
data = RandomCatalog(NDATA, seed=42)
data['RA'] = data.rng.uniform(low=110, high=260)
data['DEC'] = data.rng.uniform(low=-3.6, high=60)
Adding Weights¶
The AngularPairCount algorithm also supports computing weighted pair counts. The total weight for a pair of objects is the product of the individual weights. Users can specify the name of the column they would like to treat as a weight by passing the weight keyword to the algorithm constructor.
Here, we choose randomly distributed weights between 0 and 1.
[5]:
data['Weight'] = numpy.random.random(size=len(data))
Auto-correlation Pair Counts¶
To compute the pair counts, we simply specify the edges of the angular separation bins (in degrees), as well as the names of the columns holding the sky coordinates and weight values.
[6]:
# the angular bin edges in degrees
edges = numpy.linspace(0.1, 10., 20+1) # 20 total bins
# run the algorithm
r_auto = AngularPairCount(data, edges, ra='RA', dec='DEC', weight='Weight')
[ 000000.06 ] 0: 10-30 20:54 AngularPairCount INFO using cpu grid decomposition: (1, 1, 1)
[ 000000.08 ] 0: 10-30 20:54 AngularPairCount INFO correlating 10000 x 10000 objects in total
[ 000000.08 ] 0: 10-30 20:54 AngularPairCount INFO correlating A x B = 10000 x 10000 objects (median) per rank
[ 000000.08 ] 0: 10-30 20:54 AngularPairCount INFO min A load per rank = 10000
[ 000000.08 ] 0: 10-30 20:54 AngularPairCount INFO max A load per rank = 10000
[ 000000.08 ] 0: 10-30 20:54 AngularPairCount INFO (even distribution would result in 10000 x 10000)
[ 000000.08 ] 0: 10-30 20:54 AngularPairCount INFO calling function 'Corrfunc.mocks.DDtheta_mocks.DDtheta_mocks'
[ 000000.38 ] 0: 10-30 20:54 AngularPairCount INFO 10% done
[ 000000.70 ] 0: 10-30 20:54 AngularPairCount INFO 20% done
[ 000000.98 ] 0: 10-30 20:54 AngularPairCount INFO 30% done
[ 000001.26 ] 0: 10-30 20:54 AngularPairCount INFO 40% done
[ 000001.58 ] 0: 10-30 20:54 AngularPairCount INFO 50% done
[ 000001.94 ] 0: 10-30 20:54 AngularPairCount INFO 60% done
[ 000002.22 ] 0: 10-30 20:54 AngularPairCount INFO 70% done
[ 000002.52 ] 0: 10-30 20:54 AngularPairCount INFO 80% done
[ 000002.83 ] 0: 10-30 20:54 AngularPairCount INFO 90% done
[ 000003.13 ] 0: 10-30 20:54 AngularPairCount INFO 100% done
The measured pair counts are stored in the result attribute. The number of pairs in a given angular bin is stored in the npairs field. Below, we plot the number of pairs \(DD(\theta)\) as a function of angular separation.
[7]:
pc = r_auto.result
plt.plot(pc['theta'], pc['npairs'])
# format the axes
plt.xlabel(r"$\theta$ [$\mathrm{degrees}$]")
plt.ylabel(r"$DD(\theta)$")
[7]:
<matplotlib.text.Text at 0x112e080b8>
The total weight for a given pair is the product of their individual weights, and the average total weight is computed in each angular bin. This value is stored in the weightavg
field of the result
attribute. Below, we plot the average weight as a function of angular separation.
[8]:
plt.plot(r_auto.result['theta'], r_auto.result['weightavg'])
# format the axes
plt.xlabel(r"$\theta$ [$\mathrm{degrees}$]")
plt.ylabel(r"$\mathrm{average \ weight}$")
plt.ylim(0, 1.0)
[8]:
(0, 1.0)
Here, we chose to weight our objects with random values ranging from 0 to 1. Thus, the expectation value of the product of two independent weights is 0.5 × 0.5 = 0.25, which we see reproduced in the figure above.
Cross-correlation Pair Counts¶
We can also compute the cross-correlation of pair counts between two sources. Below, we initialize a second randomly distributed catalog, and compute the pair counts between the two catalogs.
[9]:
# randomly distributed RA/DEC on the sky
data2 = RandomCatalog(NDATA, seed=84)
data2['RA'] = data2.rng.uniform(low=110, high=260, size=len(data2))
data2['DEC'] = data2.rng.uniform(low=-3.6, high=60., size=len(data2))
data2['Weight'] = numpy.random.random(size=len(data2))
[10]:
# compute the cross correlation pair counts
r_cross = AngularPairCount(data, edges, ra='RA', dec='DEC', weight='Weight', second=data2)
[ 000006.07 ] 0: 10-30 20:54 AngularPairCount INFO using cpu grid decomposition: (1, 1, 1)
[ 000006.12 ] 0: 10-30 20:54 AngularPairCount INFO correlating 10000 x 10000 objects in total
[ 000006.12 ] 0: 10-30 20:54 AngularPairCount INFO correlating A x B = 10000 x 10000 objects (median) per rank
[ 000006.12 ] 0: 10-30 20:54 AngularPairCount INFO min A load per rank = 10000
[ 000006.13 ] 0: 10-30 20:54 AngularPairCount INFO max A load per rank = 10000
[ 000006.13 ] 0: 10-30 20:54 AngularPairCount INFO (even distribution would result in 10000 x 10000)
[ 000006.14 ] 0: 10-30 20:54 AngularPairCount INFO calling function 'Corrfunc.mocks.DDtheta_mocks.DDtheta_mocks'
[ 000006.44 ] 0: 10-30 20:54 AngularPairCount INFO 10% done
[ 000006.72 ] 0: 10-30 20:54 AngularPairCount INFO 20% done
[ 000007.00 ] 0: 10-30 20:54 AngularPairCount INFO 30% done
[ 000007.29 ] 0: 10-30 20:54 AngularPairCount INFO 40% done
[ 000007.59 ] 0: 10-30 20:54 AngularPairCount INFO 50% done
[ 000007.88 ] 0: 10-30 20:54 AngularPairCount INFO 60% done
[ 000008.16 ] 0: 10-30 20:54 AngularPairCount INFO 70% done
[ 000008.45 ] 0: 10-30 20:54 AngularPairCount INFO 80% done
[ 000008.74 ] 0: 10-30 20:54 AngularPairCount INFO 90% done
[ 000009.03 ] 0: 10-30 20:54 AngularPairCount INFO 100% done
[11]:
# plot the auto and cross pair counts
pc_auto = r_auto.result
pc_cross = r_cross.result
plt.plot(pc_auto['theta'], pc_auto['npairs'], label='auto correlation')
plt.plot(pc_cross['theta'], pc_cross['npairs'], label='cross correlation')
# format the axes
plt.legend(loc=0)
plt.xlabel(r"$\theta$ [$\mathrm{degrees}$]")
plt.ylabel(r"$DD(\theta)$")
[11]:
<matplotlib.text.Text at 0x1198cff28>
The Multipoles of the BOSS DR12 Dataset¶
In this notebook, we use the ConvolvedFFTPower
algorithm to compute the monopole, quadrupole, and hexadecapole of the BOSS DR12 LOWZ South Galactic Cap (SGC) galaxy sample. This data set is described in detail in Reid et al. 2016, and the cosmological results for the BOSS DR12 data are presented in Alam et al. 2017.
Here, we measure the power spectrum multipoles for the SGC LOWZ sample due to its relatively small size, such that this notebook can be executed in reasonable time on a laptop. The analysis presented in this notebook can easily be applied to the BOSS DR12 CMASS sample and samples in both the North and South Galactic Cap regions.
The BOSS dataset includes both the “data” and the “randoms” catalogs, each of which includes FKP weights, completeness weights, and \(n(z)\) values. This notebook illustrates how to incorporate these analysis steps into the power spectrum analysis using nbodykit.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style
import os
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Getting the Data¶
The BOSS DR12 galaxy data sets are available for download on the SDSS DR12 homepage. The "data" and "randoms" catalogs for the South Galactic Cap LOWZ sample are available from the following links:
data: https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_LOWZ_South.fits.gz (32.3 MB)
randoms: https://data.sdss.org/sas/dr12/boss/lss/random0_DR12v5_LOWZ_South.fits.gz (713.3 MB)
Below, we've included a function to download these data files to a directory of your choosing. The files will only be downloaded if they have not already been downloaded.
[4]:
def print_download_progress(count, block_size, total_size):
import sys
pct_complete = float(count * block_size) / total_size
msg = "\r- Download progress: {0:.1%}".format(pct_complete)
sys.stdout.write(msg)
sys.stdout.flush()
def download_data(download_dir):
"""
Download the FITS data needed for this notebook to the specified directory.
Parameters
----------
download_dir : str
the data will be downloaded to this directory
"""
from six.moves import urllib
import shutil
import gzip
urls = ['https://data.sdss.org/sas/dr12/boss/lss/galaxy_DR12v5_LOWZ_South.fits.gz',
'https://data.sdss.org/sas/dr12/boss/lss/random0_DR12v5_LOWZ_South.fits.gz']
filenames = ['galaxy_DR12v5_LOWZ_South.fits', 'random0_DR12v5_LOWZ_South.fits']
# download both files
for i, url in enumerate(urls):
# the download path
filename = url.split('/')[-1]
file_path = os.path.join(download_dir, filename)
final_path = os.path.join(download_dir, filenames[i])
# do not re-download
if not os.path.exists(final_path):
print("Downloading %s" % url)
# Check if the download directory exists, otherwise create it.
if not os.path.exists(download_dir):
os.makedirs(download_dir)
# Download the file from the internet.
file_path, _ = urllib.request.urlretrieve(url=url,
filename=file_path,
reporthook=print_download_progress)
print()
print("Download finished. Extracting files.")
# unzip the file
with gzip.open(file_path, 'rb') as f_in, open(final_path, 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
os.remove(file_path)
print("Done.")
else:
print("Data has already been downloaded.")
[5]:
# download the data to the current directory
download_dir = "."
download_data(download_dir)
Data has already been downloaded.
Data has already been downloaded.
Loading the Data from Disk¶
The “data” and “randoms” catalogs are stored on disk as FITS objects, and we can use the FITSCatalog
object to read the data from disk.
Note that the IO operations are performed on-demand, so no data is read from disk until an algorithm requires it to be read. For more details, see the On-Demand IO section of the documentation.
Note
To specify the directory where the BOSS catalogs were downloaded, change the download_dir variable used below.
[6]:
# NOTE: change this path if you downloaded the data somewhere else!
data_path = os.path.join(download_dir, 'galaxy_DR12v5_LOWZ_South.fits')
randoms_path = os.path.join(download_dir, 'random0_DR12v5_LOWZ_South.fits')
# initialize the FITS catalog objects for data and randoms
data = FITSCatalog(data_path)
randoms = FITSCatalog(randoms_path)
[ 000001.32 ] 0: 08-03 17:15 CatalogSource INFO Extra arguments to FileType: ('./galaxy_DR12v5_LOWZ_South.fits',)
[ 000001.32 ] 0: 08-03 17:15 CatalogSource INFO Extra arguments to FileType: ('./random0_DR12v5_LOWZ_South.fits',)
We can analyze the available columns in the catalogs via the columns
attribute:
[7]:
print('data columns = ', data.columns)
data columns = ['AIRMASS', 'CAMCOL', 'COMP', 'DEC', 'DEVFLUX', 'EB_MINUS_V', 'EXPFLUX', 'EXTINCTION', 'FIBER2FLUX', 'FIBERID', 'FIELD', 'FINALN', 'FRACPSF', 'ICHUNK', 'ICOLLIDED', 'ID', 'IMAGE_DEPTH', 'IMATCH', 'INGROUP', 'IPOLY', 'ISECT', 'MJD', 'MODELFLUX', 'MULTGROUP', 'NZ', 'PLATE', 'PSFFLUX', 'PSF_FWHM', 'RA', 'RERUN', 'RUN', 'R_DEV', 'SKYFLUX', 'SPECTILE', 'Selection', 'TILE', 'Value', 'WEIGHT_CP', 'WEIGHT_FKP', 'WEIGHT_NOZ', 'WEIGHT_SEEING', 'WEIGHT_STAR', 'WEIGHT_SYSTOT', 'Weight', 'Z']
[8]:
print('randoms columns = ', randoms.columns)
randoms columns = ['AIRMASS', 'DEC', 'EB_MINUS_V', 'IMAGE_DEPTH', 'IPOLY', 'ISECT', 'NZ', 'PSF_FWHM', 'RA', 'SKYFLUX', 'Selection', 'Value', 'WEIGHT_FKP', 'Weight', 'Z', 'ZINDX']
Select the Correct Redshift Range¶
The LOWZ galaxy sample is defined over a redshift range of \(0.15 < z < 0.43\), in order to not overlap with the CMASS sample. First, we will slice our input catalogs to the proper redshift range using the Z column.
When manipulating loaded data, it is usually best to trim the catalogs via slice operations before adding any new columns or performing other data cleaning.
[9]:
ZMIN = 0.15
ZMAX = 0.43
# slice the randoms
valid = (randoms['Z'] > ZMIN)&(randoms['Z'] < ZMAX)
randoms = randoms[valid]
# slice the data
valid = (data['Z'] > ZMIN)&(data['Z'] < ZMAX)
data = data[valid]
Adding the Cartesian Coordinates¶
Both the “data” and “randoms” catalogs include positions of objects in terms of right ascension, declination, and redshift. Next, we add the Position
column to both catalogs by converting from these sky coordinates to Cartesian coordinates, using the helper function transform.SkyToCartesian
.
To convert from redshift to comoving distance, we use the fiducial DR12 BOSS cosmology, as defined in Alam et al. 2017.
[10]:
# the fiducial BOSS DR12 cosmology
cosmo = cosmology.Cosmology(h=0.676).match(Omega0_m=0.31)
# add Cartesian position column
data['Position'] = transform.SkyToCartesian(data['RA'], data['DEC'], data['Z'], cosmo=cosmo)
randoms['Position'] = transform.SkyToCartesian(randoms['RA'], randoms['DEC'], randoms['Z'], cosmo=cosmo)
Add the Completeness Weights¶
Next, we specify the completeness weights for the "data" and "randoms". By construction, there are no systematic variations in the number density of the "randoms", so the completeness weights are set to unity for all objects. For the "data" catalog, the completeness weights are computed as defined in eq. 48 of Reid et al. 2016. These weights account for systematic issues, redshift failures, and missing objects due to close pair collisions on the fiber plate.
[11]:
randoms['WEIGHT'] = 1.0
data['WEIGHT'] = data['WEIGHT_SYSTOT'] * (data['WEIGHT_NOZ'] + data['WEIGHT_CP'] - 1.0)
Computing the Multipoles¶
To compute the multipoles, first we combine the “data” and “randoms” catalogs into a single FKPCatalog
, which provides a common interface to the data in both catalogs. Then, we convert this FKPCatalog
to a mesh object, specifying the number of mesh cells per side, as well as the names of the \(n(z)\) and weight columns.
[12]:
# combine the data and randoms into a single catalog
fkp = FKPCatalog(data, randoms)
We initialize a \(256^3\) mesh to paint the density field. Most likely, users will want to increase this number on machines with enough memory in order to avoid the effects of aliasing on the measured multipoles. We set the value to \(256^3\) to ensure this notebook runs on most machines.
We also tell the mesh object the name of the \(n(z)\) column via the nbar keyword, the FKP weight column via the fkp_weight keyword, and the completeness weight column via the comp_weight keyword.
[13]:
mesh = fkp.to_mesh(Nmesh=256, nbar='NZ', fkp_weight='WEIGHT_FKP', comp_weight='WEIGHT', window='tsc')
[ 000017.34 ] 0: 08-03 17:15 FKPCatalog INFO BoxSize = [ 868. 1635. 919.]
[ 000017.34 ] 0: 08-03 17:15 FKPCatalog INFO cartesian coordinate range: [ 304.15007706 -787.61963937 -219.56693168] : [1154.5008732 815.19825493 680.73166297]
[ 000017.34 ] 0: 08-03 17:15 FKPCatalog INFO BoxCenter = [729.32547513 13.78930778 230.58236564]
Users can also pass a BoxSize
keyword to the to_mesh()
function in order to specify the size of the Cartesian box that the mesh is embedded in. By default, the maximum extent of the “randoms” catalog sets the size of the box.
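For example, an explicit box size could be requested as in the sketch below (the value is arbitrary; the notebook proceeds with the default box determined above):
mesh = fkp.to_mesh(Nmesh=256, BoxSize=1800., nbar='NZ',
                   fkp_weight='WEIGHT_FKP', comp_weight='WEIGHT', window='tsc')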
Now, we are able to compute the desired multipoles. Here, we compute the \(\ell=0,2,\) and \(4\) multipoles using a wavenumber spacing of \(k = 0.005\) \(h/\mathrm{Mpc}\). The maximum \(k\) value computed is set by the Nyquist frequency of the mesh, \(k_\mathrm{max} = k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\).
[14]:
# compute the multipoles
r = ConvolvedFFTPower(mesh, poles=[0,2,4], dk=0.005, kmin=0.)
[ 000017.36 ] 0: 08-03 17:15 ConvolvedFFTPower INFO using compensation function CompensateTSCShotnoise for source 'first'
[ 000017.36 ] 0: 08-03 17:15 ConvolvedFFTPower INFO using compensation function CompensateTSCShotnoise for source 'second'
[ 000027.32 ] 0: 08-03 17:15 CatalogMesh INFO painted 5549994 out of 5549994 objects to mesh
[ 000027.33 ] 0: 08-03 17:15 CatalogMesh INFO painted 5549994 out of 5549994 objects to mesh
[ 000027.33 ] 0: 08-03 17:15 CatalogMesh INFO mean particles per cell is 0.0776352
[ 000027.33 ] 0: 08-03 17:15 CatalogMesh INFO sum is 1.3025e+06
[ 000027.33 ] 0: 08-03 17:15 CatalogMesh INFO normalized the convention to 1 + delta
[ 000027.74 ] 0: 08-03 17:15 CatalogMesh INFO painted 113525 out of 113525 objects to mesh
[ 000027.74 ] 0: 08-03 17:15 CatalogMesh INFO painted 113525 out of 113525 objects to mesh
[ 000027.75 ] 0: 08-03 17:15 CatalogMesh INFO mean particles per cell is 0.0016465
[ 000027.75 ] 0: 08-03 17:15 CatalogMesh INFO sum is 27623.7
[ 000027.75 ] 0: 08-03 17:15 CatalogMesh INFO normalized the convention to 1 + delta
[ 000027.80 ] 0: 08-03 17:15 FKPCatalogMesh INFO field: (FKPCatalog(species=['data', 'randoms']) as CatalogMesh) painting done
[ 000027.80 ] 0: 08-03 17:15 ConvolvedFFTPower INFO tsc painting of 'first' done
[ 000028.17 ] 0: 08-03 17:15 ConvolvedFFTPower INFO ell = 0 done; 1 r2c completed
[ 000031.15 ] 0: 08-03 17:15 ConvolvedFFTPower INFO normalized power spectrum with `randoms.norm = 2.076940`
[ 000033.27 ] 0: 08-03 17:15 ConvolvedFFTPower INFO ell = 2 done; 5 r2c completed
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/algorithms/fftpower.py:616: RuntimeWarning: invalid value encountered in sqrt
xslab **= 0.5
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/meshtools.py:136: RuntimeWarning: divide by zero encountered in true_divide
return sum(self.coords(i) * los[i] for i in range(self.ndim)) / self.norm2()**0.5
[ 000038.22 ] 0: 08-03 17:15 ConvolvedFFTPower INFO ell = 4 done; 9 r2c completed
[ 000039.10 ] 0: 08-03 17:15 ConvolvedFFTPower INFO higher order multipoles computed in elapsed time 00:00:07.94
The meta-data computed during the calculation is stored in the attrs dictionary. See the documentation for more information.
[15]:
for key in r.attrs:
print("%s = %s" % (key, str(r.attrs[key])))
poles = [0, 2, 4]
dk = 0.005
kmin = 0.0
use_fkp_weights = False
P0_FKP = None
Nmesh = [256 256 256]
BoxSize = [ 868. 1635. 919.]
BoxPad = [0.02 0.02 0.02]
BoxCenter = [729.32547513 13.78930778 230.58236564]
mesh.window = tsc
mesh.interlaced = False
alpha = 0.021211734643316733
data.norm = 2.0767224
randoms.norm = 2.076939564818603
shotnoise = 3631.26854558135
randoms.N = 5549994
randoms.W = 5549994.0
randoms.num_per_cell = 0.0776351608830188
data.N = 113525
data.W = 117725.0
data.num_per_cell = 0.001646501710638404
data.ext = 1
randoms.ext = 1
The measured multipoles are stored in the poles
attribute. Below, we plot the monopole, quadrupole, and hexadecapole, making sure to subtract out the shot noise value from the monopole.
[16]:
poles = r.poles
for ell in [0, 2, 4]:
label = r'$\ell=%d$' % (ell)
P = poles['power_%d' %ell].real
if ell == 0: P = P - r.attrs['shotnoise']
plt.plot(poles['k'], poles['k']*P, label=label)
# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$k \ P_\ell$ [$h^{-2} \ \mathrm{Mpc}^2$]")
plt.xlim(0.01, 0.25)
[16]:
(0.01, 0.25)
For the LOWZ SGC sample, with \(N = 113525\) objects, the measured multipoles are noisier than for the other DR12 galaxy samples. Nonetheless, we have measured a clear monopole and quadrupole signal, with the hexadecapole remaining largely consistent with zero.
Note that the results here are measured up to the 1D Nyquist frequency, \(k_\mathrm{max} = k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\). Users can increase the Nyquist frequency and decrease the effects of aliasing on the measured power by increasing the mesh size. Using interlacing (by setting interlaced=True
) can also reduce the effects of aliasing on the measured results.
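For example, one could re-run the measurement with a finer mesh and with interlacing turned on. The sketch below assumes the FKPCatalog built earlier in this recipe is stored in a variable named fkp (the variable name is assumed here only for illustration):
# a sketch: rebuild the mesh at higher resolution with interlacing enabled to
# push the Nyquist frequency higher and reduce aliasing (names assumed)
mesh_hi = fkp.to_mesh(Nmesh=512, window='tsc', interlaced=True,
                      nbar='NZ', comp_weight='Weight', fkp_weight='FKPWeight')
r_hi = ConvolvedFFTPower(mesh_hi, poles=[0, 2, 4], dk=0.005, kmin=0.0)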
Converting from \(P_\ell(k)\) to \(P(k,\mu)\)¶
The ConvolvedFFTPower.to_pkmu
function allows users to rotate the measured multipoles, stored as the poles
attribute, into \(P(k,\mu)\) wedges. Below, we convert our measurements of \(P_0\), \(P_2\), and \(P_4\) into 3 \(\mu\) wedges.
[17]:
# use the same number of mu wedges and number of multipoles
Nmu = Nell = 3
mu_edges = numpy.linspace(0, 1, Nmu+1)
# get a BinnedStatistic holding the P(k,mu) wedges
Pkmu = r.to_pkmu(mu_edges, 4)
[18]:
# plot each mu bin
for i in range(Pkmu.shape[1]):
Pk = Pkmu[:,i] # select the ith mu bin
label = r'$\mu$=%.2f' % (Pkmu.coords['mu'][i])
plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)
# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.25)
[18]:
(0.01, 0.25)
The Power Spectrum of Survey Data¶
In this notebook, we explore the functionality of the ConvolvedFFTPower
algorithm, which computes the power spectrum multipoles \(P_\ell(k)\) for data from a survey that includes non-trivial selection effects. The output of the algorithm is the true multipoles of the data, convolved with the window function of the survey.
The input data for this algorithm is assumed to be from an observational survey, with the position coordinates specified in terms of right ascension, declination, and redshift.
Note
The data used in this notebook is not realistic – rather, we choose the simplicity of generating mock data to help users get up and running quickly. Although the end results are not cosmologically interesting, we use the mock data to help illustrate the various steps necessary to use the ConvolvedFFTPower
algorithm properly.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style
from scipy.interpolate import InterpolatedUnivariateSpline
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing Mock Data¶
We start by generating mock catalogs to mimic the “data” and “randoms” catalogs needed by the ConvolvedFFTPower
algorithm. Here, the “data” catalog usually gives the information about the galaxy objects, and the “randoms” catalog is a catalog of synthetic objects without any cosmological clustering signal. The “randoms” usually have a higher number density than the “data” and are a Monte Carlo representation of the survey volume.
In this notebook, we construct our fake “data” and “randoms” catalogs simply by generating uniformly distributed right ascension and declination values within a region of the sky, with redshifts drawn from a Gaussian distribution with \(\mu=0.5\) and \(\sigma=0.1\).
[4]:
NDATA = 50000
# initialize data and randoms catalogs
data = RandomCatalog(NDATA, seed=42)
randoms = RandomCatalog(NDATA*10, seed=84)
# add the (ra, dec, z) columns
for s in [data, randoms]:
s['z'] = s.rng.normal(loc=0.5, scale=0.1)
s['ra'] = s.rng.uniform(low=110, high=260)
s['dec'] = s.rng.uniform(low=-3.6, high=60.)
Adding the Cartesian Coordinates¶
Next, we add the Position
column to both the “data” and “randoms” by converting from sky coordinates to Cartesian coordinates, using the helper function transform.SkyToCartesian
. The redshift to comoving distance transformation requires a cosmology instance, so we first initialize our desired cosmology parameters.
[5]:
# specify our cosmology
cosmo = cosmology.Cosmology(h=0.7).match(Omega0_m=0.31)
# add Cartesian position column
data['Position'] = transform.SkyToCartesian(data['ra'], data['dec'], data['z'], cosmo=cosmo)
randoms['Position'] = transform.SkyToCartesian(randoms['ra'], randoms['dec'], randoms['z'], cosmo=cosmo)
Specifying the “data” \(n(z)\)¶
The ConvolvedFFTPower
algorithm requires the number density as a function of redshift for the “data” catalog. Here, we use the RedshiftHistogram
algorithm to compute the redshift histogram of the “randoms” catalog, and then re-normalize the number density to that of the “data” catalog.
[6]:
# the sky fraction, used to compute volume in n(z)
FSKY = 0.15 # a made-up value
# compute n(z) from the randoms
zhist = RedshiftHistogram(randoms, FSKY, cosmo, redshift='z')
# re-normalize to the total size of the data catalog
alpha = 1.0 * data.csize / randoms.csize
# add n(z) from randoms to the FKP source
nofz = InterpolatedUnivariateSpline(zhist.bin_centers, alpha*zhist.nbar)
# plot
plt.plot(zhist.bin_centers, alpha*zhist.nbar)
plt.xlabel(r"$z$", fontsize=16)
plt.ylabel(r"$n(z)$ $[h^{3} \mathrm{Mpc}^{-3}]$", fontsize=16)
[ 000001.11 ] 0: 08-03 17:16 RedshiftHistogram INFO using Scott's rule to determine optimal binning; h = 4.40e-03, N_bins = 207
[ 000001.14 ] 0: 08-03 17:16 RedshiftHistogram INFO using cosmology {'output': 'vTk dTk mPk', 'extra metric transfer functions': 'y', 'n_s': 0.9667, 'gauge': 'synchronous', 'N_ur': 2.0328, 'h': 0.7, 'T_cmb': 2.7255, 'N_ncdm': 1, 'P_k_max_h/Mpc': 10.0, 'z_max_pk': 100.0, 'Omega_b': 0.04775550899153668, 'Omega_cdm': 0.2609299279412303, 'm_ncdm': [0.06]} to compute volume in units of (Mpc/h)^3
[ 000001.14 ] 0: 08-03 17:16 RedshiftHistogram INFO sky fraction used in volume calculation: 0.1500
[6]:
Text(0,0.5,'$n(z)$ $[h^{3} \\mathrm{Mpc}^{-3}]$')
In this figure, note that the measured \(n(z)\) for the data closely resembles the input distribution we used: a Gaussian distribution with \(\mu=0.5\) and \(\sigma=0.1\).
Next, we initialize the FKPCatalog
, which combines the “data” and “randoms” catalogs into a single object. Columns are now available in the FKPCatalog
prefixed by either “data/” or “randoms/”.
[7]:
# initialize the FKP source
fkp = FKPCatalog(data, randoms)
# print out the columns
print("columns in FKPCatalog = ", fkp.columns)
columns in FKPCatalog = ['data/FKPWeight', 'data/Position', 'data/Selection', 'data/Value', 'data/Weight', 'data/dec', 'data/ra', 'data/z', 'randoms/FKPWeight', 'randoms/Position', 'randoms/Selection', 'randoms/Value', 'randoms/Weight', 'randoms/dec', 'randoms/ra', 'randoms/z']
And we add the \(n(z)\) column to both the “data” and “randoms”, using the appropriate redshift column to compute the results.
[8]:
# add the n(z) columns to the FKPCatalog
fkp['randoms/NZ'] = nofz(randoms['z'])
fkp['data/NZ'] = nofz(data['z'])
Adding FKP Weights¶
Here, we add a column FKPWeight
that gives the appropriate FKP weight for each catalog. The FKP weights are given by:
\(w_\mathrm{FKP}(\mathbf{r}) = \frac{1}{1 + n(\mathbf{r}) P_0}\)
Here, we use a value of \(P_0 = 10^4 \ h^{-3} \mathrm{Mpc}^3\).
[9]:
fkp['data/FKPWeight'] = 1.0 / (1 + fkp['data/NZ'] * 1e4)
fkp['randoms/FKPWeight'] = 1.0 / (1 + fkp['randoms/NZ'] * 1e4)
Adding Completeness Weights¶
The ConvolvedFFTPower
algorithm also supports the use of completeness weights, which weight the number density fields of the “data” and “randoms” catalogs. Here, we add random weights to both catalogs as the Weight
column.
Completeness weights change the number density field such that the weighted number density field on the mesh is equal to \(n'(\mathbf{r}) = w_c(\mathbf{r}) n(\mathbf{r})\), where \(w_c\) represents the completeness weights.
[10]:
fkp['data/Weight'] = numpy.random.random(size=data.size)
fkp['randoms/Weight'] = numpy.random.random(size=randoms.size)
Computing the Multipoles¶
To compute the multipoles, first we convert our FKPCatalog
to a mesh object, specifying the number of mesh cells per side, as well as the names of the \(n(z)\) and weight columns.
If a Cartesian box size is not specified by the user, the size will be computed from the maximum extent of the Position
column automatically.
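Alternatively, a fixed box size could be passed explicitly when converting to a mesh; a minimal sketch (the BoxSize value here is arbitrary and chosen only for illustration):
# a sketch of passing the Cartesian box size explicitly instead of letting
# nbodykit infer it from the extent of the Position column (value is arbitrary)
mesh_fixed = fkp.to_mesh(Nmesh=256, BoxSize=5000., nbar='NZ',
                         comp_weight='Weight', fkp_weight='FKPWeight')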
[11]:
mesh = fkp.to_mesh(Nmesh=256, nbar='NZ', comp_weight='Weight', fkp_weight='FKPWeight')
[ 000003.45 ] 0: 08-03 17:16 FKPCatalog INFO BoxSize = [2126. 4028. 2042.]
[ 000003.45 ] 0: 08-03 17:16 FKPCatalog INFO cartesian coordinate range: [-2114.09611013 -1983.75342661 -126.22300371] : [ -30.15508148 1964.60785817 1875.18039017]
[ 000003.45 ] 0: 08-03 17:16 FKPCatalog INFO BoxCenter = [-1072.1255958 -9.57278422 874.47869323]
Now, we are able to compute the desired multipoles. Here, we compute the \(\ell=0,2,\) and \(4\) multipoles using a wavenumber spacing of \(\Delta k = 0.005\) \(h/\mathrm{Mpc}\). The maximum \(k\) value computed is set by the Nyquist frequency of the mesh, \(k_\mathrm{max} = k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\).
[12]:
# compute the multipoles
r = ConvolvedFFTPower(mesh, poles=[0,2,4], dk=0.005, kmin=0.01)
[ 000003.59 ] 0: 08-03 17:16 ConvolvedFFTPower INFO using compensation function CompensateCICShotnoise for source 'first'
[ 000003.59 ] 0: 08-03 17:16 ConvolvedFFTPower INFO using compensation function CompensateCICShotnoise for source 'second'
[ 000004.55 ] 0: 08-03 17:16 CatalogMesh INFO painted 500000 out of 500000 objects to mesh
[ 000004.56 ] 0: 08-03 17:16 CatalogMesh INFO painted 500000 out of 500000 objects to mesh
[ 000004.56 ] 0: 08-03 17:16 CatalogMesh INFO mean particles per cell is 0.0124907
[ 000004.56 ] 0: 08-03 17:16 CatalogMesh INFO sum is 209559
[ 000004.56 ] 0: 08-03 17:16 CatalogMesh INFO normalized the convention to 1 + delta
[ 000004.64 ] 0: 08-03 17:16 CatalogMesh INFO painted 50000 out of 50000 objects to mesh
[ 000004.65 ] 0: 08-03 17:16 CatalogMesh INFO painted 50000 out of 50000 objects to mesh
[ 000004.65 ] 0: 08-03 17:16 CatalogMesh INFO mean particles per cell is 0.00125849
[ 000004.65 ] 0: 08-03 17:16 CatalogMesh INFO sum is 21114
[ 000004.65 ] 0: 08-03 17:16 CatalogMesh INFO normalized the convention to 1 + delta
[ 000004.70 ] 0: 08-03 17:16 FKPCatalogMesh INFO field: (FKPCatalog(species=['data', 'randoms']) as CatalogMesh) painting done
[ 000004.70 ] 0: 08-03 17:16 ConvolvedFFTPower INFO cic painting of 'first' done
[ 000005.07 ] 0: 08-03 17:16 ConvolvedFFTPower INFO ell = 0 done; 1 r2c completed
[ 000005.67 ] 0: 08-03 17:16 ConvolvedFFTPower INFO normalized power spectrum with `randoms.norm = 0.333066`
[ 000007.68 ] 0: 08-03 17:16 ConvolvedFFTPower INFO ell = 2 done; 5 r2c completed
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/algorithms/fftpower.py:616: RuntimeWarning: invalid value encountered in sqrt
xslab **= 0.5
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/meshtools.py:136: RuntimeWarning: divide by zero encountered in true_divide
return sum(self.coords(i) * los[i] for i in range(self.ndim)) / self.norm2()**0.5
[ 000012.58 ] 0: 08-03 17:16 ConvolvedFFTPower INFO ell = 4 done; 9 r2c completed
[ 000013.47 ] 0: 08-03 17:16 ConvolvedFFTPower INFO higher order multipoles computed in elapsed time 00:00:07.81
The meta-data computed during the calculation is stored in the attrs
dictionary. See the documentation for more information.
[13]:
for key in r.attrs:
print("%s = %s" % (key, str(r.attrs[key])))
poles = [0, 2, 4]
dk = 0.005
kmin = 0.01
use_fkp_weights = False
P0_FKP = None
Nmesh = [256 256 256]
BoxSize = [2126. 4028. 2042.]
BoxPad = [0.02 0.02 0.02]
BoxCenter = [-1072.1255958 -9.57278422 874.47869323]
mesh.window = cic
mesh.interlaced = False
alpha = 0.10068100439630336
data.norm = 0.3319168209225001
randoms.norm = 0.3330663605881116
shotnoise = 39294.08527131943
randoms.N = 500000
randoms.W = 250038.26038984343
randoms.num_per_cell = 0.012490674008300421
data.N = 50000
data.W = 25174.10319355387
data.num_per_cell = 0.0012584947277048816
data.seed = 42
randoms.seed = 84
The measured multipoles are stored in the poles
attribute. Below, we plot the monopole, quadrupole, and hexadecapole, making sure to subtract out the shot noise value from the monopole.
[14]:
poles = r.poles
for ell in [0, 2, 4]:
label = r'$\ell=%d$' % (ell)
P = poles['power_%d' %ell].real
if ell == 0: P = P - poles.attrs['shotnoise']
plt.plot(poles['k'], P, label=label)
# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P_\ell$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.3)
[14]:
(0.01, 0.3)
Note that, as expected, there is no measurable cosmological signal, since the input catalogs were simply uniformly distributed objects on the sky.
The Power Spectrum of Data in a Simulation Box¶
In this notebook, we explore the functionality of the FFTPower
algorithm, which can compute the 1D power spectrum \(P(k)\), 2D power spectrum \(P(k,\mu)\), and multipoles \(P_\ell(k)\). The algorithm is suitable for use on data sets in periodic simulation boxes, as the power spectrum is computed via a single FFT of the density mesh.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing a Log-normal Mock¶
We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-4} \ h^{3} \mathrm{Mpc}^{-3}\).
[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
b1 = 2.0
cat = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=1380., Nmesh=256, bias=b1, seed=42)
We add redshift-space distortions along the z
axis of the box by creating a new RSDPosition
column, which shifts the Position
column by the VelocityOffset
column.
[5]:
# add RSD
line_of_sight = [0,0,1]
cat['RSDPosition'] = cat['Position'] + cat['VelocityOffset'] * line_of_sight
Computing the 1D Power, \(P(k)\)¶
In this section, we compute and plot the 1D power spectrum \(P(k)\) of the log-normal mock.
We must first convert our CatalogSource
object to a MeshSource
, by setting up the mesh and specifying which interpolation kernel we wish to use. Here, we use “TSC” interpolation, and specify via compensated=True
that we wish to correct for the effects of the interpolation window in Fourier space.
[6]:
# convert to a MeshSource, using TSC interpolation on 256^3 mesh
mesh = cat.to_mesh(window='tsc', Nmesh=256, compensated=True, position='RSDPosition')
We compute the 1D power spectrum by specifying mode
as “1d”. We also choose the desired linear k
binning by specifying the bin spacing via dk
and the minimum k
value via kmin
.
[7]:
# compute the power, specifying desired linear k-binning
r = FFTPower(mesh, mode='1d', dk=0.005, kmin=0.01)
[ 000010.90 ] 0: 10-30 20:44 CatalogMesh INFO painted 787121 out of 787121 objects to mesh
[ 000010.90 ] 0: 10-30 20:44 CatalogMesh INFO mean particles per cell is 0.0469161
[ 000010.91 ] 0: 10-30 20:44 CatalogMesh INFO sum is 787120
[ 000010.91 ] 0: 10-30 20:44 CatalogMesh INFO normalized the convention to 1 + delta
[ 000011.25 ] 0: 10-30 20:44 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
The result is computed when initializing the FFTPower
class and the power spectrum results are stored as a BinnedStatistic
object in the power
attribute.
[8]:
# the result is stored at "power" attribute
Pk = r.power
print(Pk)
<BinnedStatistic: dims: (k: 115), variables: ('k', 'power', 'modes')>
The coords
attribute of the BinnedStatistic
object specifies the coordinate grid for the binned result, which in this case, is just the center values of the k
bins.
By default, the power is computed up to the 1D Nyquist frequency, defined as \(k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\); in this case, \(k_\mathrm{Nyq} = \pi \cdot 256 / 1380 \approx 0.583 \ h \ \mathrm{Mpc}^{-1}\).
[9]:
print(Pk.coords)
{'k': array([ 0.0125, 0.0175, 0.0225, 0.0275, 0.0325, 0.0375, 0.0425,
0.0475, 0.0525, 0.0575, 0.0625, 0.0675, 0.0725, 0.0775,
0.0825, 0.0875, 0.0925, 0.0975, 0.1025, 0.1075, 0.1125,
0.1175, 0.1225, 0.1275, 0.1325, 0.1375, 0.1425, 0.1475,
0.1525, 0.1575, 0.1625, 0.1675, 0.1725, 0.1775, 0.1825,
0.1875, 0.1925, 0.1975, 0.2025, 0.2075, 0.2125, 0.2175,
0.2225, 0.2275, 0.2325, 0.2375, 0.2425, 0.2475, 0.2525,
0.2575, 0.2625, 0.2675, 0.2725, 0.2775, 0.2825, 0.2875,
0.2925, 0.2975, 0.3025, 0.3075, 0.3125, 0.3175, 0.3225,
0.3275, 0.3325, 0.3375, 0.3425, 0.3475, 0.3525, 0.3575,
0.3625, 0.3675, 0.3725, 0.3775, 0.3825, 0.3875, 0.3925,
0.3975, 0.4025, 0.4075, 0.4125, 0.4175, 0.4225, 0.4275,
0.4325, 0.4375, 0.4425, 0.4475, 0.4525, 0.4575, 0.4625,
0.4675, 0.4725, 0.4775, 0.4825, 0.4875, 0.4925, 0.4975,
0.5025, 0.5075, 0.5125, 0.5175, 0.5225, 0.5275, 0.5325,
0.5375, 0.5425, 0.5475, 0.5525, 0.5575, 0.5625, 0.5675,
0.5725, 0.5775, 0.5825])}
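As a quick sanity check, one could compare the last bin center to the Nyquist frequency directly (a minimal sketch using the names defined above):
# compare the Nyquist frequency to the maximum measured k bin center
k_nyq = numpy.pi * 256 / 1380.
print("Nyquist frequency = %.4f h/Mpc" % k_nyq)                     # ~0.583
print("maximum k bin center = %.4f h/Mpc" % Pk.coords['k'].max())   # 0.5825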
The input parameters to the algorithm, as well as the meta-data computed during the calculation, are stored in the attrs
dictionary attribute. A key attribute is the Poisson shot noise, stored as the “shotnoise” key.
[10]:
# print out the meta-data
for k in Pk.attrs:
print("%s = %s" %(k, str(Pk.attrs[k])))
Nmesh = [256 256 256]
BoxSize = [ 1380. 1380. 1380.]
dk = 0.005
kmin = 0.01
Lx = 1380.0
Ly = 1380.0
Lz = 1380.0
volume = 2628072000.0
mode = 1d
los = [0, 0, 1]
Nmu = 1
poles = []
N1 = 787121
N2 = 787121
shotnoise = 3338.84116927
Now, we plot the 1D power, first subtracting out the shot noise. Note that the power is complex, so we only plot the real part.
[11]:
# plot the shot noise subtracted P(k)
plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'])
# format the axes
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[11]:
(0.01, 0.6)
Computing the 2D Power, \(P(k,\mu)\)¶
In this section, we compute and plot the 2D power spectrum \(P(k,\mu)\). For illustration, we compute results using line-of-sight directions both parallel and perpendicular to the axis along which we added redshift-space distortions.
Here, we compute \(P(k,\mu)\) where \(\mu\) is defined with respect to the z
axis (los=[0,0,1]
), using 5 \(\mu\) bins ranging from \(\mu=0\) to \(\mu=1\).
[12]:
# compute the 2D power
r = FFTPower(mesh, mode='2d', dk=0.005, kmin=0.01, Nmu=5, los=[0,0,1])
Pkmu = r.power
print(Pkmu)
[ 000014.28 ] 0: 10-30 20:44 CatalogMesh INFO painted 787121 out of 787121 objects to mesh
[ 000014.29 ] 0: 10-30 20:44 CatalogMesh INFO mean particles per cell is 0.0469161
[ 000014.29 ] 0: 10-30 20:44 CatalogMesh INFO sum is 787120
[ 000014.29 ] 0: 10-30 20:44 CatalogMesh INFO normalized the convention to 1 + delta
[ 000014.71 ] 0: 10-30 20:44 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
<BinnedStatistic: dims: (k: 115, mu: 5), variables: ('k', 'mu', 'power', 'modes')>
The coords
attribute of the result now gives the centers of both the k
and mu
bins.
[13]:
print(Pkmu.coords)
{'k': array([ 0.0125, 0.0175, 0.0225, 0.0275, 0.0325, 0.0375, 0.0425,
0.0475, 0.0525, 0.0575, 0.0625, 0.0675, 0.0725, 0.0775,
0.0825, 0.0875, 0.0925, 0.0975, 0.1025, 0.1075, 0.1125,
0.1175, 0.1225, 0.1275, 0.1325, 0.1375, 0.1425, 0.1475,
0.1525, 0.1575, 0.1625, 0.1675, 0.1725, 0.1775, 0.1825,
0.1875, 0.1925, 0.1975, 0.2025, 0.2075, 0.2125, 0.2175,
0.2225, 0.2275, 0.2325, 0.2375, 0.2425, 0.2475, 0.2525,
0.2575, 0.2625, 0.2675, 0.2725, 0.2775, 0.2825, 0.2875,
0.2925, 0.2975, 0.3025, 0.3075, 0.3125, 0.3175, 0.3225,
0.3275, 0.3325, 0.3375, 0.3425, 0.3475, 0.3525, 0.3575,
0.3625, 0.3675, 0.3725, 0.3775, 0.3825, 0.3875, 0.3925,
0.3975, 0.4025, 0.4075, 0.4125, 0.4175, 0.4225, 0.4275,
0.4325, 0.4375, 0.4425, 0.4475, 0.4525, 0.4575, 0.4625,
0.4675, 0.4725, 0.4775, 0.4825, 0.4875, 0.4925, 0.4975,
0.5025, 0.5075, 0.5125, 0.5175, 0.5225, 0.5275, 0.5325,
0.5375, 0.5425, 0.5475, 0.5525, 0.5575, 0.5625, 0.5675,
0.5725, 0.5775, 0.5825]), 'mu': array([ 0.1, 0.3, 0.5, 0.7, 0.9])}
We plot the power in each of the 5 \(\mu\) bins and see the effects of redshift-space distortions as a function of \(\mu\).
[14]:
# plot each mu bin
for i in range(Pkmu.shape[1]):
Pk = Pkmu[:,i] # select the ith mu bin
label = r'$\mu$=%.1f' % (Pkmu.coords['mu'][i])
plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)
# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[14]:
(0.01, 0.6)
Now, we specify the line-of-sight as the x
axis and again compute the 2D power.
[15]:
r = FFTPower(mesh, mode='2d', dk=0.005, kmin=0.01, Nmu=5, los=[1,0,0])
Pkmu = r.power
[ 000018.03 ] 0: 10-30 20:44 CatalogMesh INFO painted 787121 out of 787121 objects to mesh
[ 000018.03 ] 0: 10-30 20:44 CatalogMesh INFO mean particles per cell is 0.0469161
[ 000018.03 ] 0: 10-30 20:44 CatalogMesh INFO sum is 787120
[ 000018.03 ] 0: 10-30 20:44 CatalogMesh INFO normalized the convention to 1 + delta
[ 000018.45 ] 0: 10-30 20:44 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
Again we plot the 2D power for each of the \(\mu\) bins, but now that \(\mu\) is not defined along the same axis as the redshift-space distortions, we measure isotropic power as a function of \(\mu\).
[16]:
# plot each mu bin
for i in range(Pkmu.shape[1]):
Pk = Pkmu[:,i] # select the ith mu bin
label = r'$\mu$=%.1f' % (Pkmu.coords['mu'][i])
plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)
# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[16]:
(0.01, 0.6)
Computing the Multipoles, \(P_\ell(k)\)¶
In this section, we also measure the power spectrum multipoles, \(P_\ell\), which project the 2D power onto a basis defined by Legendre weights. The desired multipole numbers \(\ell\) should be specified as the poles
keyword.
[17]:
# compute the 2D power AND ell=0,2,4 multipoles
r = FFTPower(mesh, mode='2d', dk=0.005, kmin=0.01, Nmu=5, los=[0,0,1], poles=[0,2,4])
[ 000021.70 ] 0: 10-30 20:44 CatalogMesh INFO painted 787121 out of 787121 objects to mesh
[ 000021.70 ] 0: 10-30 20:44 CatalogMesh INFO mean particles per cell is 0.0469161
[ 000021.70 ] 0: 10-30 20:44 CatalogMesh INFO sum is 787120
[ 000021.70 ] 0: 10-30 20:44 CatalogMesh INFO normalized the convention to 1 + delta
[ 000022.04 ] 0: 10-30 20:44 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
Now, there is an additional attribute poles
which stores the \(P_\ell(k)\) result.
[18]:
poles = r.poles
print(poles)
print("variables = ", poles.variables)
<BinnedStatistic: dims: (k: 115), variables: 5 total>
variables = ['k', 'power_0', 'power_2', 'power_4', 'modes']
We plot each multipole, subtracting the shot noise only from the monopole \(P_0\). The multipoles are stored using the variable names “power_0”, “power_2”, etc for \(\ell=0,2,\) etc.
[19]:
for ell in [0, 2, 4]:
label = r'$\ell=%d$' % (ell)
P = poles['power_%d' %ell].real
if ell == 0: P = P - poles.attrs['shotnoise']
plt.plot(poles['k'], poles['k'] * P, label=label)
# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$k \ P_\ell$ [$h^{-2} \mathrm{Mpc}^2$]")
plt.xlim(0.01, 0.6)
[19]:
(0.01, 0.6)
Halo Occupation Distribution (HOD) Mocks¶
In this notebook, we will generate a mock halo catalog by running a FOF finder on a LogNormalCatalog
. We will then populate those halos with galaxies via the HOD technique and compare the galaxy and halo power spectra in both real and redshift space.
Note
When viewing results in this notebook, keep in mind that the FOF catalog generated from the LogNormalCatalog
object is only semi-realistic. We choose to use the LogNormalCatalog
rather than a more realistic data set due to the ease with which we can load the mock data.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Generating Halos via the FOF Algorithm¶
We’ll start by generating a mock galaxy catalog at \(z=0.55\) from a log-normal density field.
[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
b1 = 2.0
input_cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=b1, seed=42)
And then, we run a friends-of-friends (FOF) algorithm on the mock galaxies to identify halos, using a linking length of 0.2 times the mean separation between objects. Here, we only keep halos with at least 5 particle members.
[5]:
fof = FOF(input_cat, linking_length=0.2, nmin=5)
Now, we set the particle mass as \(10^{12} \ M_\odot/h\) and create a HaloCatalog
using the Planck 2015 parameters at a redshift of \(z = 0.55\). Here, ‘vir’ specifies that we wish to create the Mass
column as the virial mass. This mass definition is necessary for computing the analytic concentration, needed when generating the HOD galaxies.
[6]:
halos = fof.to_halos(cosmo=cosmo, redshift=redshift, particle_mass=1e12, mdef='vir')
Finally, we convert the HaloCatalog
object to a format recognized by the halotools
package and initialize our HOD catalog, using the default HOD parameters.
[7]:
halotools_halos = halos.to_halotools()
hod = HODCatalog(halotools_halos)
[ 000048.48 ] 0: 10-30 20:53 HOD INFO satellite fraction: 0.05
[ 000048.48 ] 0: 10-30 20:53 HOD INFO mean log10 halo mass: 13.10
[ 000048.49 ] 0: 10-30 20:53 HOD INFO std log10 halo mass: 0.31
Using the gal_type
column, we can create subsets of the HOD catalog that contain only centrals and only satellites.
[8]:
cen_idx = hod['gal_type'] == 0
sat_idx = hod['gal_type'] == 1
cens = hod[cen_idx]
sats = hod[sat_idx]
Real-space Power Spectra¶
In this section, we compute the real-space power spectrum of the FOF halos and the HOD galaxies and compare.
[9]:
# compute P(k) for the FOF halos
halo_power = FFTPower(halos, mode='1d', Nmesh=512)
[ 000049.20 ] 0: 10-30 20:53 CatalogMesh INFO painted 24720 out of 24720 objects to mesh
[ 000049.20 ] 0: 10-30 20:53 CatalogMesh INFO mean particles per cell is 0.000184178
[ 000049.20 ] 0: 10-30 20:53 CatalogMesh INFO sum is 24720
[ 000049.20 ] 0: 10-30 20:53 CatalogMesh INFO normalized the convention to 1 + delta
[ 000054.30 ] 0: 10-30 20:53 CatalogMesh INFO field: (HaloCatalog(size=24720) as CatalogMesh) painting done
And we can compute P(k) for all galaxies, centrals only, satellites only, and the cross power between centrals and satellites.
[10]:
gal_power = FFTPower(hod, mode='1d', Nmesh=256)
cen_power = FFTPower(cens, mode='1d', Nmesh=256)
sat_power = FFTPower(sats, mode='1d', Nmesh=256)
cen_sat_power = FFTPower(cens, second=sats, mode='1d', Nmesh=256)
[ 000063.02 ] 0: 10-30 20:53 CatalogMesh INFO painted 9551 out of 9551 objects to mesh
[ 000063.02 ] 0: 10-30 20:53 CatalogMesh INFO mean particles per cell is 0.000569284
[ 000063.02 ] 0: 10-30 20:53 CatalogMesh INFO sum is 9551
[ 000063.02 ] 0: 10-30 20:53 CatalogMesh INFO normalized the convention to 1 + delta
[ 000063.60 ] 0: 10-30 20:53 CatalogMesh INFO field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000064.60 ] 0: 10-30 20:53 CatalogMesh INFO painted 9119 out of 9119 objects to mesh
[ 000064.60 ] 0: 10-30 20:53 CatalogMesh INFO mean particles per cell is 0.000543535
[ 000064.60 ] 0: 10-30 20:53 CatalogMesh INFO sum is 9119
[ 000064.60 ] 0: 10-30 20:53 CatalogMesh INFO normalized the convention to 1 + delta
[ 000065.10 ] 0: 10-30 20:53 CatalogMesh INFO field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000066.14 ] 0: 10-30 20:53 CatalogMesh INFO painted 432 out of 432 objects to mesh
[ 000066.14 ] 0: 10-30 20:53 CatalogMesh INFO mean particles per cell is 2.57492e-05
[ 000066.15 ] 0: 10-30 20:53 CatalogMesh INFO sum is 432
[ 000066.15 ] 0: 10-30 20:53 CatalogMesh INFO normalized the convention to 1 + delta
[ 000067.08 ] 0: 10-30 20:53 CatalogMesh INFO field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000069.09 ] 0: 10-30 20:53 CatalogMesh INFO painted 9119 out of 9119 objects to mesh
[ 000069.09 ] 0: 10-30 20:53 CatalogMesh INFO mean particles per cell is 0.000543535
[ 000069.10 ] 0: 10-30 20:53 CatalogMesh INFO sum is 9119
[ 000069.10 ] 0: 10-30 20:53 CatalogMesh INFO normalized the convention to 1 + delta
[ 000069.93 ] 0: 10-30 20:53 CatalogMesh INFO field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[ 000070.06 ] 0: 10-30 20:53 CatalogMesh INFO painted 432 out of 432 objects to mesh
[ 000070.07 ] 0: 10-30 20:53 CatalogMesh INFO mean particles per cell is 2.57492e-05
[ 000070.07 ] 0: 10-30 20:53 CatalogMesh INFO sum is 432
[ 000070.07 ] 0: 10-30 20:53 CatalogMesh INFO normalized the convention to 1 + delta
[ 000070.74 ] 0: 10-30 20:53 CatalogMesh INFO field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
[11]:
# plot galaxy auto power, centrals auto power, and sats auto power
labels = [r"$P^{gg}$", r"$P^{cc}$", r"$P^{ss}$"]
for i, r in enumerate([gal_power, cen_power, sat_power]):
Pk = r.power
plt.loglog(Pk['k'], Pk['power'].real-Pk.attrs['shotnoise'], label=labels[i])
# central-satellite power
Pk = cen_sat_power.power
plt.loglog(Pk['k'], Pk['power'].real, label=r"$P^{cs}$")
# the halo power
Phalo = halo_power.power
plt.loglog(Phalo['k'], Phalo['power'].real-Phalo.attrs['shotnoise'], c='k', label=r"$P^\mathrm{halo}$")
# add a legend and axis labels
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
plt.ylim(1e4, 5e6)
[11]:
(10000.0, 5000000.0)
Here, we see the high bias of the satellite power spectrum (green). Because the satellite fraction is relatively low (\(f_\mathrm{sat} \simeq 0.04\)), the amplitude of the total galaxy power \(P^{gg}\) is only slightly larger than the centrals auto power \(P^{cc}\) – the centrals dominate the clustering signal.
Redshift-space Power Spectra¶
First, we convert HOD galaxy positions from real to redshift space.
[12]:
LOS = [0, 0, 1]
hod['RSDPosition'] = hod['Position'] + hod['VelocityOffset'] * LOS
And then compute \(P(k,\mu)\) for the full HOD galaxy catalog in five \(\mu\) bins.
[13]:
mesh = hod.to_mesh(position='RSDPosition', Nmesh=256, compensated=True)
rsd_gal_power = FFTPower(mesh, mode='2d', Nmu=5)
[ 000074.18 ] 0: 10-30 20:53 CatalogMesh INFO painted 9551 out of 9551 objects to mesh
[ 000074.18 ] 0: 10-30 20:53 CatalogMesh INFO mean particles per cell is 0.000569284
[ 000074.18 ] 0: 10-30 20:53 CatalogMesh INFO sum is 9551
[ 000074.18 ] 0: 10-30 20:53 CatalogMesh INFO normalized the convention to 1 + delta
[ 000074.77 ] 0: 10-30 20:53 CatalogMesh INFO field: (HODCatalog(logMmin=13.03, sigma_logM=0.38, alpha=0.76, logM0=13.27, logM1=14.08) as CatalogMesh) painting done
Finally, plot the galaxy \(P(k,\mu)\) and compare to the measured halo power in real space.
[14]:
pkmu = rsd_gal_power.power
# plot each mu bin
for i in range(pkmu.shape[1]):
Pk = pkmu[:,i]
label = r'$P^{gg}(\mu$=%.1f)' %pkmu.coords['mu'][i]
plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)
# the halo power
Phalo = halo_power.power
plt.loglog(Phalo['k'], Phalo['power'].real-Phalo.attrs['shotnoise'], c='k', label=r"$P^\mathrm{halo}$")
# add a legend and axis labels
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
plt.ylim(1e4, 2e6)
[14]:
(10000.0, 2000000.0)
In this figure, we see the effects of RSD on the HOD catalog, with the power spectrum damped at higher \(\mu\) on small scales (larger \(k\)).
The Effects of Interlaced Painting¶
In this notebook, we’ll outline the effects of using the “interlacing” technique when painting a density field to a mesh. Interlacing is designed to reduce the effects of aliasing in Fourier space, which is caused by the finite sampling of the mesh. See Section 3.1 of Sefusatti et al. 2015 for an introduction to the interlacing technique. Aliasing leads to spurious contributions to Fourier-space statistics, and the effects become worse as one approaches the Nyquist frequency, defined as:
\(k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\)
Interlacing is designed to remove the odd, high-frequency images, thus reducing the overall aliasing contribution at \(k < k_\mathrm{Nyq}\).
Here, we will consider the effects of interlacing on the power spectrum when using a mesh of size \(N_\mathrm{mesh}=256\), as compared to the “truth”, computed using a mesh with twice the resolution, i.e., \(N_\mathrm{mesh} = 512\). We consider the effects when using the Cloud in Cell (CIC) and Triangular Shaped Cloud (TSC) interpolation methods.
We correct the measured power spectrum for the effects of the interpolation window using equation 18 of Jing et al. 2005. This amounts to dividing the power spectrum by:
\(W^2(\mathbf{k}) = \prod_{i=x,y,z} \left[\mathrm{sinc}\left(\frac{\pi k_i}{2 k_\mathrm{Nyq}}\right)\right]^p\)
where \(p = 4,6\) for the CIC and TSC windows, respectively. To accomplish this correction in nbodykit, we use the CompensateTSC
and CompensateCIC
functions.
Note
We do not apply any additional corrections for aliasing outside of turning on interlacing. An additional approximation to correct aliasing is often used (see equation 20 of Jing et al. 2005), but we do not use that correction in order to focus on the effects of interlacing. See the functions CompensateTSCAliasing
and CompensateCICAliasing
for more info on these additional corrections.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style
from scipy.interpolate import InterpolatedUnivariateSpline as spline
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
Initializing a Log-normal Mock¶
We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-4} \ h^{3} \mathrm{Mpc}^{-3}\).
[3]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
BoxSize = 1380.
Nmesh = 256
cat = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=BoxSize, Nmesh=Nmesh, bias=2.0, seed=42)
Generating the “Truth”¶
We generate the “truth” power spectrum by using a higher resolution mesh with Nmesh=512
, using the TSC window interpolation method. With a higher resolution mesh, the Nyquist frequency is larger, decreasing the effects of aliasing at a given \(k\) value.
[4]:
# use a high-resolution mesh to get the truth
mesh = cat.to_mesh(window='tsc', Nmesh=512, compensated=True)
# compute the 1D power of this mesh
r = FFTPower(mesh, mode='1d')
# create a smooth interpolation
truth = r.power
truth = spline(truth['k'], truth['power'].real - truth.attrs['shotnoise'])
The Effects of Interlacing¶
In this section, we use a mesh of size \(256^3\) to compute the 1D power \(P(k)\) using the CIC and TSC windows and while using and not using the interlacing technique. We will then compare this \(P(k)\) to the “truth”, measured using the higher resolution mesh of size \(512^3\).
[5]:
for interlaced in [True, False]:
for window in ['CIC', 'TSC']:
# convert catalog to a mesh with desired window and interlacing
mesh = cat.to_mesh(Nmesh=256, window=window, compensated=False, interlaced=interlaced)
# apply correction for the window to the mesh
compensation = mesh.CompensateCIC if window == 'CIC' else mesh.CompensateTSC
mesh = mesh.apply(compensation, kind='circular', mode='complex')
# compute the 1D power P(k)
r = FFTPower(mesh, mode='1d')
Pk = r.power
# compare P(k) to the hi-resolution mesh P(k)
label = 'interlaced=%s, window=%s' %(interlaced, window)
plt.plot(Pk['k'], (Pk['power'].real - Pk.attrs['shotnoise']) / truth(Pk['k']), label=label)
# plot Nyquist frequency
k_ny = numpy.pi * Nmesh / BoxSize
plt.axvline(x=k_ny, c='k', label="Nyquist frequency for Nmesh=256")
# format the axes
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k) / P(k)^\mathrm{truth}$")
plt.ylim(0.9, 1.2)
[5]:
(0.9, 1.2)
In this figure, we clearly see the effects of aliasing on the lower-resolution mesh results, which cause the measured power to deviate from unity. There are two trends to note. First, TSC interpolation performs better than the CIC window, since it is a higher-order interpolation scheme. Second, for a fixed window, turning on interlacing drastically reduces the contributions from aliasing. When using TSC and interlacing, \(P(k)\) is recovered almost perfectly up to the Nyquist frequency.
Mesh Interpolation Windows¶
In this notebook we explore the different window options for interpolating discrete objects on to a mesh. We compute the 1D power spectrum \(P(k)\) of a density field interpolated from a log-normal mock using different windows. The windows we test are:
Cloud in Cell: cic (the default in nbodykit)
Triangular Shaped Cloud: tsc
Lanczos kernel with \(a=2\) and \(a=3\): lanczos2 and lanczos3
Symmetric Daubechies wavelets with sizes 6, 12, and 20: sym6, sym12, and sym20
We also include timing tests when using each of these windows. The CIC window (default) is the fastest and can be considerably faster than the other kernels, especially the wavelet windows with large sizes.
When computing the power spectrum \(P(k)\), we de-convolve the effects of the interpolation on the measured power (compensated=True
) for the CIC and TSC windows, but do not apply any corrections for the other windows.
For more information on using the Daubechies wavelets for power spectrum measurements, see Cui et al. 2008.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style
from scipy.interpolate import InterpolatedUnivariateSpline as spline
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
Initializing a Log-normal Mock¶
We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-4} \ h^{3} \mathrm{Mpc}^{-3}\).
[3]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
cat = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=1380., Nmesh=256, bias=2.0, seed=42)
cat_noise = UniformCatalog(nbar=3e-4, BoxSize=1380., seed=42)
Generating the “Truth”¶
We generate the “truth” power spectrum by using a higher resolution mesh with Nmesh=512
, using the TSC window interpolation method. With a higher resolution mesh, the Nyquist frequency is larger, decreasing the effects of aliasing at a given \(k\) value.
[4]:
r = FFTPower(cat.to_mesh(window='tsc', Nmesh=512, compensated=True),
mode='1d') # hi-resolution mesh
truth = r.power
truth = spline(truth['k'], truth['power'].real - truth.attrs['shotnoise'])
The “truth” power spectrum for the Poisson shot noise is simply \(\frac{1}{\bar{n}}\).
[5]:
truth_noise = lambda k: 0.0 * k + 1 / 3e-4
Comparing Different Windows / LCDM-like¶
We compare the results when using different windows to measure an LCDM-like \(P(k)\) against the high-resolution “truth” measured in the previous section. The goal is to identify which windows perform better or worse at minimizing these differences. We compute results using a lower-resolution mesh, Nmesh=256. The effects of aliasing are clear: the measured power differs from the “truth”. For CIC and TSC, there is a residual bias on small scales even after the resampling window has been
compensated. This is because the true power spectrum is LCDM-like, while the compensation window is derived for a shot-noise-like signal.
[6]:
for window in ['cic', 'tsc', 'sym6', 'sym12', 'sym20', 'lanczos2', 'lanczos3']:
print("computing power for window '%s'\n" %window + "-"*32)
compensated = True if window in ['cic', 'tsc'] else False
mesh = cat.to_mesh(Nmesh=256, window=window, compensated=compensated, interlaced=False)
# compute the power
%time r = FFTPower(mesh, mode='1d')
Pk = r.power
plt.plot(Pk['k'], (Pk['power'].real - Pk.attrs['shotnoise']) / truth(Pk['k']), label=window)
plt.legend(loc=0, ncol=2, fontsize=16)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k) / P(k)^\mathrm{truth}$")
plt.xlim(0.01, 0.6)
plt.ylim(0.9, 1.1)
computing power for window 'cic'
--------------------------------
CPU times: user 1.54 s, sys: 89.1 ms, total: 1.63 s
Wall time: 1.64 s
computing power for window 'tsc'
--------------------------------
CPU times: user 1.58 s, sys: 88.4 ms, total: 1.67 s
Wall time: 1.68 s
computing power for window 'sym6'
--------------------------------
CPU times: user 5.21 s, sys: 159 ms, total: 5.36 s
Wall time: 5.48 s
computing power for window 'sym12'
--------------------------------
CPU times: user 12.3 s, sys: 253 ms, total: 12.5 s
Wall time: 13 s
computing power for window 'sym20'
--------------------------------
CPU times: user 19.5 s, sys: 195 ms, total: 19.7 s
Wall time: 19.8 s
computing power for window 'lanczos2'
--------------------------------
CPU times: user 2.13 s, sys: 93.4 ms, total: 2.22 s
Wall time: 2.23 s
computing power for window 'lanczos3'
--------------------------------
CPU times: user 4.01 s, sys: 143 ms, total: 4.15 s
Wall time: 4.3 s
[6]:
(0.9, 1.1)
In this figure, we see that each window kernel has a specific \(k_\mathrm{max}\) where we can reasonably trust the measured results. The wavelet kernels sym12
and sym20
perform the best but take over an order of magnitude longer to compute than the default kernel (CIC). The Triangular Shaped Cloud window performs the next best, followed by the sym6
kernel and the CIC kernel. The Lanczos kernels do not perform well and should not be used without additional compensation
corrections (currently not implemented by default in nbodykit).
Comparing Different Windows / Shotnoise-like¶
We compare the results when using different windows to measure a shot-noise-like \(P(k)\), comparing against the expected flat shot-noise power \(1/\bar{n}\) defined above. We again compute results using a lower-resolution mesh, Nmesh=256. The effects of aliasing and resampling are clear: for CIC and TSC, the compensation is effective at restoring the “missing” power, recovering the flat shot-noise spectrum. The wavelet-inspired resampling windows also recover the shot-noise power excellently. The Lanczos-based windows
perform relatively poorly due to the lack of a suitable compensation function.
[7]:
for window in ['cic', 'tsc', 'sym6', 'sym12', 'sym20', 'lanczos2', 'lanczos3']:
print("computing power for window '%s'\n" %window + "-"*32)
compensated = True if window in ['cic', 'tsc'] else False
mesh = cat_noise.to_mesh(Nmesh=256, window=window, compensated=compensated, interlaced=False)
# compute the power
%time r = FFTPower(mesh, mode='1d')
Pk = r.power
plt.plot(Pk['k'], (Pk['power'].real) / truth_noise(Pk['k']), label=window)
plt.legend(loc=0, ncol=2, fontsize=16)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k) / P(k)^\mathrm{truth}$")
plt.xlim(0.01, 0.6)
plt.ylim(0.9, 1.1)
computing power for window 'cic'
--------------------------------
CPU times: user 1.71 s, sys: 122 ms, total: 1.83 s
Wall time: 1.86 s
computing power for window 'tsc'
--------------------------------
CPU times: user 1.99 s, sys: 151 ms, total: 2.14 s
Wall time: 2.21 s
computing power for window 'sym6'
--------------------------------
CPU times: user 7.99 s, sys: 231 ms, total: 8.22 s
Wall time: 8.59 s
computing power for window 'sym12'
--------------------------------
CPU times: user 18.5 s, sys: 395 ms, total: 18.9 s
Wall time: 19.5 s
computing power for window 'sym20'
--------------------------------
CPU times: user 28.1 s, sys: 473 ms, total: 28.6 s
Wall time: 29.4 s
computing power for window 'lanczos2'
--------------------------------
CPU times: user 3.41 s, sys: 205 ms, total: 3.61 s
Wall time: 3.8 s
computing power for window 'lanczos3'
--------------------------------
CPU times: user 5.42 s, sys: 138 ms, total: 5.56 s
Wall time: 5.57 s
[7]:
(0.9, 1.1)
Recommendations
Given the increased computational cost of the wavelet windows, we recommend that users use either the CIC or TSC window. To obtain the most precise results in the least time, users should test the effects of setting interlaced=True
(see this tutorial) while also trying various mesh sizes by changing the Nmesh
parameter. The most robust approach is to run a series of convergence tests against a high-resolution
mesh, which reduces the contribution of aliasing at fixed \(k\); this allows the user to determine the best configuration for their desired accuracy. A sketch of such a convergence test is shown below.
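As one possible convergence test, the sketch below (reusing the catalog defined in this notebook) measures \(P(k)\) with successively finer meshes; the configuration is converged over the \(k\) range where the curves stop changing:
# a sketch of a convergence test: measure P(k) with successively finer meshes
# and check where the results stop changing with resolution
for Nmesh in [256, 384, 512]:
    mesh = cat.to_mesh(Nmesh=Nmesh, window='tsc', compensated=True, interlaced=True)
    Pk = FFTPower(mesh, mode='1d').power
    plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label='Nmesh=%d' % Nmesh)
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")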
Log-normal Mocks¶
In this notebook, we generate a LogNormalCatalog
at \(z=0.55\) using the Planck 2015 cosmology parameters and the Eisenstein-Hu linear power spectrum fitting formula. We then plot the power spectra in both real and redshift space and compare to the input linear power spectrum.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style, setup_logging
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Initializing the Log-normal Mock¶
Here, we generate a mock catalog of biased objects (\(b_1 = 2\)) at a redshift \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula.
We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-3} \ h^{3} \mathrm{Mpc}^{-3}\).
[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
b1 = 2.0
cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=b1, seed=42)
Real-space Power Spectrum¶
In this section, we compute and plot the power spectrum, \(P(k,\mu)\), in real space (no redshift distortions) using five \(\mu\) bins.
[5]:
# convert the catalog to the mesh, with CIC interpolation
real_mesh = cat.to_mesh(compensated=True, window='cic', position='Position')
# compute the 2d P(k,mu) power, with 5 mu bins
r = FFTPower(real_mesh, mode='2d', Nmu=5)
pkmu = r.power
[ 000017.54 ] 0: 10-30 20:51 CatalogMesh INFO painted 7887677 out of 7887677 objects to mesh
[ 000017.54 ] 0: 10-30 20:51 CatalogMesh INFO mean particles per cell is 0.470142
[ 000017.54 ] 0: 10-30 20:51 CatalogMesh INFO sum is 7.88768e+06
[ 000017.55 ] 0: 10-30 20:51 CatalogMesh INFO normalized the convention to 1 + delta
[ 000017.91 ] 0: 10-30 20:51 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
[6]:
# plot each mu bin
for i in range(pkmu.shape[1]):
Pk = pkmu[:,i]
label = r'$\mu$=%.1f' %pkmu.coords['mu'][i]
plt.loglog(Pk['k'], Pk['power'].real - Pk.attrs['shotnoise'], label=label)
# plot the biased linear power spectrum
k = numpy.logspace(-2, 0, 512)
plt.loglog(k, b1**2 * Plin(k), c='k', label=r'$b_1^2 P_\mathrm{lin}$')
# add a legend and axes labels
plt.legend(loc=0, ncol=2, fontsize=16)
plt.xlabel(r"$k$ [$h \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu)$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
plt.ylim(500, 2e5)
[6]:
(500, 200000.0)
In this figure, we see the power spectrum in real-space is independent of \(\mu\), as expected. At low \(k\), the measured power agrees with the biased linear power \(b_1^2 P_\mathrm{lin}\). But at high \(k\), we see the effects of applying the Zel’dovich approximation to the density field, mirroring the effects of nonlinear evolution.
Redshift-space Power Spectrum¶
In this section, we compute and plot the power spectrum, \(P(k,\mu)\), in redshift space and see the effects of redshift-space distortions on the 2D power spectrum.
[7]:
def kaiser_pkmu(k, mu):
"""
Returns the Kaiser linear P(k,mu) in redshift space
.. math::
P(k,mu) = (1 + f/b_1 mu^2)^2 b_1^2 P_\mathrm{lin}(k)
"""
beta = cosmo.scale_independent_growth_rate(redshift) / b1
return (1 + beta*mu**2)**2 * b1**2 * Plin(k)
Here, we add redshift space distortions (RSD) along the \(z\)-axis of the box. We use the VelocityOffset
column which is equal to the Velocity
column re-normalized properly for RSD.
[8]:
LOS = [0,0,1] # the z-axis
# add RSD to the Position
cat['RSDPosition'] = cat['Position'] + cat['VelocityOffset'] * LOS
[9]:
# from catalog to mesh
rsd_mesh = cat.to_mesh(compensated=True, window='cic', position='RSDPosition')
# compute the 2D power
r = FFTPower(rsd_mesh, mode='2d', Nmu=5, los=LOS)
pkmu = r.power
# plot each mu power bin, normalized by the expected Kaiser power
for i in range(pkmu.shape[1]):
Pk = pkmu[:,i]
mu = pkmu.coords['mu'][i]
label = r'$\mu$=%.1f' %mu
P = Pk['power'].real-Pk.attrs['shotnoise']
plt.plot(Pk['k'], P / kaiser_pkmu(Pk['k'], mu), label=label)
# add a legend and axis labels
plt.legend(loc=0, ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k, \mu) / P^\mathrm{Kaiser}(k,\mu)$")
plt.xlim(0.01, 0.6)
plt.ylim(0, 2.5)
[ 000026.10 ] 0: 10-30 20:51 CatalogMesh INFO painted 7887677 out of 7887677 objects to mesh
[ 000026.10 ] 0: 10-30 20:51 CatalogMesh INFO mean particles per cell is 0.470142
[ 000026.10 ] 0: 10-30 20:51 CatalogMesh INFO sum is 7.88768e+06
[ 000026.10 ] 0: 10-30 20:51 CatalogMesh INFO normalized the convention to 1 + delta
[ 000026.46 ] 0: 10-30 20:51 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
/Users/nhand/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:15: RuntimeWarning: divide by zero encountered in true_divide
from ipykernel import kernelapp as app
[9]:
(0, 2.5)
In this figure, we clearly see the effects of RSD, which causes the higher \(\mu\) bins to be damped at high \(k\), relative to the lower \(\mu\) bins. Also, the measured power agrees with the expected Kaiser formula for linear RSD at low \(k\) (where linear theory holds): \(P^\mathrm{Kaiser} = (1 + f/b_1 \mu^2)^2 b_1^2 P_\mathrm{lin}(k)\).
Painting a Catalog to a Mesh¶
In this notebook, we outline the process of painting a discrete catalog of objects to a mesh. We discuss some of the most common examples, including the default behavior, which paints \(1+\delta\), painting the line-of-sight momentum field, and painting multiple species of particles to the same mesh.
Each section in this notebook is designed to be stand-alone, providing a full tutorial of the specific topic for users.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import style, setup_logging
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging() # turn on logging to screen
Painting the Overdensity Field¶
When painting a catalog to a mesh, the painted density field is equal to \(1+\delta(\mathbf{x})\) when using the default configuration of nbodykit. In this section, we will paint a log-normal mock catalog of objects to a mesh and take a look at some of the properties of the resulting \(1+\delta(\mathbf{x})\) field.
To start, we initialize a log-normal mock catalog of unbiased (\(b_1=1\)) objects at a redshift of \(z=0.55\) in a box of side length \(L_\mathrm{box} = 1380 \ \mathrm{Mpc}/h\).
[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=1.0, seed=42)
Next, we convert the catalog to a mesh object, using the Triangular Shaped Cloud interpolation window.
[5]:
mesh = cat.to_mesh(window='tsc')
The paint()
function returns the painted density field on the mesh, as a RealField
object.
[6]:
one_plus_delta = mesh.paint(mode='real')
print(type(one_plus_delta))
[ 000020.17 ] 0: 10-30 20:42 CatalogMesh INFO painted 7886161 out of 7886161 objects to mesh
[ 000020.17 ] 0: 10-30 20:42 CatalogMesh INFO mean particles per cell is 0.470052
[ 000020.17 ] 0: 10-30 20:42 CatalogMesh INFO sum is 7.88616e+06
[ 000020.17 ] 0: 10-30 20:42 CatalogMesh INFO normalized the convention to 1 + delta
[ 000020.22 ] 0: 10-30 20:42 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done
<class 'pmesh.pm.RealField'>
The underlying array data of the RealField
object is stored in the value
attribute. Below, we compute the mean, maximum, and minimum of the \(1+\delta\) field. As expected, these values are consistent with the properties of the overdensity field \(\delta\), specifically \(\mathrm{min}(\delta) = -1\) and \(\langle \delta \rangle = 0\).
[7]:
print("mean of 1+delta = ", one_plus_delta.value.mean())
print("min of 1+delta = ", one_plus_delta.value.min())
print("max of 1+delta = ", one_plus_delta.value.max())
mean of 1+delta = 1.0
min of 1+delta = 0.0
max of 1+delta = 17.8592
We can preview a 2D projection of the density field with the preview
function.
[8]:
plt.imshow(one_plus_delta.preview(axes=[0,1]))
[8]:
<matplotlib.image.AxesImage at 0x117a1bac8>
Finally, we take a look at a 1D histogram of the \(1+\delta\) field values in each cell on the mesh. The distribution of values in the \(1+\delta\) field very roughly follows a log-normal distribution, providing a semi-realistic representation of the distribution of mass in the true Universe.
[9]:
# histogram of 1+delta in log-spaced bins
bins = numpy.logspace(-7, numpy.log10(30.), 100)
_ = plt.hist(one_plus_delta.value.flatten(), bins=bins)
# format the axes
plt.xscale('log')
plt.xlabel(r"$1+\delta$")
plt.ylabel(r"$N_\mathrm{cells}$")
plt.xlim(1e-2, 20)
[9]:
(0.01, 20)
Painting the Line-of-sight Momentum Field¶
In this section, we describe how to use nbodykit to paint the line-of-sight momentum field. The paint()
function paints mass-weighted (or equivalently, number-weighted) quantities to the mesh, as
where \(\delta\) is the usual overdensity field and \(V(\mathbf{x})\) is the field value. By default, \(V(\mathbf{x})\) is equal to unity, such that the field painted is \(1+\delta\) (see the previous section). Here, we set the field value \(V(\mathbf{x})\) to be the line-of-sight velocity field, \(V(\mathbf{x}) = v_\parallel(\mathbf{x})\), and paint the line-of-sight momentum field (mass-weighted velocity).
In this section, we compute the cross power spectrum of the momentum field with the density field, the auto spectrum of the momentum field, and the density auto spectrum for a log-normal mock and compare to the corresponding linear predictions for these terms. For more details on these momentum correlators, see Vlah et al. 2012.
We start by initializing a log-normal mock catalog of unbiased (\(b_1=1\)) objects at a redshift of \(z=0.55\) in a box of side length \(L_\mathrm{box} = 1380 \ \mathrm{Mpc}/h\).
[10]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=1.0, seed=42)
We choose the \(x\) axis of the box to be our line-of-sight, and add a column to our catalog holding the \(x\) component of the velocity field, properly normalized by the conformal Hubble factor \(\mathcal{H}\), where \(\mathcal{H} = a H\). We perform this normalization such that the resulting velocity field has units of \(\mathrm{Mpc}/h\). Conveniently, the VelocityOffset
column holds the quantity we desire, so we select the first column for the \(x\) component.
[11]:
# line-of-sight is x-axis
LOS = [1, 0, 0]
# this is the velocity / (a*H)
cat['Vx'] = cat['VelocityOffset'][:,0] # units are Mpc/h
Next, we convert our catalog to mesh objects in order to paint the momentum and overdensity fields. To paint the momentum, we simply specify the value
keyword as the Vx
column, holding the line-of-sight velocity.
[12]:
# mesh to paint momentum by specifying "value" keyword
momentum_mesh = cat.to_mesh(compensated=True, window='tsc', position='Position', value='Vx')
# mesh to paint 1+delta by using default value=1.0
mesh = cat.to_mesh(compensated=True, window='tsc', position='Position')
Now, we can use these mesh objects to compute the desired auto and cross spectra of the overdensity and momentum fields using the FFTPower
algorithm. First, we compute the usual density auto spectrum using
[13]:
# the auto power spectrum of the overdensity field
r00 = FFTPower(mesh, mode='1d', los=[1,0,0])
[ 000044.22 ] 0: 10-30 20:42 CatalogMesh INFO painted 7886161 out of 7886161 objects to mesh
[ 000044.22 ] 0: 10-30 20:42 CatalogMesh INFO mean particles per cell is 0.470052
[ 000044.23 ] 0: 10-30 20:42 CatalogMesh INFO sum is 7.88616e+06
[ 000044.23 ] 0: 10-30 20:42 CatalogMesh INFO normalized the convention to 1 + delta
[ 000044.60 ] 0: 10-30 20:42 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done
Next, we compute the cross power spectrum of the momentum and overdensity fields. The corresponding power spectrum has \(\mu\) dependence (see section 3.2 of Vlah et al. 2012), so we compute the \(\ell=1\) multipole to extract this angular dependence. Note that we also specify the \(x\) axis as the line-of-sight to use when determining the \(\mu\) coordinate.
[14]:
# the cross spectrum of the momentum and density
r01 = FFTPower(momentum_mesh, mode='1d', poles=[1], second=mesh, los=LOS)
[ 000051.03 ] 0: 10-30 20:43 CatalogMesh INFO painted 7886161 out of 7886161 objects to mesh
[ 000051.03 ] 0: 10-30 20:43 CatalogMesh INFO mean particles per cell is 0.470052
[ 000051.03 ] 0: 10-30 20:43 CatalogMesh INFO sum is 13.9668
[ 000051.03 ] 0: 10-30 20:43 CatalogMesh INFO normalized the convention to 1 + delta
[ 000051.40 ] 0: 10-30 20:43 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done
[ 000056.84 ] 0: 10-30 20:43 CatalogMesh INFO painted 7886161 out of 7886161 objects to mesh
[ 000056.85 ] 0: 10-30 20:43 CatalogMesh INFO mean particles per cell is 0.470052
[ 000056.85 ] 0: 10-30 20:43 CatalogMesh INFO sum is 7.88616e+06
[ 000056.85 ] 0: 10-30 20:43 CatalogMesh INFO normalized the convention to 1 + delta
[ 000057.22 ] 0: 10-30 20:43 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done
And finally, we compute the auto power spectrum of the momentum field. We are interested in the \(\mu^2\) dependence of the power spectrum (see section 3.2 of Vlah et al. 2012), so we compute the \(\ell=2\) multipole to extract this angular dependence.
[15]:
# the auto spectrum of the momentum field
r11 = FFTPower(momentum_mesh, mode='1d', poles=[2], los=LOS)
[ 000064.09 ] 0: 10-30 20:43 CatalogMesh INFO painted 7886161 out of 7886161 objects to mesh
[ 000064.09 ] 0: 10-30 20:43 CatalogMesh INFO mean particles per cell is 0.470052
[ 000064.09 ] 0: 10-30 20:43 CatalogMesh INFO sum is 13.9668
[ 000064.09 ] 0: 10-30 20:43 CatalogMesh INFO normalized the convention to 1 + delta
[ 000064.55 ] 0: 10-30 20:43 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=1) as CatalogMesh) painting done
Following Vlah et al. 2012, we can use these measured momentum correlators to compute their contributions to the redshift-space power spectrum, \(P(k,\mu)\). Specifically, we can compute the first three terms as
\[P(k,\mu) = P_{00}(k) + 2 k \mu\, P_{01}(k,\mu) + k^2 \mu^2\, P_{11}(k,\mu) + \ldots\]
In linear theory, these terms are given by (see Kaiser 1987):
\[P_{00}(k) = P_\mathrm{lin}(k), \qquad 2 k \mu\, P_{01}(k,\mu) = 2 f \mu^2 P_\mathrm{lin}(k), \qquad k^2 \mu^2\, P_{11}(k,\mu) = f^2 \mu^4 P_\mathrm{lin}(k),\]
where \(f\) is the logarithmic growth rate; their sum recovers the Kaiser result \((1+f\mu^2)^2 P_\mathrm{lin}(k)\) for unbiased tracers.
Below, we will compute the contributions to \(P(k,\mu)\) from our measured power spectra and compare to their linear counterparts.
[16]:
# shot noise subtracted density auto spectrum
P00 = r00.power['power'].real - r00.attrs['shotnoise']
The density-momentum cross spectrum has \(\mu\) dependence, so the dipole (\(\ell=1\)) is equal to our desired \(P_{01}\). Note that the dipole is imaginary, so we select the imaginary component of the measured spectrum. Finally, we multiply by the correct pre-factor of \(2k\).
Note that because \(P_{01}\) is imaginary, we obtain a real result, as required. Also, note that the final contribution to \(P(k,\mu)\) has \(\mu^2\) dependence, as expected.
[17]:
# density-momentum contribution
P01 = 2 * r01.poles['k'] * r01.poles['power_1'].imag
We are interested in the \(\mu^2\) term of the \(P_{11}(k,\mu)\) correlator, which can be computed by rescaling the measured \(\ell=2\) multipole of the momentum auto spectrum by a factor of \(3/2\). We then multiply by the necessary \(k^2\) factor from the above equation to obtain the final contribution of the third term to \(P(k,\mu)\).
Note that the final contribution to \(P(k,\mu)\) has \(\mu^4\) dependence, as expected.
[18]:
# momentum auto contribution
P11 = 1.5 * r11.poles['k']**2 * r11.poles['power_2'].real
The linear contributions to \(P(k,\mu)\) are given by:
[19]:
# set up the wavenumbers
k = numpy.logspace(-2, 0, 1000)
# the growth rate at this redshift
f = cosmo.scale_independent_growth_rate(redshift)
# linear first term
P00_lin = Plin(k)
# linear second term
P01_lin = 2*f*P00_lin
# linear third term
P11_lin = f**2 * P00_lin
Finally, we can plot the results and compare!
[20]:
# P00
plt.loglog(r00.power['k'], P00, label=r'$P_{00}$')
plt.loglog(k, P00_lin, c='k', label='linear prediction')
# P01
plt.loglog(r01.power['k'], P01, label=r"$P_{01}$")
plt.loglog(k, P01_lin, c='k')
# P11
plt.loglog(r11.power['k'], P11, label=r"$P_{11}$")
plt.loglog(k, P11_lin, c='k')
# format the axes
plt.legend(loc='lower left', ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P_{xy}$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[20]:
(0.01, 0.6)
Painting Multiple Species of Particles¶
In this section, we generate two “fake” catalogs of data, designed to represent dark matter particles and hydro particles from a cosmological simulation. We assign the two species different masses, as is often the case in simulations, and paint the catalogs to the same mesh using the MultipleSpeciesCatalog
object. Finally, we compute the combined power spectrum and compare the power spectra of the individual catalogs.
We begin by generating mock catalogs to represent dark matter and hydro particles of a simulation. We use our log-normal mock generator at a redshift of \(z=0.55\) and a box size of \(L_\mathrm{box} = 1380 \ \mathrm{Mpc}/h\).
[21]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
# the "hydro" particles
hydro = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=2.0, seed=42)
# the "dark matter" particles
dm = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=1.0, seed=84)
Often in simulations, different species of particles will have different masses. Here, we arbitrarily make the dark matter particles three times as heavy as the hydro particles, and add a new Mass
column to each catalog to store these values.
[22]:
hydro['Mass'] = 0.25
dm['Mass'] = 0.75
m_hydro = hydro['Mass']
m_dm = dm['Mass']
We can combine both species of particles into a single catalog using the MultipleSpeciesCatalog
object. This object takes a list of strings giving the names of each species and the individual catalogs.
[23]:
combined = MultipleSpeciesCatalog(['hydro', 'dm'], hydro, dm)
print(combined)
MultipleSpeciesCatalog(species=['hydro', 'dm'])
In this combined catalog, the columns for individual species can be accessed by prefixing the name with the name of the species and a forward slash, i.e., “dm/” or “hydro/”.
[24]:
print(combined.columns)
['hydro/Mass', 'hydro/Position', 'hydro/Selection', 'hydro/Value', 'hydro/Velocity', 'hydro/VelocityOffset', 'hydro/Weight', 'dm/Mass', 'dm/Position', 'dm/Selection', 'dm/Value', 'dm/Velocity', 'dm/VelocityOffset', 'dm/Weight']
For example, the mass of the dark matter is set to 0.75 for all dark matter particles.
[25]:
print(combined['dm/Mass'].compute())
[ 0.75 0.75 0.75 ..., 0.75 0.75 0.75]
And the mass of the hydro particles is set to 0.25.
[26]:
print(combined['hydro/Mass'].compute())
[ 0.25 0.25 0.25 ..., 0.25 0.25 0.25]
Now, we can convert our catalogs to mesh objects, and specify the Mass
column to be used as the weight value when painting to the mesh. This is done via the weight
keyword of the to_mesh()
function. By specifying weight='Mass'
, each hydro particle will contribute a value of 0.25 to the mesh and each dark matter particle will contribute a value of 0.75. The resulting density field is normalized as \(1+\delta\), using the sum of the densities for each species.
So, the painted field is:
\[F(\mathbf{x}) = 1 + \delta_\mathrm{tot}(\mathbf{x}),\]
where \(\delta_\mathrm{tot}\) is given by
\[1 + \delta_\mathrm{tot}(\mathbf{x}) = \frac{\sum_i w_i(\mathbf{x})\, n_i(\mathbf{x})}{\sum_i \langle w_i \rangle\, \bar{n}_i},\]
and the mean density of each species is determined as
\[\bar{n}_i = \frac{N_i}{V}.\]
Here, the mean number density of each species is assumed to be constant throughout the box, given by the total number of objects of that species divided by the box volume.
Note
nbodykit supports spatially varying weights, even though in our case, the weights are constant, such that \(w_\mathrm{dm}(\mathbf{x}) = 0.75\) and \(w_\mathrm{hydro}(\mathbf{x}) = 0.25\).
[27]:
# the combined mesh, weighted by mass
combined_mesh = combined.to_mesh(Nmesh=256, BoxSize=1380.0, compensated=True, window='tsc', weight='Mass')
# the hydro-only mesh
hydro_mesh = hydro.to_mesh(Nmesh=256, BoxSize=1380.0, compensated=True, window='tsc', weight='Mass')
# the dm-only mesh
dm_mesh = dm.to_mesh(Nmesh=256, BoxSize=1380.0, compensated=True, window='tsc', weight='Mass')
First, compute the 1D power \(P(k)\) of the combined mesh of particles.
[28]:
r_combined = FFTPower(combined_mesh, mode='1d')
[ 000095.27 ] 0: 10-30 20:43 MultipleSpeciesCatalogMesh INFO painting the 'hydro' species
[ 000100.98 ] 0: 10-30 20:43 CatalogMesh INFO painted 7887677 out of 7887677 objects to mesh
[ 000100.98 ] 0: 10-30 20:43 CatalogMesh INFO mean particles per cell is 0.117536
[ 000100.98 ] 0: 10-30 20:43 CatalogMesh INFO sum is 1.97192e+06
[ 000100.98 ] 0: 10-30 20:43 CatalogMesh INFO normalized the convention to 1 + delta
[ 000101.00 ] 0: 10-30 20:43 MultipleSpeciesCatalogMesh INFO painting the 'dm' species
[ 000108.29 ] 0: 10-30 20:43 CatalogMesh INFO painted 7881785 out of 7881785 objects to mesh
[ 000108.29 ] 0: 10-30 20:43 CatalogMesh INFO mean particles per cell is 0.352343
[ 000108.29 ] 0: 10-30 20:43 CatalogMesh INFO sum is 7.88326e+06
[ 000108.29 ] 0: 10-30 20:43 CatalogMesh INFO normalized the convention to 1 + delta
[ 000109.11 ] 0: 10-30 20:43 MultipleSpeciesCatalogMesh INFO field: (MultipleSpeciesCatalog(species=['hydro', 'dm']) as CatalogMesh) painting done
And also compute the dark matter only and hydro only power spectra.
[29]:
# dark matter only power
r_dm = FFTPower(dm_mesh, mode='1d')
# hydro only power
r_hydro = FFTPower(hydro_mesh, mode='1d')
[ 000118.44 ] 0: 10-30 20:44 CatalogMesh INFO painted 7881785 out of 7881785 objects to mesh
[ 000118.44 ] 0: 10-30 20:44 CatalogMesh INFO mean particles per cell is 0.352343
[ 000118.44 ] 0: 10-30 20:44 CatalogMesh INFO sum is 5.91133e+06
[ 000118.44 ] 0: 10-30 20:44 CatalogMesh INFO normalized the convention to 1 + delta
[ 000119.13 ] 0: 10-30 20:44 CatalogMesh INFO field: (LogNormalCatalog(seed=84, bias=1) as CatalogMesh) painting done
[ 000128.23 ] 0: 10-30 20:44 CatalogMesh INFO painted 7887677 out of 7887677 objects to mesh
[ 000128.23 ] 0: 10-30 20:44 CatalogMesh INFO mean particles per cell is 0.117536
[ 000128.23 ] 0: 10-30 20:44 CatalogMesh INFO sum is 1.97192e+06
[ 000128.23 ] 0: 10-30 20:44 CatalogMesh INFO normalized the convention to 1 + delta
[ 000128.69 ] 0: 10-30 20:44 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
We can express the total overdensity field as a weighted sum of the individual fields:
\[\delta_\mathrm{tot}(\mathbf{x}) = \sum_i \frac{\alpha_i}{\sum_j \alpha_j}\, \delta_i(\mathbf{x}),\]
where \(\alpha_i = \langle w_i \rangle \bar{n}_i\). Using this equation, we can compute the power spectrum of the combined mesh from the individual power spectra as
\[P_\mathrm{tot}(k) = \sum_i \left(\frac{\alpha_i}{\sum_j \alpha_j}\right)^2 P_i(k).\]
Note that we have assumed the fields to be uncorrelated, which is true in this case, such that there are no cross power spectra between the fields.
[30]:
# the individual power spectra
P_dm = r_dm.power['power'].real - r_dm.attrs['shotnoise']
P_hydro = r_hydro.power['power'].real - r_hydro.attrs['shotnoise']
# the predicted power spectrum
norm = m_hydro.sum() + m_dm.sum()
Ptot_predicted = (m_hydro.sum()/norm)**2 * P_hydro + (m_dm.sum()/norm)**2 * P_dm
Now, we compare the individual power spectra, the measured combined power spectrum, and our predicted power spectrum
[31]:
# the measured combined power
Ptot = r_combined.power['power'].real
plt.loglog(r_combined.power['k'], Ptot - r_combined.attrs['shotnoise'], label='DM + hydro (measured)')
# predicted combined power
plt.loglog(r_combined.power['k'], Ptot_predicted, label='DM + hydro (predicted)')
# plot the individual spectra
plt.loglog(r_dm.power['k'], P_dm , label='DM only')
plt.loglog(r_hydro.power['k'], P_hydro, label='hydro only')
# format the axes
plt.legend(loc='lower left', ncol=2)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3} \mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[31]:
(0.01, 0.6)
We see that the measured power spectrum for the combined DM + hydro mesh (blue) agrees well with the predicted value computed from the individual power spectra (orange). As expected, the hydro and dark matter only power spectra have similar shapes, but the hydro only power is more strongly biased (\(b_1=2\) vs \(b_1=1\)).
The Projected Power Spectrum of Data in a Simulation Box¶
In this notebook, we explore the functionality of the ProjectedFFTPower
algorithm, which can compute the power spectrum of density fields projected in 1D and 2D, commonly referred to as \(P_\mathrm{1D}(k)\) and \(P_\mathrm{2D}(k)\). The algorithm is suitable for use on data sets in periodic simulation boxes, as the power spectrum is computed via a single FFT of the projected density mesh.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
[2]:
from nbodykit.lab import *
from nbodykit import setup_logging, style
import matplotlib.pyplot as plt
plt.style.use(style.notebook)
[3]:
setup_logging()
Initializing a Log-normal Mock¶
We start by generating a mock catalog of biased objects (\(b_1 = 2\)) at a redshift of \(z=0.55\). We use the Planck 2015 cosmology and the Eisenstein-Hu linear power spectrum fitting formula. We generate the catalog in a box of side length \(L = 1380 \ \mathrm{Mpc}/h\) with a constant number density \(\bar{n} = 3 \times 10^{-3} \ h^{3} \mathrm{Mpc}^{-3}\).
[4]:
redshift = 0.55
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')
BoxSize = 1380.
b1 = 2.0
cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=BoxSize, Nmesh=256, bias=b1, seed=42)
We update the Position
column to add redshift-space distortions along the z
axis of the box using the VelocityOffset
column.
[5]:
# add RSD
line_of_sight = [0,0,1]
cat['Position'] += cat['VelocityOffset'] * line_of_sight
Computing the Projected Power¶
In this section, we compute and plot the power spectra of the projected 1D and 2D density fields, and compare to the full 3D power spectrum.
We must first convert our CatalogSource
object to a MeshSource
, by setting up the mesh and specifying which interpolation kernel we wish to use. Here, we use “TSC” interpolation, and specify via compensated=True
that we wish to correct for the effects of the interpolation window in Fourier space.
[6]:
# convert to a MeshSource, using TSC interpolation on 256^3 mesh
mesh = cat.to_mesh(window='tsc', Nmesh=256, compensated=True)
We compute the 2D projected power by specifying that we wish to project the density field onto the (x,y)
plane by passing axes=[0,1]
to ProjectedFFTPower
.
[7]:
# the projected 2D power
r_2d = ProjectedFFTPower(mesh, dk=0.005, kmin=0.01, axes=[0,1])
[ 000018.50 ] 0: 10-30 20:51 CatalogMesh INFO painted 7887677 out of 7887677 objects to mesh
[ 000018.50 ] 0: 10-30 20:51 CatalogMesh INFO mean particles per cell is 0.470142
[ 000018.50 ] 0: 10-30 20:51 CatalogMesh INFO sum is 7.88768e+06
[ 000018.51 ] 0: 10-30 20:51 CatalogMesh INFO normalized the convention to 1 + delta
[ 000018.89 ] 0: 10-30 20:51 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
We compute the 1D projected power by specifying that we wish to project the density field onto the x
axis by passing axes=[0]
to ProjectedFFTPower
.
[8]:
# the projected 1D power
r_1d = ProjectedFFTPower(mesh, dk=0.005, kmin=0.01, axes=[0])
[ 000025.35 ] 0: 10-30 20:52 CatalogMesh INFO painted 7887677 out of 7887677 objects to mesh
[ 000025.35 ] 0: 10-30 20:52 CatalogMesh INFO mean particles per cell is 0.470142
[ 000025.35 ] 0: 10-30 20:52 CatalogMesh INFO sum is 7.88768e+06
[ 000025.35 ] 0: 10-30 20:52 CatalogMesh INFO normalized the convention to 1 + delta
[ 000025.71 ] 0: 10-30 20:52 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
We also use the FFTPower
algorithm to compute the power spectrum of the full 3D density field for comparison purposes.
[9]:
# the 3D power P(k)
r_3d = FFTPower(mesh, mode='1d', dk=0.005, kmin=0.01)
[ 000031.43 ] 0: 10-30 20:52 CatalogMesh INFO painted 7887677 out of 7887677 objects to mesh
[ 000031.43 ] 0: 10-30 20:52 CatalogMesh INFO mean particles per cell is 0.470142
[ 000031.43 ] 0: 10-30 20:52 CatalogMesh INFO sum is 7.88768e+06
[ 000031.43 ] 0: 10-30 20:52 CatalogMesh INFO normalized the convention to 1 + delta
[ 000031.78 ] 0: 10-30 20:52 CatalogMesh INFO field: (LogNormalCatalog(seed=42, bias=2) as CatalogMesh) painting done
The power spectrum results are stored as a BinnedStatistic
object as the power
attribute.
[10]:
# the result is stored at "power" attribute
P1D = r_1d.power
P2D = r_2d.power
P3D = r_3d.power
Here, we plot each of the measured power spectra, scaled appropriately by the box size \(L_\mathrm{box}\) in order to give each power spectrum the same units.
[11]:
# plot the 1D, 2D, and 3D power spectra
plt.loglog(P1D['k'], P1D['power'].real * BoxSize**2, label=r'$L_\mathrm{box}^2 P_\mathrm{1D}$')
plt.loglog(P2D['k'], P2D['power'].real * BoxSize, label=r"$L_\mathrm{box} P_\mathrm{2D}$")
plt.loglog(P3D['k'], P3D['power'].real, label=r"$P_\mathrm{3D}$")
# format the axes
plt.legend(loc=0)
plt.xlabel(r"$k$ [$h \ \mathrm{Mpc}^{-1}$]")
plt.ylabel(r"$P(k)$ [$h^{-3}\mathrm{Mpc}^3$]")
plt.xlim(0.01, 0.6)
[11]:
(0.01, 0.6)
We can see that \(P_\mathrm{1D}\) and \(P_\mathrm{2D}\) are considerably noisier than the 3D power spectrum, because the number of Fourier modes available to average over in each bin is drastically reduced once the density field has been projected along one or more axes.
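To get a rough sense of the size of this effect, we can count the available modes per bin. The estimate below is a back-of-the-envelope sketch (not part of the measurement above), using the fundamental frequency of the box:
import numpy
# fundamental frequency of the box
kf = 2 * numpy.pi / BoxSize
dk, k = 0.005, 0.1  # bin width and a representative wavenumber
# approximate number of independent modes per bin
N_3d = 4 * numpy.pi * k**2 * dk / kf**3 / 2   # spherical shell, halved by the reality condition
N_1d = dk / kf                                # modes along a single axis in the same bin
print(N_3d, N_1d)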
The (static) recipes below are provided as Jupyter notebooks and are available for download by clicking the “Source” link in the navigation bar at the top of the page.
Data Recipes¶
- Demonstrates how to use the LogNormalCatalog class, which Poisson samples a log-normal density field and applies the Zel’dovich approximation.
- Demonstrates how to use the HODCatalog class, which uses the halotools package to populate a halo catalog with objects.
Painting Recipes¶
- Demonstrates how to create a mesh object from a catalog and paint the overdensity field to the mesh.
- Demonstrates how to paint mass-weighted quantities (in this case, the line-of-sight momentum field) to a mesh.
- Demonstrates how to paint the combined overdensity field from different types of particles to the same mesh.
Algorithm Recipes¶
- Demonstrates how to use FFTPower to compute the power spectrum of objects in a simulation box.
- Demonstrates the use and effects of the interlacing technique when painting density fields to a mesh.
- Demonstrates the different interpolation windows available to use when painting and their accuracy.
- Demonstrates how to use ConvolvedFFTPower to compute the power spectrum multipoles of observational data.
- Computes the power spectrum multipoles of the DR12 BOSS LOWZ galaxy sample.
- Demonstrates how to use AngularPairCount to compute the angular correlation function.
Contributing¶
If you have an application of nbodykit that is concise and interesting, please consider adding it to our cookbook. We also welcome feedback and improvements for these recipes. Users can submit issues or open a pull request on the nbodykit cookbook repo on GitHub.
Cookbook recipes should be in the form of Jupyter notebooks. See the existing recipes for examples. The recipes are designed to illustrate interesting uses of nbodykit for other users to learn from.
We appreciate any and all contributions!
Dealing with Discrete Data¶
The main interface for dealing with data in the form of catalogs of discrete
objects is provided by subclasses of the
nbodykit.base.catalog.CatalogSource
object.
In this section, we provide an overview of this class and note important
things to know.
What is a CatalogSource
?¶
Most often the user starts with a catalog of discrete objects, with a set of
fields describing each object, such as the position coordinates, velocity,
mass, etc. Given this input data, the user wishes to use nbodykit to perform a
task, e.g., computing the power spectrum or grouping together objects with a
friends-of-friends algorithm. To achieve these goals, nbodykit provides the
nbodykit.base.catalog.CatalogSource
base class.
The CatalogSource
object behaves much like a
numpy structured array,
where the fields of the array are referred to as “columns”. These columns store
the information about the objects in the catalog; common columns are
“Position”, “Velocity”, “Mass”, etc. A list of the column names
that are valid for a given catalog can be accessed via the
CatalogSource.columns
attribute.
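For example, a minimal illustration using one of the built-in mock catalogs:
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
print(cat.columns)  # e.g. ['Position', 'Selection', 'Value', 'Velocity', 'Weight']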
Use Cases¶
The CatalogSource
is an abstract base class – it cannot be directly
initialized. Instead, nbodykit includes several specialized catalog subclasses
of CatalogSource
in the nbodykit.source.catalog
module. In
general, these subclasses fall into two categories:
Reading data from disk (see Reading Catalogs from Disk)
Generating mock data at run time (see Generating Catalogs of Mock Data)
Requirements¶
A well-defined size¶
The only requirement to initialize a CatalogSource
is that the object
has a well-defined size. Information about the length of a CatalogSource
is stored in two attributes:
- CatalogSource.size: the local size of the catalog, equal to the number of objects in the catalog on the local rank
- CatalogSource.csize: the collective, global size of the catalog, equal to the sum of size across all MPI ranks
So, the user can think of a CatalogSource
object as storing
information for a total of csize
objects, which is
divided amongst the available MPI ranks such that each process only stores
information about size
objects.
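For example, a minimal sketch (when running on a single rank, the two attributes coincide):
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
print(cat.size)   # the number of objects stored on this MPI rank
print(cat.csize)  # the total number of objects, summed over all ranks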
The Position
column¶
All CatalogSource
objects must include the Position
column, which
should be a (N,3)
array giving the Cartesian position of each of the N
objects in the catalog.
Often, the user will have the Cartesian coordinates
stored as separate columns or have the object coordinates in terms of
right ascension, declination, and redshift. See Common Data Operations
for more details about how to construct the Position
column for
these cases.
Default Columns¶
All CatalogSource
objects include several default columns.
These columns are used broadly throughout nbodykit and can be summarized as
follows:
Name | Description | Default Value
Weight | The weight to use for each particle when interpolating a CatalogSource onto a mesh; the painted field is a weighted average of Value, using these weights. | 1.0
Value | When interpolating a CatalogSource onto a mesh, the field value that each particle contributes to the mesh; the painted field is a weighted average of Value. | 1.0
Selection | A boolean column that selects a subset slice of the CatalogSource; only objects with Selection equal to True are painted to a mesh. | True
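As a quick check of these defaults (a minimal sketch; the values in the comments assume the defaults listed above):
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
# the default columns exist even though we never set them
print(cat.compute(cat['Weight'][:3]))     # expected: [1. 1. 1.]
print(cat.compute(cat['Value'][:3]))      # expected: [1. 1. 1.]
print(cat.compute(cat['Selection'][:3]))  # expected: [ True  True  True]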
Storing Meta-data¶
For all CatalogSource
objects, the input parameters and additional
meta-data are stored in the attrs
dictionary attribute.
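For example, a minimal sketch of inspecting and adding meta-data (the exact keys present depend on the catalog type):
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
print(cat.attrs)            # input parameters, e.g. the BoxSize and seed used above
cat.attrs['note'] = 'test'  # additional user meta-data can be stored as well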
API¶
For more information about specific catalog objects, please see the API section.
Reading Catalogs from Disk¶
Supported Data Formats¶
nbodykit provides support for initializing
CatalogSource
objects by reading tabular data
stored on disk in a variety of formats:
In this section, we provide short examples illustrating how to read data stored in each of these formats. If your data format is not currently supported, you can either read the data in yourself and adapt it into a catalog (see Adapting in memory data to a Catalog), or write your own catalog class (see Reading a Custom Data Format).
Plaintext Data¶
Reading data stored as columns in plaintext files is supported via the
CSVCatalog
class. This class partitions the CSV file into chunks, and
data is only read from the relevant chunks of the file, using
the pandas.read_csv()
function. The class accepts any configuration
keywords that this function does. The partitioning step provides a significant
speed-up when reading from the end of the file, since the entirety of the data
does not need to be read first.
Caveats
- By default, the class reads space-separated columns, but this can be changed by setting delim_whitespace=False and changing the delimiter keyword.
- A pandas index column is not supported – all columns should represent data columns to read.
- Commented lines in the file are not supported – please remove all comments from the file before loading into nbodykit.
- There should not be a header line in the file – column names should be passed to CSVCatalog via the names argument.
As an example, below we generate 5 columns for 100 fake objects and write to a plaintext file:
[2]:
import numpy
from nbodykit.source.catalog import CSVCatalog
# generate some fake ASCII data
data = numpy.random.random(size=(100,5))
# save to a plaintext file
numpy.savetxt('csv-example.txt', data, fmt='%.7e')
# name each of the 5 input columns
names =['x', 'y', 'z', 'w', 'v']
# read the data
f = CSVCatalog('csv-example.txt', names)
# combine x, y, z to Position, and add boxsize
f['Position'] = f['x'][:, None] * [1, 0, 0] + f['y'][:, None] * [0, 1, 0] + f['z'][:, None] * [0, 0, 1]
f.attrs['BoxSize'] = 1.0
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example.txt, dataset=*, ncolumns=5, shape=(100,)>, ... 1 files))
columns = ['Position', 'Selection', 'Value', 'Weight', 'v', 'w', 'x', 'y', 'z']
total size = 100
Binary Data¶
The BinaryCatalog
object reads binary data that is stored
on disk in a column-major format. The class can read any numpy data type
and can handle arbitrary byte offsets between columns.
Caveats
Columns must be stored in consecutive order in the binary file (column-major format).
For example, below we save Position
and Velocity
columns to a binary
file and load them into a BinaryCatalog
:
[3]:
from nbodykit.source.catalog import BinaryCatalog
# generate some fake data and save to a binary file
with open('binary-example.dat', 'wb') as ff:
pos = numpy.random.random(size=(1024, 3)) # fake Position column
vel = numpy.random.random(size=(1024, 3)) # fake Velocity column
pos.tofile(ff); vel.tofile(ff); ff.seek(0)
# create the binary catalog
f = BinaryCatalog(ff.name, [('Position', ('f8', 3)), ('Velocity', ('f8', 3))], size=1024)
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BinaryCatalog(size=1024, FileStack(BinaryFile(path=/tmp/tmpedljiijj/binary-example.dat, dataset=*, ncolumns=2, shape=(1024,)>, ... 1 files))
columns = ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size = 1024
HDF Data¶
The HDFCatalog
object uses the h5py
module to read
HDF5 files. The class supports reading columns stored in h5py.Dataset
objects and in h5py.Group
objects, assuming that all arrays are of the
same length since catalog objects must have a fixed size. Columns stored in
different datasets or groups can be accessed via their full path in the
HDF5 file.
Caveats
HDFCatalog
attempts to load all possible datasets or groups from the HDF5 file. This can present problems if the data has different lengths. Use theexclude
keyword to explicitly exclude data that has the wrong size.
In the example below, we load fake data from both the dataset “Data1” and
from the group “Data2” in an example HDF5 file. “Data1” is a single structured
numpy array with Position
and Velocity
columns, while “Data2” is a
group storing the Position
and Velocity
columns separately. nbodykit
is able to load both types of data from HDF5 files, and the corresponding
column names are the full paths of the data in the file.
[4]:
import h5py
from nbodykit.source.catalog import HDFCatalog
# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)
# write to a HDF5 file
with h5py.File('hdf-example.hdf5' , 'w') as ff:
ff.create_dataset('Data1', data=dset)
grp = ff.create_group('Data2')
grp.create_dataset('Position', data=dset['Position']) # column as dataset
grp.create_dataset('Mass', data=dset['Mass']) # column as dataset
# initialize the catalog
f = HDFCatalog('hdf-example.hdf5')
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
HDFCatalog(size=1024, FileStack(HDFFile(path=/tmp/tmpedljiijj/hdf-example.hdf5, dataset=/, ncolumns=4, shape=(1024,)>, ... 1 files))
columns = ['Data1/Mass', 'Data1/Position', 'Data2/Mass', 'Data2/Position', 'Selection', 'Value', 'Weight']
total size = 1024
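If one of the groups had a different length than the others, its columns could be skipped explicitly. A minimal sketch, assuming the exclude keyword accepts the dataset name as described in the caveat above (here it simply drops the "Data2" columns):
# hypothetical usage: read only "Data1", excluding the "Data2" group
f = HDFCatalog('hdf-example.hdf5', exclude=['Data2'])
print(f.columns)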
Bigfile Data¶
The bigfile package is a massively
parallel IO library for large, hierarchical datasets, and nbodykit supports
reading data stored in this format using BigFileCatalog
.
Caveats
Similar to the
HDFCatalog
class, datasets of the wrong size stored in a bigfile format should be explicitly excluded using theexclude
keyword.
Below, we load Position
and Velocity
columns, stored in the
bigfile
format:
[5]:
import bigfile
from nbodykit.source.catalog import BigFileCatalog
# generate some fake data
data = numpy.empty(512, dtype=[('Position', ('f8', 3)), ('Velocity', ('f8',3))])
data['Position'] = numpy.random.random(size=(512, 3))
data['Velocity'] = numpy.random.random(size=(512,3))
# save fake data to a BigFile
with bigfile.BigFile('bigfile-example', create=True) as tmpff:
with tmpff.create("Position", dtype=('f4', 3), size=512) as bb:
bb.write(0, data['Position'])
with tmpff.create("Velocity", dtype=('f4', 3), size=512) as bb:
bb.write(0, data['Velocity'])
with tmpff.create("Header") as bb:
bb.attrs['Size'] = 512.
# initialize the catalog
f = BigFileCatalog('bigfile-example', header='Header')
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BigFileCatalog(size=512, FileStack(BigFile(path=/tmp/tmpedljiijj/bigfile-example, dataset=./, ncolumns=2, shape=(512,)>, ... 1 files))
columns = ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size = 512
FITS Data¶
The FITS data format is supported via the
FITSCatalog
object. nbodykit relies on the
fitsio package to perform the read
operation.
Caveats
The FITS file must contain a readable binary table of data.
Specific extensions to read can be passed via the
ext
keyword. By default, data is read from the first HDU that has readable data.
For example, below we load Position
and Velocity
data from a FITS file:
[6]:
import fitsio
from nbodykit.source.catalog import FITSCatalog
# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)
# write to a FITS file using fitsio
fitsio.write('fits-example.fits', dset, extname='Data')
# initialize the catalog
f = FITSCatalog('fits-example.fits', ext='Data')
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
FITSCatalog(size=1024, FileStack(FITSFile(path=/tmp/tmpedljiijj/fits-example.fits, dataset=Data, ncolumns=2, shape=(1024,)>, ... 1 files))
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
Reading Multiple Data Files at Once¶
CatalogSource
objects support reading
multiple files at once, providing a continuous view of each individual catalog
stacked together. Each file read must contain the same data types, otherwise
the data cannot be combined into a single catalog.
This becomes particularly useful when the user has data
split into multiple files in a single directory, as is often the case when
processing large amounts of data. For example, output binary snapshots from
N-body simulations, often totaling 10GB - 100GB in size, can be read into a
single BinaryCatalog
with nbodykit.
When specifying multiple files to load, the user can use either an explicit
list of file names or use an asterisk glob pattern to match files.
As an example, below, we read data from two plaintext files into a single
CSVCatalog
:
[7]:
# generate data
data = numpy.random.random(size=(100,5))
# save first 40 rows of data to file
numpy.savetxt('csv-example-1.txt', data[:40], fmt='%.7e')
# save the remaining 60 rows to another file
numpy.savetxt('csv-example-2.txt', data[40:], fmt='%.7e')
Using a glob pattern¶
[8]:
# the names of the columns in both files
names =['a', 'b', 'c', 'd', 'e']
# read with a glob pattern
f = CSVCatalog('csv-example-*', names)
print(f)
# combined catalog size is 40+60=100
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size = 100
Using a list of file names¶
[9]:
# the names of the columns in both files
names =['a', 'b', 'c', 'd', 'e']
# read with a list of the file names
f = CSVCatalog(['csv-example-1.txt', 'csv-example-2.txt'], names)
print(f)
# combined catalog size is 40+60=100
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size = 100
Adapting in memory data to a Catalog¶
A lightweight way to handle a data format that nbodykit does not understand is to read the data with existing tools and then adapt it into a catalog via
nbodykit.lab.ArrayCatalog
.
Here is an example that reads in a file with numpy’s load function and then makes it into a catalog.
[10]:
from nbodykit.source.catalog import ArrayCatalog
# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)
# save to a npy file
numpy.save("npy-example.npy", data)
data = numpy.load("npy-example.npy")
# initialize the catalog
f = ArrayCatalog(data)
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
f = ArrayCatalog({'Position' : data['Position'], 'Mass' : data['Mass'] })
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
ArrayCatalog(size=1024)
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
ArrayCatalog(size=1024)
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
Reading a Custom Data Format¶
Users can implement their own subclasses of CatalogSource
for reading
custom data formats with a few easy steps. The core functionality of the
CatalogSource
classes described in this section use the
nbodykit.io
module for reading data from disk. This module implements the
nbodykit.io.base.FileType
base class, which is an abstract
class that behaves like a file
-like object. For the built-in
file formats discussed in this section, we have implemented the following
subclasses of FileType
in the nbodykit.io
module: CSVFile
, BinaryFile
,
BigFile
, HDFFile
, and FITSFile
.
To make a valid subclass of FileType
, users must:
- Implement the read() function that reads a range of the data from disk.
- Set the size in the __init__() function, specifying the total size of the data on disk.
- Set the dtype in the __init__() function, specifying the type of data stored on disk.
Once we have the custom subclass implemented, the
nbodykit.source.catalog.file.FileCatalogFactory()
function can
be used to automatically create a custom CatalogSource
object
from the subclass.
As a toy example, we will illustrate how this is done for data saved
using the numpy .npy
format. First, we will implement our
subclass of the FileType
class:
[11]:
from nbodykit.io.base import FileType
class NPYFile(FileType):
"""
A file-like object to read numpy ``.npy`` files
"""
def __init__(self, path):
self.path = path
self.attrs = {}
# load the data and set size and dtype
self._data = numpy.load(self.path)
self.size = len(self._data) # total size
self.dtype = self._data.dtype # data dtype
def read(self, columns, start, stop, step=1):
"""
Read the specified column(s) over the given range
"""
return self._data[start:stop:step]
And now generate the subclass of CatalogSource
:
[12]:
from nbodykit.source.catalog.file import FileCatalogFactory
NPYCatalog = FileCatalogFactory('NPYCatalog', NPYFile)
And finally, we generate some fake data, save it to a .npy
file,
and then load it with our new NPYCatalog
class:
[13]:
# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)
# save to a npy file
numpy.save("npy-example.npy", data)
# and now load the data
f = NPYCatalog("npy-example.npy")
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
NPYCatalog(size=1024, FileStack(NPYFile(path=/tmp/tmpedljiijj/npy-example.npy, dataset=None, ncolumns=2, shape=(1024,)>, ... 1 files))
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
This toy example illustrates how custom data formats can be incorporated
into nbodykit, but users should take care to optimize their storage
solutions for more complex applications. In particular, data storage formats
that are stored in column-major format and allow data slices from arbitrary
locations to be read should be favored. This enables large speed-ups when
reading data in parallel. On the contrary, our simple toy example class
NPYFile
reads the entirety of the data before returning
a certain slice in the read()
function. In general, this should be
avoided if at all possible.
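For instance, one way to avoid reading the full file in the toy example above is to memory-map the .npy file, so that read() only touches the requested rows. This is a minimal sketch under that assumption; the class name MMapNPYFile is ours, not part of nbodykit:
import numpy
from nbodykit.io.base import FileType
class NPYMMapFile(FileType):
    """
    A variant of the toy NPYFile above that memory-maps the .npy file,
    so only the requested slice is pulled into memory by read()
    """
    def __init__(self, path):
        self.path = path
        self.attrs = {}
        # open the file lazily via numpy's memory mapping
        self._data = numpy.load(self.path, mmap_mode='r')
        self.size = len(self._data)    # total number of rows
        self.dtype = self._data.dtype  # data type of each row
    def read(self, columns, start, stop, step=1):
        # only these rows are read from disk and copied into memory
        return numpy.asarray(self._data[start:stop:step])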
On Demand IO via dask.array
¶
nbodykit uses the dask package to store the columns
in CatalogSource
objects. The dask
package implements
a dask.array.Array
object that mimics the interface of the more
familiar numpy array. In this section, we describe what exactly a dask array
is and how it is used in nbodykit.
What is a dask array?¶
In nbodykit, the dask array object is a data container that behaves nearly
identical to a numpy array, except for one key difference. When performing
manipulations on a numpy array, the operations are performed immediately.
This is not the case for dask arrays. Instead, dask arrays store these
operations in a task graph and only evaluate the operations when the user
specifies to via a call to a compute()
function. When using
nbodykit, often the first task in this graph is loading data from disk.
Thus, dask provides nbodykit with on-demand IO functionality, allowing the user
to control when data is read from disk.
It is useful to describe a bit more about the nuts and bolts of the dask array to illustrate its full power. The dask array object cuts up the full array into many smaller arrays and performs calculations on these smaller “chunks”. This allows array computations to be performed on large data that does not fit into memory (but can be stored on disk). Particularly useful on laptops and other systems with limited memory, it extends the maximum size of useable datasets from the size of memory to the size of the disk storage. For further details, please see the introduction to the dask array in the dask documentation.
By Example¶
The dask array functionality is best illustrated by example. Here, we
initialize a UniformCatalog
that generates objects with uniformly distributed position and velocity columns.
[1]:
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
[2]:
print("catalog = ", cat)
catalog = UniformCatalog(size=96, seed=42)
[3]:
print("Position = ", cat['Position'])
Position = dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.45470105 0.83263203 0.06905134] last: [0.62474599 0.15388738 0.84302209]
We see that the Position
column can be accessed by indexing the catalog
with the column name and that the returned object is not a numpy array
but a dask array. The dask array has the same shape
(96,3) and
dtype
(‘f8’) as the underlying numpy array but also includes the
chunksize
attribute. This attribute specifies the size of the internal
chunks that dask uses to examine arrays in smaller pieces. In this case,
the data size is small enough that only a single chunk is needed.
The dask.array
module¶
The dask.array
module provides much of the same functionality as the
numpy
module, but with functions optimized to perform operations on
dask arrays.
For example, we can easily compute the minimum and maximum position coordinates
using the dask.array.min()
and dask.array.max()
functions.
[4]:
import dask.array as da
pos = cat['Position']
minpos = da.min(pos, axis=0)
maxpos = da.max(pos, axis=0)
print("minimum position coordinates = ", minpos)
print("maximum position coordinates = ", maxpos)
minimum position coordinates = dask.array<amin-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
maximum position coordinates = dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
Here, we see that the result of our calls to dask.array.min()
and
dask.array.max()
are also stored as dask arrays. The task
has not yet been performed but instead added to the internal dask task graph.
For a full list of the available functionality, please see the
dask array documentation.
A large subset of the most commonly used functions in numpy have
implementations in the dask.array
module. In addition to these
functions, dask arrays support the usual array arithmetic operations. For
example, to rescale the position coordinate array, use
[5]:
BoxSize = 2500.0
pos *= BoxSize
rescaled_minpos = da.min(pos, axis=0)
rescaled_maxpos = da.max(pos, axis=0)
Evaluating a dask array¶
The CatalogSource.compute()
function computes a dask array and returns
the result of the internal series of tasks, either a numpy array or float.
For example, we can compute the minimum and maximum of the position coordinates
using:
[6]:
minpos, maxpos = cat.compute(minpos, maxpos)
print("minimum position coordinates = ", minpos)
print("maximum position coordinates = ", maxpos)
minimum position coordinates = [0.00402579 0.00015685 0.00271747]
maximum position coordinates = [0.9927406 0.99610592 0.99925086]
And similarly, we see the result of the rescaling operation earlier:
[7]:
minpos, maxpos = cat.compute(rescaled_minpos, rescaled_maxpos)
print("minimum re-scaled position coordinates = ", minpos)
print("maximum re-scaled position coordinates = ", maxpos)
minimum re-scaled position coordinates = [10.06446279 0.39212416 6.79367111]
maximum re-scaled position coordinates = [2481.85149744 2490.26480949 2498.12715085]
Caching with Dask¶
nbodykit automatically caches tasks computed when evaluating dask arrays. The
global cache is controlled via the nbodykit.GlobalCache
class.
Often the most expensive task when evaluating a dask array is loading the data
from disk. By using dask’s caching features, CatalogSource
objects are
able to cache intermediate results, such that repeated calls to
CatalogSource.compute()
do not repeat expensive IO operations.
The global cache has a fixed size. By default, we set it to a reasonable (not too large) value. The default size is determined by the global_cache_size keyword, which can be changed via the set_options function.
Users can control the size of the global cache using:
from nbodykit import GlobalCache
GlobalCache.resize(2e9) # set cache size to 2 GB
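The default can likewise be set through the set_options function mentioned above; a minimal sketch, assuming the global_cache_size keyword described in this section:
from nbodykit import set_options
set_options(global_cache_size=2e9)  # set the default global cache size to 2 GB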
Note
When accessing columns of a CatalogSource
, the returned dask
array also has a compute()
function. When using this function to
evaluate dask arrays, internal caching will also be used. So users have the
option of using CatalogSource.compute()
or the compute()
attached to each dask array.
Examining Larger-than-Memory Data¶
CatalogSource
objects automatically take advantage of the chunking
features of the dask array, greatly reducing the difficulties of
analyzing larger-than-memory data. When combined with the ability of the
CatalogSource
object to provide a continuous view of multiple files
at once, we can analyze large amounts of data from a single catalog with ease.
A common use case is examining a directory of large binary outputs from a
N-body simulation on a laptop. Often the user wishes to select a smaller subsample
of the catalog or perform some trivial data inspection to verify the accuracy of
the data. These tasks become straightforward with nbodykit, using the functionality
provided by the CatalogSource
object and the dask
package.
Regardless of the size of the data that the user is loading, the nbodykit
CatalogSource
interface remains the same.
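A minimal sketch of this workflow is shown below; the snapshot file pattern, column layout, and per-file size are hypothetical placeholders:
from nbodykit.lab import BinaryCatalog
# stack all snapshot chunks in the directory into one continuous catalog
# (hypothetical file pattern, dtype, and per-file size)
dtype = [('Position', ('f8', 3)), ('Velocity', ('f8', 3))]
cat = BinaryCatalog('snapshot.*.bin', dtype, size=1024**2)
# inspect a small subset without evaluating the full dataset
sub = cat[:1000]
print(sub.compute(sub['Position'].min(axis=0)))
print(sub.compute(sub['Position'].max(axis=0)))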
Common Data Operations¶
Here, we detail some of the most common operations when dealing with
data in the form of a CatalogSource
. The native format for data columns
in a CatalogSource
object is the dask array. Be sure to read
the previous section for an introduction to dask arrays
before proceeding.
The dask array format allows users to easily manipulate columns in their input data and feed any transformed data into one of the nbodykit algorithms. This provides a fast and easy way to transform the data while hiding the implementation details needed to compute these transformations internally. In this section, we’ll provide examples of some of these data transformations to get users acclimated to dask arrays quickly.
To help illustrate these operations, we’ll initialize the nbodykit “lab” and load a catalog of uniformly distributed objects.
[2]:
from nbodykit.lab import *
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
Accessing Data Columns¶
Specific columns can be accessed by indexing the catalog object using the
column name, and a dask.array.Array
object is returned (see
What is a dask array? for more details on dask arrays).
[3]:
position = cat['Position']
velocity = cat['Velocity']
print(position)
print(velocity)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.45470105 0.83263203 0.06905134] last: [0.62474599 0.15388738 0.84302209]
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.0006346 0.00675438 0.00704942] last: [0.00375581 0.00046149 0.00819726]
While in the format of the dask array, data columns can easily be manipulated by the user. For example, here we normalize the position coordinates to the range 0 to 1 by dividing by the box size:
[4]:
# normalize the position
normed_position = position / cat.attrs['BoxSize']
print(normed_position)
dask.array<truediv, shape=(96, 3), dtype=float64, chunksize=(96, 3)>
Note that the normalized position array is also a dask array and that the actual normalization operation is yet to occur. This makes these kinds of data transformations very fast for the user.
Computing Data Columns¶
Columns can be converted from dask.array.Array
objects to
numpy arrays using the compute()
function (see
Evaluating a dask array for further details on computing dask arrays).
[5]:
position, velocity = cat.compute(cat['Position'], cat['Velocity'])
print(type(position))
print(type(velocity))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
We can also compute the max of the normalized position coordinates from the previous section:
[6]:
maxpos = normed_position.max(axis=0)
print(maxpos)
print(cat.compute(maxpos))
dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
[0.9927406 0.99610592 0.99925086]
Adding New Columns¶
New columns can be easily added to a CatalogSource
object by
directly setting them:
[7]:
# no "Mass" column originally
print("contains 'Mass'? :", 'Mass' in cat)
# add a random array as the "Mass" column
cat['Mass'] = numpy.random.random(size=len(cat))
# "Mass" exists!
print("contains 'Mass'? :", 'Mass' in cat)
# can also add scalar values -- converted to correct length
cat['Type'] = b"central"
print(cat['Mass'])
print(cat['Type'])
contains 'Mass'? : False
contains 'Mass'? : True
dask.array<array, shape=(96,), dtype=float64, chunksize=(96,)> first: 0.3745401188473625 last: 0.49379559636439074
dask.array<array, shape=(96,), dtype=|S7, chunksize=(96,)> first: b'central' last: b'central'
Here, we have added two new columns to the catalog, Mass
and Type
.
Internally, nbodykit stores the new columns as dask arrays and
will automatically convert them to the correct type if they are not already.
Caveats
New columns must either be a scalar value or an array with the same length as the catalog. Scalar values will automatically be broadcast to the correct length.
Setting a column of the wrong length will raise an exception.
Overwriting Columns¶
The same syntax used for adding new columns can also be used to overwrite
columns that already exist in a CatalogSource
. This procedure
works as one would expect – the most up-to-date values of columns are
always used in operations.
In the example below we overwrite both the Position
and Velocity
columns, and each time the columns are accessed, the most up-to-date values
are used, as expected.
[8]:
# some fake data
data = numpy.ones(5, dtype=[
('Position', ('f4', 3)),
('Velocity', ('f4', 3))]
)
# initialize a catalog directly from the structured array
src = ArrayCatalog(data)
# overwrite the Velocity column
src['Velocity'] = src['Position'] + src['Velocity'] # 1 + 1 = 2
# overwrite the Position column
src['Position'] = src['Position'] + src['Velocity'] # 1 + 2 = 3
print("Velocity = ", src.compute(src['Velocity'])) # all equal to 2
print("Position = ", src.compute(src['Position'])) # all equal to 3
Velocity = [[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]]
Position = [[3. 3. 3.]
[3. 3. 3.]
[3. 3. 3.]
[3. 3. 3.]
[3. 3. 3.]]
Adding Redshift-space Distortions¶
A useful operation in large-scale structure is the mapping of positions
in simulations from real space to redshift space, referred to
as redshift space distortions (RSD).
This operation can be easily performed by combining the Position
and
Velocity
columns to overwrite the Position
column. As first
found by Kaiser 1987,
the mapping from real to redshift space is:
\[s = r + \frac{\mathbf{v} \cdot \hat{\mathbf{n}}}{a H(a)},\]
where \(r\) is the line-of-sight position in real space, \(s\) is the line-of-sight position in redshift space, \(\mathbf{v}\) is the velocity vector, \(\hat{\mathbf{n}}\) is the line-of-sight unit vector, \(a\) is the scale factor, and \(H\) is the Hubble parameter at \(a\).
As an example, below we add RSD along the z
axis of a simulation box:
[9]:
# apply RSD along the z axis
line_of_sight = [0,0,1]
# redshift and cosmology
redshift = 0.55; cosmo = cosmology.Cosmology(h=0.7).match(Omega0_m=0.31)
# the RSD normalization factor
rsd_factor = (1+redshift) / (100 * cosmo.efunc(redshift))
# update Position, applying RSD
src['Position'] = src['Position'] + rsd_factor * src['Velocity'] * line_of_sight
The RSD factor is known as the conformal Hubble parameter
\(\mathcal{H} = a H(a)\). This calculation requires a cosmology,
which can be specified via the
Cosmology
class. We use the
efunc()
function which returns
\(E(z)\), where the Hubble parameter is defined as \(H(z) = 100h\ E(z)\)
in units of km/s/Mpc. Note that the operation above assumes the Position
column is in units of \(\mathrm{Mpc}/h\).
For catalogs in nbodykit that generate mock data, such as the
log-normal catalogs or HOD catalogs,
there is an additional column, VelocityOffset
, available to facilitate
RSD operations. This column has units of \(\mathrm{Mpc}/h\) and
includes the rsd_factor
above. Thus, this allows users to add RSD
simply by using:
src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight
Selecting a Subset¶
A subset of a CatalogSource
object can be selected using slice notation.
There are two ways to select a subset:
use a boolean array, which specifies which rows of the catalog to select
use a slice object specifying which rows of the catalog to select
For example,
[10]:
# boolean selection array
select = cat['Mass'] < 0.5
print("number of True entries = ", cat.compute(select.sum()))
# select only entries where select = True
subcat = cat[select]
print("size of subcat = ", subcat.size)
# select the first ten rows
subcat = cat[:10]
print("size of subcat = ", subcat.size)
# select first and last row
subcat = cat[[0, -1]]
print("size of subcat = ", subcat.size)
number of True entries = 50
size of subcat = 50
size of subcat = 10
size of subcat = 2
Caveats
- When indexing with a boolean array, the array must have the same length as the size attribute, or an exception will be raised.
- Selecting a single object by indexing with an integer is not supported. If the user wishes to select a single row, a list of length one can be used to select the specific row.
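For example, to select a single row:
# select the fourth row as a catalog of length one
row = cat[[3]]
print("size of row = ", row.size)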
Selecting a Subset of Columns from a CatalogSource
¶
A subset of columns can be selected from a CatalogSource
object by
indexing the catalog with a list of the names of the desired columns.
For example,
[11]:
print("columns in catalog = ", cat.columns)
# select Position + Mass
subcat = cat[['Position', 'Mass']]
# the selected columns + default columns
print("columns in subset = ", subcat.columns)
columns in catalog = ['Mass', 'Position', 'Selection', 'Type', 'Value', 'Velocity', 'Weight']
columns in subset = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
Caveats
When selecting a subset of columns, note that in addition to the desired columns, the sub-catalog will also contain the default columns (Weight, Value, and Selection).
The nbodykit.transform
module¶
The nbodykit.transform
module includes several commonly used functions
for convenience. We describe a few of the most common use cases in the sub-sections
below.
Note
The transform
module is available to users when
from nbodykit.lab import *
is executed.
Concatenating CatalogSource
Objects¶
When CatalogSource
objects have the same columns, they can be
concatenated together into a single object using the
nbodykit.transform.ConcatenateSources()
function. For example,
[12]:
cat1 = UniformCatalog(nbar=50, BoxSize=1.0, seed=42)
cat2 = UniformCatalog(nbar=150, BoxSize=1.0, seed=42)
combined = transform.ConcatenateSources(cat1, cat2)
print("total size = %d + %d = %d" %(cat1.size, cat2.size, combined.size))
total size = 47 + 145 = 192
Stacking Columns Together¶
Another common use case is when users need to combine separate
data columns vertically, as the columns of a new array. For example, often the
Cartesian position coordinates x
, y
, and z
are stored as separate
columns, and the Position
column must be added to a catalog from these
individual columns. We provide the nbodykit.transform.StackColumns()
function for this exact purpose. For example,
[13]:
# fake position data
data = numpy.random.random(size=(5,3))
# save to a plaintext file
numpy.savetxt('csv-example.dat', data, fmt='%.7e')
# the cartesian coordinates
names =['x', 'y', 'z']
# read the data
f = CSVCatalog('csv-example.dat', names)
# make the "Position" column
f['Position'] = transform.StackColumns(f['x'], f['y'], f['z'])
print(f['Position'])
print(f.compute(f['Position']))
dask.array<transpose, shape=(5, 3), dtype=float64, chunksize=(5, 1)> first: [0.52273283 0.42754102 0.02541913] last: [0.22879817 0.07697991 0.28975145]
[[0.52273283 0.42754102 0.02541913]
[0.10789143 0.03142919 0.63641041]
[0.31435598 0.50857069 0.90756647]
[0.24929223 0.41038292 0.75555114]
[0.22879817 0.07697991 0.28975145]]
Converting from Sky to Cartesian Coordinates¶
We provide the function nbodykit.transform.SkyToCartesian()
for converting
sky coordinates, in the form of right ascension, declination, and redshift,
to Cartesian coordinates. The conversion from redshift to comoving distance
requires a cosmology instance, which can be specified via the
Cosmology
class.
Below, we initialize a catalog holding random right ascension,
declination, and redshift coordinates, and then add the Cartesian position
as the Position
column.
The random number generator provided by a RandomCatalog always generates the correct number of items for the catalog. Use ? in IPython or Jupyter to see its docstring.
[14]:
src = RandomCatalog(100, seed=42)
# add random (ra, dec, z) coordinates
src['z'] = src.rng.normal(loc=0.5, scale=0.1)
src['ra'] = src.rng.uniform(low=0, high=360)
src['dec'] = src.rng.uniform(low=-90, high=90.)  # declination spans -90 to 90 degrees
# initialize a set of cosmology parameters
cosmo = cosmology.Cosmology(h=0.7)
# add the position
src['Position'] = transform.SkyToCartesian(src['ra'], src['dec'], src['z'], degrees=True, cosmo=cosmo)
Caveats
- Whether the right ascension and declination arrays are in degrees (as opposed to radians) should be specified via the degrees keyword.
- The units of the returned Position column are \(\mathrm{Mpc}/h\).
Using the dask.array
module¶
For more general column transformations, users should take advantage of the
dask.array
module, which implements most functions in the numpy
package in a manner optimized for dask arrays. The module can be accessed from the
nbodykit.transform
module as nbodykit.transform.da
.
Important
For a full list of functions available in the dask.array
module,
please see the dask array documentation.
We strongly recommend that new users read through this documentation
and familiarize themselves with the functionality provided by
the dask.array
module.
As a simple illustration, below we convert an array holding right ascension values from
degrees to radians, compute the sine of the array, and find the min and
max values using functions available in the dask.array
module.
[15]:
ra = transform.da.deg2rad(src['ra']) # from degrees to radians
sin_ra = transform.da.sin(ra) # compute the sine
print("min(sin(ra)) = ", src.compute(sin_ra.min()))
print("max(sin(ra)) = ", src.compute(sin_ra.max()))
min(sin(ra)) = -0.999907640154132
max(sin(ra)) = 0.9988053754673173
Generating Catalogs of Mock Data¶
nbodykit includes several methods for generating mock catalogs, with varying
levels of sophistication. These CatalogSource objects allow users to create catalogs of objects at run time and include randomly distributed objects, log-normal mocks, and Halo Occupation Distribution (HOD) mocks, each described in the sub-sections below.
Randomly Distributed Objects¶
nbodykit includes two subclasses of
CatalogSource
that
generate particles randomly in a box: RandomCatalog
and UniformCatalog
. While these catalogs do not produce
realistic cosmological distributions of objects, they are especially useful
for generating catalogs quickly and for testing purposes.
RandomCatalog
¶
The RandomCatalog
class includes a random number generator
with the functionality of numpy.random.RandomState
that generates
random numbers in parallel and in a manner that is independent of the number
of MPI ranks being used. This property is especially useful for running
reproducible tests where the number of CPUs might vary. The random number
generator is stored as the rng
attribute.
Users can use this random number generator to add columns to the catalog, using the usual syntax for adding new columns. For example,
[1]:
from nbodykit.lab import RandomCatalog
import numpy
# initialize a catalog with only the default columns
cat = RandomCatalog(csize=100) # collective size of 100
print("columns = ", cat.columns) # only the default columns present
# add mass uniformly distributed in log10
cat['Mass'] = 10**(cat.rng.uniform(12, 15))
# add normally distributed velocity
cat['Velocity'] = cat.rng.normal(loc=0, scale=500)
print(cat.columns)
columns = ['Selection', 'Value', 'Weight']
['Mass', 'Selection', 'Value', 'Velocity', 'Weight']
Caveats
- For a list of the full functionality of the rng attribute, please see the API documentation for MPIRandomState.
- When adding columns, the new column must have the same length as the local size of the catalog, as specified by the size attribute. Most functions of the rng attribute accept the size keyword to generate an array of the correct size.
UniformCatalog
¶
The UniformCatalog
is a subclass of
RandomCatalog
that includes Position
and Velocity
columns that are uniformly distributed. The positions of the particles are
uniformly distributed between zero and the size of the box (as specified by
the user), with an input number density. The velocities are also
uniformly distributed but on a scale that is 1% of the size of the box.
For example,
[2]:
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
print("columns = ", cat.columns)
# min must be greater than 0
print("position minimum = ", cat.compute(cat['Position'].min()))
# max must be less than 1.0
print("position maximum = ", cat.compute(cat['Position'].max()))
# min must be greater than 0
print("velocity minimum = ", cat.compute(cat['Velocity'].min()))
# max must be less than 0.01
print("velocity maximum = ", cat.compute(cat['Velocity'].max()))
columns = ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
position minimum = 0.00015684966390538957
position maximum = 0.9992508603400948
velocity minimum = 2.4238347118806793e-05
velocity maximum = 0.00999869916263692
Note that because UniformCatalog is a subclass of RandomCatalog, users can also use the rng attribute to add new columns to a UniformCatalog object.
Log-normal Mocks¶
The LogNormalCatalog
offers a more realistic
approximation of cosmological large-scale structure. The class
generates a set of objects by Poisson sampling a log-normal density field,
using the Zel’dovich approximation to model non-linear evolution.
Given a linear power spectrum function, redshift, and linear bias
supplied by the user, this class performs the following steps:
Generate Gaussian initial conditions
First, a Gaussian overdensity field \(\delta_L(\vk)\) is generated in Fourier space with a power spectrum given by a function specified by the user. We also generate linear velocity fields from the overdensity field in Fourier space, using

\[\vv(\vk) = i f a H \frac{\vk}{k^2} \delta_L(\vk),\]

where \(f\) is the logarithmic growth rate, \(a\) is the scale factor, and \(H\) is the Hubble parameter at \(a\). Note that bold variables reflect vector quantities.
Finally, we Fourier transform \(\delta_L(\vk)\) and \(\vv(\vk)\) to configuration space. These fields serve as the “initial conditions”. They will be evolved forward in time and Poisson sampled to create the final catalog.
Perform the log-normal transformation
Next, we perform a log-normal transformation on the density field \(\delta\). As first discussed in Coles and Jones 1991, the distribution of galaxies on intermediate to large scales can be well-approximated by a log-normal distribution. An additional useful property of a log-normal field is that it follows the natural constraint \(\delta(\vx) \ge -1\) of density contrasts by definition. This property does not generally hold true for Gaussian realizations of the overdensity field.
The new, transformed density field is given by

\[\delta(\vx) = \exp\left[b_L \, \delta_L(\vx) - \sigma^2\right] - 1,\]

where the normalization factor \(\sigma^2\) ensures that the mean of \(\delta(\vx)\) vanishes and \(b_L\) is the Lagrangian bias
factor, which is related to the final, linear bias as \(b_L = b_1 - 1\).
Here, \(b_1\) is the value input by the user as the bias
keyword
to the LogNormalCatalog
class.
Poisson sample the density field
We then generate discrete positions of objects by Poisson sampling the
overdensity field in each cell of the mesh. We assign each object the
velocity of the mesh cell that it is located in, and objects are placed randomly
inside their cells. The desired number density of objects in the box is
specified by the user as the nbar
parameter to the
LogNormalCatalog
class.
Apply the Zel’dovich approximation
Finally, we evolve the overdensity field according to the Zel’dovich approximation, which is 1st order Lagrangian perturbation theory. To do this, we move the positions of each object according to the linear velocity field,

\[\vr(\vx) = \vr_0(\vx) + \frac{\vv(\vx)}{f a H},\]

where \(\vr(\vx)\) is the final position of the objects, \(\vr_0(\vx)\) is the initial position, and \(\vv(\vx)\) is the velocity assigned in the previous step.
After this step, we have a catalog of discrete objects, with a Position column in units of \(\mathrm{Mpc}/h\), a Velocity column in units of km/s, and a VelocityOffset column in units of \(\mathrm{Mpc}/h\). The VelocityOffset column is provided as a convenience for adding redshift-space distortions (see Adding Redshift-space Distortions), such that RSD can be added using:
line_of_sight = [0,0,1]
src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight
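For concreteness, a minimal sketch of generating a log-normal mock and adding RSD along the z axis might look like the following (the number density, bias, and mesh size below are illustrative choices, not recommended values):
from nbodykit.lab import LogNormalCatalog, cosmology
# linear power spectrum at the desired redshift
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift=0.55, transfer='EisensteinHu')
# Poisson-sample a log-normal density field, evolved with the Zel'dovich approximation
src = LogNormalCatalog(Plin=Plin, nbar=3e-4, BoxSize=1380., Nmesh=256, bias=2.0, seed=42)
# add redshift-space distortions along the z axis
line_of_sight = [0, 0, 1]
src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight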
Note
For examples using log-normal mocks, see the cookbook/lognormal-mocks.ipynb recipe in The Cookbook.
Halo Occupation Distribution Mocks¶
nbodykit includes functionality to generate mock galaxy catalogs using the
Halo Occupation Distribution (HOD) technique via the HODCatalog
class. The HOD technique populates a catalog of halos with galaxies based on
a functional form for the probability that a halo of mass \(M\) hosts
\(N\) objects, \(P(N|M)\). The functional form of the HOD used by
HODCatalog
is the form used in Zheng et al 2007.
The average number of galaxies in a halo of mass \(M\) is

\[\langle N(M) \rangle = \langle N_\mathrm{cen}(M) \rangle + \langle N_\mathrm{sat}(M) \rangle,\]

where the occupation functions for centrals and satellites are given by

\[\langle N_\mathrm{cen}(M) \rangle = \frac{1}{2} \left[1 + \mathrm{erf}\left(\frac{\log_{10}M - \log_{10}M_\mathrm{min}}{\sigma_{\log_{10}M}}\right)\right],\]
\[\langle N_\mathrm{sat}(M) \rangle = \langle N_\mathrm{cen}(M) \rangle \left(\frac{M - M_0}{M_1}\right)^\alpha.\]
This HOD parametrization has 5 parameters, which can be summarized as:
| Parameter | Name | Default | Description |
| --- | --- | --- | --- |
| \(\log_{10}M_\mathrm{min}\) | logMmin | 13.031 | Minimum mass required for a halo to host a central galaxy |
| \(\sigma_{\log_{10}M}\) | sigma_logM | 0.38 | Rate of transition from \(N_\mathrm{cen}=0\) to \(N_\mathrm{cen}=1\) |
| \(\alpha\) | alpha | 0.76 | Power law slope of the relation between halo mass and \(N_\mathrm{sat}\) |
| \(\log_{10}M_0\) | logM0 | 13.27 | Low-mass cutoff in \(N_\mathrm{sat}\) |
| \(\log_{10}M_1\) | logM1 | 14.08 | Characteristic halo mass where \(N_\mathrm{sat}\) begins to assume a power law form |
The default values of the HOD parameters are taken from Reid et al. 2014.
This form of the HOD clustering description assumes the galaxy – halo connection depends only on the halo mass. Thus, given a catalog of halo objects, with associated mass values, users can quickly generate realistic galaxy catalogs using this class.
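As a quick illustration of these occupation functions (independent of the nbodykit API), the sketch below evaluates \(\langle N_\mathrm{cen}\rangle\) and \(\langle N_\mathrm{sat}\rangle\) with plain numpy at the default parameter values quoted above:
import numpy
from scipy.special import erf

# default Zheng et al. 2007 parameters from the table above
logMmin, sigma_logM, alpha, logM0, logM1 = 13.031, 0.38, 0.76, 13.27, 14.08

def Ncen(M):
    """Mean number of central galaxies in a halo of mass M (in Msun/h)."""
    return 0.5 * (1 + erf((numpy.log10(M) - logMmin) / sigma_logM))

def Nsat(M):
    """Mean number of satellites, modulated by the central occupation."""
    M = numpy.asarray(M, dtype=float)
    base = numpy.where(M > 10**logM0, (M - 10**logM0) / 10**logM1, 0.0)
    return Ncen(M) * base**alpha

M = numpy.logspace(12, 15, 4)  # halo masses in Msun/h
print("N_cen =", Ncen(M))
print("N_sat =", Nsat(M))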
Interfacing with halotools
via a HaloCatalog
¶
Internally, the HODCatalog
class uses the halotools
package to perform the halo population step. For further details,
see the documentation for halotools.empirical_models.Zheng07Cens
and
halotools.empirical_models.Zheng07Sats
, as well as
this tutorial on the Zheng 07 HOD model.
The catalog of halos input to the HODCatalog
class must be of type
halotools.sim_manager.UserSuppliedHaloCatalog
, the tabular
data format preferred by halotools
. nbodykit includes the
HaloCatalog
class in order to interface nicely with
halotools
. In particular, this catalog object
includes a to_halotools()
function to create a
UserSuppliedHaloCatalog
from the data columns
in the HaloCatalog
object.
Given a CatalogSource
object, the HaloCatalog
object
interprets the objects as halos, using a specified redshift, cosmology,
and mass definition, to add several analytic columns to the catalog, including
Radius()
and Concentration()
.
For example, below we generate uniform particles in a box and then interpret them as halos by specifying a redshift and cosmology:
[3]:
from nbodykit.lab import HaloCatalog, cosmology
# uniform objects in a box
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
# add a Mass column to the objects
cat['Mass'] = 10**(cat.rng.uniform(12, 15))
# initialize the halos
halos = HaloCatalog(cat, cosmo=cosmology.Planck15, redshift=0., mdef='vir', position='Position', velocity='Velocity', mass='Mass')
print(halos.columns)
['Concentration', 'Mass', 'Position', 'Radius', 'Selection', 'Value', 'Velocity', 'VelocityOffset', 'Weight']
And using the to_halotools()
function, we can create
the halotools.sim_manager.UserSuppliedHaloCatalog
object needed
to initialize the HODCatalog
object.
[4]:
halocat = halos.to_halotools()
print(halocat)
print(halocat.halo_table[:10])
<halotools.sim_manager.user_supplied_halo_catalog.UserSuppliedHaloCatalog object at 0x7f78f96c0198>
halo_x halo_y ... halo_upid halo_local_id
------------------- ------------------- ... --------- -------------
0.45470105222137214 0.8326320323149597 ... -1.0 0
0.31944725103976734 0.48518719297596336 ... -1.0 1
0.21242627399222258 0.16674683796980805 ... -1.0 2
0.3185452432219371 0.3490676632113582 ... -1.0 3
0.5066846142238558 0.23705948996927073 ... -1.0 4
0.47273713613639634 0.8569679300356892 ... -1.0 5
0.45934116150501314 0.9523800798479475 ... -1.0 6
0.9244771851038425 0.8843779857073643 ... -1.0 7
0.4553655200041342 0.8086086697963959 ... -1.0 8
0.5231105620516908 0.10008694708748456 ... -1.0 9
Caveats
- The units of the halo position, velocity, and mass input to HaloCatalog are assumed to be \(\mathrm{Mpc}/h\), km/s, and \(M_\odot/h\), respectively. These units are necessary to interface with halotools.
- The mass definition input to HaloCatalog can be “vir” to use virial masses, or an overdensity factor with respect to the critical or mean density, i.e., “200c”, “500c”, or “200m”, “500m”.
- If using the built-in friends-of-friends (FOF) finder class, FOF, to identify halos, the user can use the to_halos() function to directly produce a HaloCatalog from the result of running the FOF algorithm (a sketch follows below).
- By default, the halo concentration values stored in the Concentration column of a HaloCatalog object are generated using the input mass definition and the analytic formulas from Dutton and Maccio 2014. Users can overwrite this column with their own values if they wish to use custom concentration values when generating HOD catalogs.
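For completeness, a rough sketch of the FOF route mentioned in the caveats might look like the following; the particle catalog, linking length, minimum group size, and particle mass are purely illustrative:
from nbodykit.lab import UniformCatalog, FOF, cosmology
# a stand-in particle catalog; a real analysis would use simulation particles
particles = UniformCatalog(nbar=1000, BoxSize=1.0, seed=42)
# run the friends-of-friends finder (linking length in units of the mean separation)
fof = FOF(particles, linking_length=0.2, nmin=2)
# convert the FOF groups directly into a HaloCatalog
halos = fof.to_halos(particle_mass=1e12, cosmo=cosmology.Planck15, redshift=0.)
print(halos.columns)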
The HODCatalog
Class¶
Users can initialize the HOD catalog directly from the
UserSuppliedHaloCatalog
object and the desired
HOD parameters. The HODCatalog
object will include all of the
columns from the UserSuppliedHaloCatalog
object,
with the usual columns Position
, Velocity
, and VelocityOffset
for the generated galaxies. The additional columns are:
conc_NFWmodel: the concentration of the halo
gal_type: the galaxy type, 0 for centrals and 1 for satellites
halo_id: the global ID of the halo that this galaxy belongs to, between 0 and
csize
halo_local_id: the local ID of the halo that this galaxy belongs to, between 0 and
size
halo_mvir: the halo mass, in units of \(M_\odot/h\)
halo_nfw_conc: alias of
conc_NFWmodel
halo_num_centrals: the number of centrals that this halo hosts, either 0 or 1
halo_num_satellites: the number of satellites that this halo hosts
halo_rvir: the halo radius, in units of \(\mathrm{Mpc}/h\)
halo_upid: equal to -1; should be ignored by the user
halo_vx, halo_vy, halo_vz: the three components of the halo velocity, in units of km/s
halo_x, halo_y, halo_z: the three components of the halo position, in units of \(\mathrm{Mpc}/h\)
host_centric_distance: the distance from this galaxy to the center of the halo, in units of \(\mathrm{Mpc}/h\)
vx, vy, vz: the three components of the galaxy velocity, equal to Velocity, in units of km/s
x, y, z: the three components of the galaxy position, equal to Position, in units of \(\mathrm{Mpc}/h\)
Below, we populate the halos with galaxies using the Zheng07Model and compute the number of centrals and satellites in the catalog using the gal_type column.
[5]:
from nbodykit.hod import Zheng07Model
hod = halos.populate(Zheng07Model, alpha=0.5, sigma_logM=0.40, seed=42)
print("total number of HOD galaxies = ", hod.csize)
print(hod.columns)
print("number of centrals = ", hod.compute((hod['gal_type']==0).sum()))
print("number of satellites = ", hod.compute((hod['gal_type']==1).sum()))
total number of HOD galaxies = 291
['Position', 'Selection', 'Value', 'Velocity', 'VelocityOffset', 'Weight', 'conc_NFWmodel', 'gal_type', 'halo_hostid', 'halo_id', 'halo_mvir', 'halo_num_centrals', 'halo_num_satellites', 'halo_rvir', 'halo_upid', 'halo_vx', 'halo_vy', 'halo_vz', 'halo_x', 'halo_y', 'halo_z', 'host_centric_distance', 'vx', 'vy', 'vz', 'x', 'y', 'z']
number of centrals = 94
number of satellites = 197
Caveats
The HOD population step requires halo concentrations. If the user wishes to use custom concentration values, the UserSuppliedHaloCatalog table should contain a halo_nfw_conc column. Otherwise, the analytic prescriptions from Dutton and Maccio 2014 are used.
Repopulating a HOD Catalog¶
We can also quickly repopulate a HOD catalog in place, generating a new set of galaxies for the same set of halos by changing either the random seed or the HOD parameters. For example,
[6]:
# repopulate, just changing the random seed
hod.repopulate(seed=84)
print("total number of HOD galaxies = ", hod.csize)
print("number of centrals = ", hod.compute((hod['gal_type']==0).sum()))
print("number of satellites = ", hod.compute((hod['gal_type']==1).sum()))
# re-populate with new parameters
hod.repopulate(logM0=13.2, logM1=14.5)
print("total number of HOD galaxies = ", hod.csize)
print("number of centrals = ", hod.compute((hod['gal_type']==0).sum()))
print("number of satellites = ", hod.compute((hod['gal_type']==1).sum()))
total number of HOD galaxies = 282
number of centrals = 95
number of satellites = 187
total number of HOD galaxies = 141
number of centrals = 93
number of satellites = 48
Note
For examples using HOD mocks, see the cookbook/hod-mocks.ipynb recipe in The Cookbook.
Using a Custom HOD Model¶
Users can implement catalogs that use custom HOD modeling by subclassing
the HODBase
class. This base class is abstract, and subclasses must
implement the __makemodel__()
function. This function
returns a HodModelFactory
object, which is the halotools
object responsible for supporting
custom HOD models. For more information on designing your own HOD model
using halotools
, see
this series of halotools tutorials.
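While a full HODBase subclass is beyond the scope of this overview, the kind of object that __makemodel__() must return can be composed from standard halotools components. A minimal, illustrative sketch (the component choices here are assumptions, not the nbodykit defaults) is:
from halotools.empirical_models import (HodModelFactory, Zheng07Cens, Zheng07Sats,
                                        NFWPhaseSpace, TrivialPhaseSpace)

# occupation components: how many centrals/satellites a halo of a given mass hosts
cens_occ = Zheng07Cens()
sats_occ = Zheng07Sats()

# profile components: where galaxies are placed within their host halos
cens_prof = TrivialPhaseSpace()  # centrals sit at the halo center
sats_prof = NFWPhaseSpace()      # satellites follow an NFW profile

# compose the pieces into a full HOD model; __makemodel__() should return such an object
model = HodModelFactory(centrals_occupation=cens_occ,
                        centrals_profile=cens_prof,
                        satellites_occupation=sats_occ,
                        satellites_profile=sats_prof)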
Dealing with Data on a Mesh¶
What is a MeshSource
?¶
Often in large-scale structure data analysis, we wish to manipulate
representations of continuous quantities on a discrete grid. The canonical
example is the analysis of the cosmological density field,
interpolated on to a 3D mesh from a discrete set of galaxies. To support
such calculations, nbodykit provides the
nbodykit.base.mesh.MeshSource
object.
Fundamentally, the MeshSource
object stores a (possibly weighted)
density field on a three-dimensional mesh, with the Nmesh
parameter
determining the number of grid cells per side (such that there are
\(\mathrm{Nmesh}^3\) mesh cells). nbodykit adds the functionality
to analyze these fields in both configuration space (often referred to as real space) and Fourier space through an interface to the
RealField
and ComplexField
objects
implemented by the pmesh
package. These objects are
paired classes, related through the operation of a 3D
fast Fourier transform
(FFT). The FFT operation implemented in pmesh
relies on
the pfft-python package, which
is a Python binding of PFFT, a massively parallel
FFT library.
Use Cases¶
The MeshSource
is an abstract base class – it cannot be directly
initialized. Instead, nbodykit includes several specialized subclasses of
MeshSource
in the nbodykit.source.mesh
module. In general,
these subclasses fall into three categories:
- Generating mesh data from a CatalogSource (see Converting a CatalogSource to a Mesh)
- Reading mesh data from disk (see Saving and Loading a Mesh)
- Generating mock fields directly on a mesh (see Gaussian Realizations)
Painting the Mesh¶
The MeshSource.paint()
function produces the values of the field
on the mesh, returning either a RealField
or
ComplexField
. This function treats the mesh equally in
either configuration space or Fourier space, internally
performing the appropriate FFTs. By specifying the mode
keyword to the
paint()
function, users can access either the field
data in configuration space or the complex modes of the field in Fourier space.
The “painting” nomenclature derives from the most common use case. The
process of interpolating a set of discrete objects on to the mesh evokes
the imagery of “painting” the mesh. More generally, the paint()
function is responsible for filling in the mesh with data, which could also
involve reading data from disk or generating mock fields directly on the mesh.
For further details and examples of painting a catalog of discrete objects to a mesh, see Painting Catalogs to a Mesh.
Fields: RealField
and ComplexField
¶
The MeshSource
class provides an interface to the
pmesh.pm.RealField
and pmesh.pm.ComplexField
objects.
These classes behave like numpy arrays and include functions to
perform parallel forward and inverse FFTs. These
field objects are initialized from a pmesh.pm.ParticleMesh
, which
sets the number of mesh cells and stores FFT-related grid quantities.
[1]:
from pmesh.pm import ParticleMesh, RealField, ComplexField
# a 8^3 mesh
pm = ParticleMesh(Nmesh=[8,8,8])
# initialize a RealField
rfield = RealField(pm)
# shape
print("shape = ", rfield.shape)
# set entire mesh to unity
rfield[...] = 1.0
# print the mean of the underlying array
print("mean = ", rfield.value.mean())
shape = (8, 8, 8)
mean = 1.0
All MeshSource
objects implement either the
MeshSource.to_real_field()
function or the
MeshSource.to_complex_field()
function. These
functions are responsible for returning either a RealField
or a ComplexField
. The MeshSource.paint
function
calls these functions, providing the core functionality
of the MeshSource
class.
The c2r()
and r2c()
functions¶
Users can transform between RealField
and
ComplexField
objects using the r2c()
function for forward FFTs and the c2r()
function
for inverse FFTs. These operations take advantage of the fact that the field objects in
configuration space store real-valued quantities to perform real-to-complex
FFTs. This type of FFT uses the symmetry of real-valued quantities to store
only half of the complex modes along the z
axis.
[2]:
# perform the forward FFT
cfield = rfield.r2c()
# stores Nmesh/2+1 in z axis b/c of conjugate symmetry
print("shape = ", cfield.shape)
# k=0 mode is the mean value of configuration space field
print("mean of configuration space field from k=0 = ", cfield[0,0,0])
# perform the inverse FFT
rfield2 = cfield.c2r()
# print the mean of the underlying array
print("mean of real field = ", rfield2.value.mean())
shape = (8, 8, 5)
mean of configuration space field from k=0 = (1+0j)
mean of real field = 1.0
Storing Meta-data¶
For all MeshSource
objects, the input parameters and additional
meta-data are stored in the attrs
dictionary attribute.
API¶
For more information about specific mesh sources, please see the API section.
Creating a Mesh¶
In this section, we outline how users interact with data on a mesh in nbodykit. The main ways to create a mesh include:
Converting a CatalogSource
to a Mesh¶
Users can create mesh objects from CatalogSource
objects by specifying the desired number of cells per mesh side via the
Nmesh
parameter and using the
to_mesh()
function.
Below, we convert a UniformCatalog
to a MeshSource
using a \(16^3\) mesh in
a box of side length \(1\) \(\mathrm{Mpc}/h\).
[2]:
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
mesh = cat.to_mesh(Nmesh=16)
print("mesh = ", mesh)
mesh = (UniformCatalog(size=96, seed=42) as CatalogMesh)
Important
The to_mesh()
operation does not perform any
interpolation operations, which nbodykit refers to as “painting” the mesh.
This function merely initializes a new object that sets up the mesh with
the configuration provided by the user. For more details and examples of
painting a catalog of discrete objects to a mesh, see Painting Catalogs to a Mesh.
The Window Kernel¶
When interpolating discrete particles on to a regular mesh, we must choose which kind of interpolation kernel to use. The kernel determines which cells an object will contribute to on the mesh. In the simplest case, known as Nearest Grid Point interpolation, an object only contributes to the one cell that is closest to its position. In general, higher order interpolation schemes, which spread out objects over more cells, lead to more accurate results. See Section 2.3 of Sefusatti et al. 2015 for an introduction to different interpolation schemes.
nbodykit supports several different interpolation kernels, which can be
specified using the window
keyword of the to_mesh()
function. The default value is cic
, representing the second-order
interpolation scheme known as Cloud In Cell. The third-order interpolation
scheme, known as Triangular Shaped Cloud, can be specified by setting
window
to tsc
. CIC and TSC are the most commonly used interpolation
windows in the field of large-scale structure today.
Support for wavelet-based kernels is also provided. The Daubechies wavelet
with various sizes can be used by specifying db6
, db12
, or db20
.
The closely related Symlet wavelet can be used by specifying
sym6
, sym12
, or sym20
. These are symmetric Daubechies wavelets
and tend to perform better than the non-symmetric versions. For more
information on using wavelet kernels for painting a mesh, see
Cui et al. 2008.
Note that the non-traditional interpolation windows can be considerably slower
than the cic
or tsc
methods. For this reason, nbodykit uses the
cic
interpolation window by default. See pmesh.window.methods
for
the full list of supported window kernels.
Note
Please see this cookbook recipe for a notebook exploring the accuracy of different interpolation windows.
Interlacing¶
nbodykit provides support for the interlacing technique, which can reduce the effects of aliasing when Fourier transforming the density field on the mesh. This technique involves interpolating objects on to two separate meshes, separated by half of a cell size. When combining the complex fields in Fourier space from these two meshes, the effects of aliasing are significantly reduced on the combined field. For a more detailed discussion behind the mathematics of this technique, see Section 3.1 of Sefusatti et al. 2015.
By default this technique is turned off, but it can be turned on by the user
by passing interlaced=True
to the to_mesh()
function.
Note
Please see this cookbook recipe for a notebook exploring the effects of interlacing on density fields in Fourier space.
Compensation: Deconvolving the Window Kernel¶
Interpolating discrete objects on to the mesh produces a density field
defined on the mesh that is convolved with the interpolation kernel.
In Fourier space, the complex field is then the product of the true density
field and the Fourier transform of the window kernel as given by the
Convolution Theorem. For the TSC and CIC window kernels, there are
well-known correction factors that can be applied to the density field
in Fourier space. If we apply these correction factors, we refer to the
field as “compensated”, and the use of these correction factors
is controlled via the compensated
keyword of the
to_mesh()
function.
If compensated
is set to True
, the correction factors that will be
applied are:
| Window | Interlacing | Compensation Function | Reference |
| --- | --- | --- | --- |
| cic | False | CompensateCICAliasing | eq 20 of Jing et al. 2005 |
| tsc | False | CompensateTSCAliasing | eq 20 of Jing et al. 2005 |
| cic | True | CompensateCIC | eq 18 of Jing et al. 2005 (\(p=2\)) |
| tsc | True | CompensateTSC | eq 18 of Jing et al. 2005 (\(p=3\)) |
Note
If window
is not equal to tsc
or cic
, no compensation correction
is currently implemented by default in nbodykit, and if compensated
is
set to True
, an exception will be raised. Users can implement custom
compensation functions via the syntax for Applying Functions to the Mesh.
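Putting these options together, a short sketch of converting a catalog to a mesh with a TSC window, interlacing, and compensation enabled might look like:
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
# TSC interpolation, with interlacing and the window correction applied in Fourier space
mesh = cat.to_mesh(Nmesh=64, window='tsc', interlaced=True, compensated=True)
print(mesh)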
Additional Mesh Configuration Options¶
The to_mesh()
function supports additional keywords
for customizing the painting process. These keywords are:
- weight: The weight keyword can be used to select a column to use as weights when painting objects to a mesh. By default, it is set to the Weight column, a default column equal to unity for all objects. In this default configuration, the density field on the mesh is normalized as \(1+\delta\). See cookbook/painting.ipynb#Painting-Multiple-Species-of-Particles for an example of using the weight keyword to represent particle masses (see also the sketch following this list).
- value: The value keyword can be used to select a column to use as the field value when painting a mesh. The mesh field is a weighted average of value, with the weights given by weight. By default, it is set to the Value column, a default column equal to unity for all objects. In this default configuration, the density field on the mesh is normalized as \(1+\delta\). See cookbook/painting.ipynb#Painting-the-Line-of-sight-Momentum-Field for an example of using the value keyword to paint the momentum field.
- selection: The selection keyword specifies a boolean column that selects a subset of the CatalogSource object to paint to the mesh. By default, the selection keyword is set to the Selection column, a default column in all CatalogSource objects that is set to True for all objects.
- position: By default, nbodykit assumes that the Position column is the name of the column holding the Cartesian coordinates of the objects in the catalog. Thus, the to_mesh() function uses this column to paint a catalog to a mesh. The user can change this behavior by specifying the name of the desired column using the position keyword of the to_mesh() function.
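For example, a sketch that paints only objects above a mass threshold, weighted by their mass (the Mass column and the threshold are illustrative), could look like:
from nbodykit.lab import UniformCatalog
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
cat['Mass'] = 10**(cat.rng.uniform(12, 15))
# paint only the massive objects, weighting each one by its mass
cat['Selection'] = cat['Mass'] > 1e13
mesh = cat.to_mesh(Nmesh=64, weight='Mass', selection='Selection')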
Gaussian Realizations¶
A Gaussian realization of a density field can be initialized directly
on a mesh using the LinearMesh
class.
This class generates the Fourier modes of density field with a variance
set by an input power spectrum function. It allows the user to create
density fields with a known power spectrum, which is often a useful tool in
large-scale structure analysis.
Users can take advantage of the built-in linear power spectrum
class, LinearPower
, or use their own
function to specify the desired power spectrum. The function should take
a single argument k
, the wavenumber. Several transfer functions can be
used with the LinearPower
class,
including from the CLASS CMB Boltzmann code, and the analytic fitting formulas from
Eisenstein and Hu 1998 with and without
Baryon Acoustic Oscillations.
In addition to the power spectrum function, users need to specify
a mesh size via the Nmesh
parameter and a box size via the BoxSize
parameter. For example, to create
a density field on a mesh using the 2015 Planck cosmological parameters
and the Eisenstein-Hu linear power spectrum at redshift \(z=0\), use
[3]:
from nbodykit.lab import LinearMesh, cosmology
from matplotlib import pyplot as plt
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift=0, transfer='EisensteinHu')
# initialize the mesh
mesh = LinearMesh(Plin, Nmesh=128, BoxSize=1380, seed=42)
# preview the density field
plt.imshow(mesh.preview(axes=[0,1]))
[3]:
<matplotlib.image.AxesImage at 0x7f0b1af1b6d8>

From In-memory Data¶
From a RealField
or ComplexField
¶
If a pmesh.pm.RealField
or pmesh.pm.ComplexField
object
is already stored in memory, it can easily be converted into a mesh object
using the FieldMesh
class. For example,
[4]:
from nbodykit.lab import FieldMesh
from pmesh.pm import RealField, ComplexField, ParticleMesh
# a 8^3 mesh
pm = ParticleMesh(Nmesh=[8,8,8])
# initialize a RealField
rfield = RealField(pm)
# set entire mesh to unity
rfield[...] = 1.0
# initialize from the RealField
real_mesh = FieldMesh(rfield)
# can also initialize from a ComplexField
cfield = rfield.r2c()
complex_mesh = FieldMesh(cfield)
From a Numpy Array¶
Given a 3D numpy array stored in memory that represents data on a mesh, users
can initialize a mesh object using the
ArrayMesh
class. The array for the full
mesh must be stored in memory on a single rank and not split in parallel across
multiple ranks. After initializing the ArrayMesh
object, the mesh data will be automatically spread out across the available ranks.
A common use case for this class is when a single rank handles the input/output
of the mesh data in the form of numpy arrays. Then, a single rank can read
in the array data from disk, and the mesh object can be initialized using
the ArrayMesh
class.
For example,
[5]:
from nbodykit.lab import ArrayMesh
import numpy
# generate random data on a 128^3 mesh
data = numpy.random.random(size=(128,128,128))
# initialize the mesh
mesh = ArrayMesh(data, BoxSize=1.0)
# preview the density mesh
plt.imshow(mesh.preview(axes=[0,1]))
[5]:
<matplotlib.image.AxesImage at 0x7f0b1aebf0b8>

Painting Catalogs to a Mesh¶
The MeshSource.paint()
function produces the values
of the field on the mesh, returning either a RealField
or
ComplexField
. In this section, we focus on the
process of interpolating a set of discrete objects in a CatalogSource
on to a mesh and how users can customize this procedure.
The Painted Field¶
The compute() function paints mass-weighted (or equivalently, number-weighted) fields to a mesh. So, when painting a CatalogSource to a mesh, the field \(F(\vx)\) that is painted is:

\[F(\vx) = \left[1 + \delta'(\vx)\right] V(\vx),\]

where \(V(\vx)\) represents the field value painted to the mesh and \(\delta'(\vx)\) is the (weighted) overdensity field, given by:

\[1 + \delta'(\vx) = \frac{n'(\vx)}{\bar{n}'},\]

where \(\bar{n}'\) is the weighted mean number density of objects. Here, quantities denoted with a prime (\('\)) indicate weighted quantities. The unweighted number density field \(n(\vx)\) is related to its weighted counterpart via \(n'(\vx) = W(\vx) n(\vx)\), where \(W(\vx)\) are the weights.
Users can control the behavior of the value \(V(\vx)\) and the weights
\(W(\vx)\) when converting a CatalogSource
object to a mesh
via the to_mesh()
function. Specifically, the
weight
and value
keywords allow users to indicate the name of
the column in the CatalogSource
to use for \(W(\vx)\)
and \(V(\vx)\). See Additional Mesh Configuration Options for more details
on these keywords.
Operations¶
The painted field is an instance of pmesh.pm.RealField
. Methods
are provided for Fourier transforms (to a pmesh.pm.ComplexField
object),
and transfer functions on both the Real and Complex fields:
field = mesh.compute()
def tf(k, v):
return 1j * k[2] / k.normp(zeromode=1) ** 0.5 * v
gradz = field.apply(tf).c2r()
The underlying numerical values of the field can be accessed via indexing. A RealField is distributed across the entire MPI communicator of the mesh object, and in general each rank in the MPI communicator only sees a region of the field.
- numpy methods (e.g. field[...].std()) operate on the local field values only, so they compute the result on a single rank and are therefore only correct when a single rank is used.
- collective methods provide the correct result, reduced over the entire MPI communicator. For example, to compute the standard deviation of the field in a script that runs on several MPI ranks, we should use ((field ** 2).cmean() - field.cmean() ** 2) ** 0.5 instead of field[...].std().
The positions of the grid points on which the field value resides can be obtained from
field = mesh.compute()
grid = field.pm.generate_uniform_particle_grid(shift=0)
A low-resolution projected preview of the field can be obtained (the example below projects along the x-y plane):
field = mesh.compute()
imshow(field.preview(Nmesh=64, axes=[0, 1]).T,
origin='lower',
extent=(0, field.BoxSize[0], 0, field.BoxSize[1]))
Shot-noise¶
The shot-noise level of a weighted field is given by

\[P_\mathrm{shot} = L^3 \frac{\sum_i W_i^2}{\left(\sum_i W_i\right)^2},\]

where \(L^3\) is the total volume of the box and \(W_i\) is the weight of each individual object. We see that in the limit where \(W=1\) everywhere, the shot noise is simply \(1 / \bar{n}\).
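As a quick numpy check of this expression (with made-up weights), consider:
import numpy
L, N = 1000.0, 100000             # box side in Mpc/h and number of objects (illustrative)
W = numpy.random.uniform(0.5, 1.5, size=N)
# shot noise of the weighted field
P_shot = L**3 * (W**2).sum() / W.sum()**2
print(P_shot)
# with unit weights this reduces to 1/nbar = L^3 / N
print(L**3 / N)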
Default Behavior¶
The default behavior is \(W(\vx) = 1\) and \(V(\vx) = 1\), in which case the painted field is given by:

\[F(\vx) = 1 + \delta(\vx).\]
In the CatalogSource.to_mesh()
function, the default
values for the value
and weight
keywords are the Value
and
Weight
columns, respectively. These are
default columns that are in all
CatalogSource
objects that are set to unity by default.
More Examples¶
The Painting Recipes section of the cookbook contains several more examples that change the default behavior to paint customized fields to the mesh.
For example, users can set \(V(\vx)\) to a column holding a component of the velocity field, in which case the painted field \(F(\vx)\) would represent the momentum (mass-weighted velocity) field. See the Painting the Line-of-sight Momentum Field recipe for further details.
Another common example is setting the weights \(W(\vx)\) to a
column representing mass and painting multiple species of particles
to the same mesh using the
MultipleSpeciesCatalog
object.
See the
Painting Multiple Species of Particles
recipe for more details.
Common Mesh Operations¶
In this section, we detail some common operations for manipulating data
defined on a mesh via a MeshSource
object.
Previewing the Mesh¶
The MeshSource.preview()
function allows users to preview a
low-resolution mesh by resampling the mesh and gathering the mesh to all
ranks. It can also optionally project the mesh across multiple axes, which
enables visualizations of the projected density field for quick data
inspection by the user.
For example, below we initialize a
LinearMesh
object on a \(128^3\)
mesh and preview the mesh on a \(64^3\) mesh after projecting the field
along two axes:
[2]:
from nbodykit.lab import LinearMesh, cosmology
from matplotlib import pyplot as plt
cosmo = cosmology.Planck15
Plin = cosmology.LinearPower(cosmo, redshift=0, transfer='EisensteinHu')
mesh = LinearMesh(Plin, Nmesh=128, BoxSize=1380, seed=42)
density = mesh.preview(Nmesh=64, axes=(0,1))
plt.imshow(density)
[2]:
<matplotlib.image.AxesImage at 0x7fb154c74828>

Note
The previewed mesh result is broadcast to all ranks, so each rank allocates \(\mathrm{Nmesh}^3\) in memory.
Saving and Loading a Mesh¶
The MeshSource.save()
function paints the mesh via a call to
MeshSource.paint()
and saves the mesh using a bigfile
format.
The output mesh saved to file can be in either configuration space or Fourier
space, by specifying mode
as either real
or complex
.
Below, we save our LinearMesh
to a
bigfile
file:
[3]:
# save the RealField
mesh.save('linear-mesh-real.bigfile', mode='real', dataset='Field')
# save the ComplexField
mesh.save('linear-mesh-complex.bigfile', mode='complex', dataset='Field')
The saved mesh can be loaded from disk using the
BigFileMesh
class:
[4]:
from nbodykit.lab import BigFileMesh
import numpy
# load the mesh in the form of a RealField
real_mesh = BigFileMesh('linear-mesh-real.bigfile', 'Field')
# return the RealField via paint
rfield = real_mesh.paint(mode='real')
# load the mesh in the form of a ComplexField
complex_mesh = BigFileMesh('linear-mesh-complex.bigfile', 'Field')
# FFT to get the ComplexField as a RealField
rfield2 = complex_mesh.paint(mode='real')
# the two RealFields must be the same!
numpy.allclose(rfield.value, rfield2.value)
[4]:
True
Here, we load our meshes in configuration space and Fourier space and then
paint both with mode=real
and verify that the results are the same.
Applying Functions to the Mesh¶
nbodykit supports performing transformations to the mesh data by applying
arbitrary functions in either configuration space or Fourier space. Users
can use the MeshSource.apply()
function to apply these transformations.
The function applied to the mesh should take two arguments, x
and v
:
- The x argument provides a list of length three holding the coordinate arrays that define the mesh. These arrays broadcast to the full shape of the mesh, i.e., they have shapes \((N_x,1,1)\), \((1,N_y,1)\), and \((1,1,N_z)\) if the mesh has shape \((N_x, N_y, N_z)\).
- The v argument is the array holding the value of the mesh field at the coordinate arrays in x.
The units of the x
coordinate arrays depend upon the values of the
kind
and mode
keywords passed to the apply()
function.
The various cases are:
| mode | kind | range of the “x” argument |
| --- | --- | --- |
| real | relative | \([-L/2, L/2)\) |
| real | index | \([0, N)\) |
| complex | wavenumber | \([- \pi N/L, \pi N / L)\) |
| complex | circular | \([-\pi, \pi)\) |
| complex | index | \([0, N)\) |

Here, \(L\) is the size of the box and \(N\) is the number of cells per mesh side.
One common use of the MeshSource.apply() functionality is applying a compensation function to the mesh to correct for the interpolation window. The table of built-in compensation functions in the Compensation: Deconvolving the Window Kernel section provides examples of the syntax needed to apply functions to the mesh.
In the example below, we apply a filter function in Fourier space that divides
the mesh by the squared norm of the wavenumber k
on the mesh, and then
print out the first few mesh cells of the filtered mesh to verify the
function was applied properly.
[5]:
def filter(k, v):
kk = sum(ki ** 2 for ki in k) # k^2 on the mesh
kk[kk == 0] = 1
return v / kk # divide the mesh by k^2
# apply the filter and get a new mesh
filtered_mesh = mesh.apply(filter, mode='complex', kind='wavenumber')
# get the filtered RealField object
filtered_rfield = filtered_mesh.paint(mode='real')
print("head of filtered Realfield = ", filtered_rfield[:10,0,0])
print("head of original RealField = ", rfield[:10,0,0])
head of filtered Realfield = [1094.8562 1060.4702 1058.0209 1059.883 1067.5662 1104.8212 1150.1516
1193.5796 1244.2278 1295.758 ]
head of original RealField = [0.14927006 0.85538614 1.7375357 2.270503 1.6951047 1.9259002
1.4472127 0.8612448 0.8651851 1.6219351 ]
Resampling a Mesh¶
Users can resample a mesh by specifying the Nmesh
keyword to the
MeshSource.paint()
function.
For example, below we resample a
LinearMesh
object, changing the mesh
resolution from Nmesh=128
to Nmesh=32
.
[6]:
from nbodykit.lab import LinearMesh, cosmology
# linear mesh
Plin = cosmology.LinearPower(cosmology.Planck15, redshift=0.55, transfer='EisensteinHu')
source = LinearMesh(Plin, Nmesh=64, BoxSize=512, seed=42)
# paint, re-sampling to Nmesh=32
real = source.paint(mode='real', Nmesh=32)
print("original Nmesh = ", source.attrs['Nmesh'])
print("resampled Nmesh = ", real.Nmesh)
print("shape of resampled density field = ", real.cshape)
original Nmesh = [64 64 64]
resampled Nmesh = [32 32 32]
shape of resampled density field = [32 32 32]
Available Algorithms¶
nbodykit includes several state-of-the-art implementations of canonical algorithms in the field of large-scale structure. It includes functionality for computing a wide variety of clustering statistics on data, as well as algorithms for finding groups of objects.
We describe these algorithms in detail below, and users can also find further examples in The Cookbook section of the documentation.
Power Spectrum Algorithms¶
Simulation Box Power Spectrum/Multipoles (FFTPower
)¶
The FFTPower
class computes the 1d power spectrum \(P(k)\), 2d
power spectrum \(P(k,\mu)\), and/or multipoles \(P_\ell(k)\) for data
in a simulation box, using a Fast Fourier Transform (FFT). Here, we provide
a brief overview of the algorithm itself as well as the key things to know for
the user to get up and running quickly.
Note
To jump right into the FFTPower
algorithm, see this
cookbook recipe for a detailed
walk-through of the FFTPower
algorithm.
The Algorithm¶
The steps involved in computing the power spectrum via FFTPower
are as follows:
1. Generate data on a mesh

   Data must be painted on to a discrete mesh to compute the power spectrum. There are several ways to generate data on a mesh (see Creating a Mesh), but the most common is painting a discrete catalog of objects on to a mesh (see Converting a CatalogSource to a Mesh and Painting Catalogs to a Mesh). The FFTPower class accepts input data either in the form of a MeshSource or a CatalogSource. In the latter case, the catalog is automatically converted to a mesh using the default parameters of the to_mesh() function.

   When converting from a catalog to a mesh, users can customize the painting procedure via the options of the to_mesh() function. These options have important effects on the resulting power spectrum of the field in Fourier space. See Converting a CatalogSource to a Mesh for more details.

2. FFT the mesh to Fourier space

   Once the density field is painted to the mesh, the Fourier transform of the field \(\delta(\vx)\) is performed in parallel to obtain the complex modes of the overdensity field, \(\delta(\vk)\). The field is stored using the ComplexField object.

3. Generate the 3D power spectrum on the mesh

   The 3D power spectrum field is computed on the mesh, using

   \[P(\mathbf{k}) = \delta(\mathbf{k}) \cdot \delta^\star(\mathbf{k}),\]

   where \(\delta^\star(\mathbf{k})\) is the complex conjugate of \(\delta(\mathbf{k})\).

4. Perform the binning in the specified basis

   Finally, the 3D power defined on the mesh \(P(\mathbf{k})\) is binned using the basis specified by the user. The available options for binning are:

   - 1D binning as a function of wavenumber \(k\)
   - 2D binning as a function of wavenumber \(k\) and cosine of the angle to the line-of-sight \(\mu\)
   - Multipole binning as a function of \(k\) and multipole number \(\ell\)
The Functionality¶
Users can compute various quantities using the FFTPower
. We’ll discuss
the available functionality briefly in the sub-sections below.
Both auto and cross spectra are supported. Users can compute cross power spectra
by passing a second mesh object to the FFTPower
class using
the second
keyword. The first mesh object should always be specified as
the first
argument.
The 1D power spectrum \(P(k)\) can be computed by specifying the
mode
argument as “1d”. The wavenumber binning will be linear, and can be
customized by specifying the dk
and kmin
attributes. By default,
the edge of the last wavenumber bin is the
Nyquist frequency, given
by \(k_\mathrm{Nyq} = \pi N_\mathrm{mesh} / L_\mathrm{box}\). If dk
is not specified, then the fundamental mode of the box is used:
\(2\pi/L_\mathrm{box}\).
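For instance, a 1D measurement with custom linear bins might look like the following sketch (the dk and kmin values are illustrative):
from nbodykit.lab import UniformCatalog, FFTPower
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
# 1D power spectrum with custom linear k bins
r1d = FFTPower(cat, mode='1d', Nmesh=32, dk=25.0, kmin=5.0)
print(r1d.power)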
The 2D power spectrum \(P(k,\mu)\) can be computed by specifying the
mode
argument as “2d”. The number of \(\mu\) bins is specified via
the Nmu
keyword. The bins range from \(\mu=0\) to \(\mu=1\).
The FFTPower class can also compute the multipoles of the 2D power spectrum, defined as

\[P_\ell(k) = (2\ell+1) \int_0^1 d\mu \, P(k, \mu) \, \mathcal{L}_\ell(\mu),\]

where \(\mathcal{L}_\ell\) is the Legendre polynomial of order
\(\ell\). Users can specify which multipoles they wish to compute
by passing a list of the desired \(\ell\) values as the poles
keyword to the FFTPower
class.
For example, we can compute both \(P(k,\mu)\) and \(P_\ell(k)\) for a uniform catalog of objects using:
[2]:
from nbodykit.lab import UniformCatalog, FFTPower
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
r = FFTPower(cat, mode='2d', Nmesh=32, Nmu=5, poles=[0,2,4])
The Results¶
The power spectrum results are stored in two attributes of the
initialized FFTPower
object:
power
and poles
. These attributes are
BinnedStatistic
objects, which
behave like structured numpy arrays and store
the measured results on a coordinate grid defined by the bins.
See Analyzing your Results for a full tutorial on using
the BinnedStatistic
class.
The power
attribute stores the following variables:
- k: the mean value for each k bin
- mu: if mode=2d, the mean value for each mu bin
- power: complex array storing the real and imaginary components of the power
- modes: the number of Fourier modes averaged together in each bin
The poles
attribute stores the following variables:
- k: the mean value for each k bin
- power_L: complex array storing the real and imaginary components for the \(\ell=L\) multipole
- modes: the number of Fourier modes averaged together in each bin
Note that measured power results for bins where modes
is zero (no data points
to average over) are set to NaN
.
In our example, the power
and poles
attributes are:
[3]:
# the 2D power spectrum results
print("power = ", r.power)
print("variables = ", r.power.variables)
for name in r.power.variables:
var = r.power[name]
print("'%s' has shape %s and dtype %s" %(name, var.shape, var.dtype))
power = <BinnedStatistic: dims: (k: 16, mu: 5), variables: ('k', 'mu', 'power', 'modes')>
variables = ['k', 'mu', 'power', 'modes']
'k' has shape (16, 5) and dtype float64
'mu' has shape (16, 5) and dtype float64
'power' has shape (16, 5) and dtype complex128
'modes' has shape (16, 5) and dtype int64
[4]:
# the multipole results
print("poles = ", r.poles)
print("variables = ", r.poles.variables)
for name in r.poles.variables:
var = r.poles[name]
print("'%s' has shape %s and dtype %s" %(name, var.shape, var.dtype))
poles = <BinnedStatistic: dims: (k: 16), variables: 5 total>
variables = ['k', 'power_0', 'power_2', 'power_4', 'modes']
'k' has shape (16,) and dtype float64
'power_0' has shape (16,) and dtype complex128
'power_2' has shape (16,) and dtype complex128
'power_4' has shape (16,) and dtype complex128
'modes' has shape (16,) and dtype int64
These attributes also store meta-data computed during the power calculation
in the attrs
dictionary. Most importantly, the shotnoise
key
gives the Poisson shot noise, \(P_\mathrm{shot} = V / N\), where V
is the volume of the simulation box and N is the number of objects. The keys
N1
and N2
store the number of objects.
In our example, the meta-data is:
[5]:
for k in r.power.attrs:
print("%s = %s" %(k, str(r.power.attrs[k])))
Nmesh = [32 32 32]
BoxSize = [1. 1. 1.]
Lx = 1.0
Ly = 1.0
Lz = 1.0
volume = 1.0
mode = 2d
los = [0, 0, 1]
Nmu = 5
poles = [0, 2, 4]
dk = 6.283185307179586
kmin = 0.0
N1 = 96
N2 = 96
shotnoise = 0.010416666666666666
Note
The shot noise is not subtracted from any measured results. Users can
access the Poisson shot noise value in the meta-data attrs
dictionary.
Saving and Loading¶
Results can easily be saved and loaded from disk in a reproducible manner
using the FFTPower.save()
and FFTPower.load()
functions.
The save
function stores the state of the algorithm,
including the meta-data in the FFTPower.attrs
dictionary, in a
JSON plaintext format.
[6]:
# save to file
r.save("fftpower-example.json")
# load from file
r2 = FFTPower.load("fftpower-example.json")
print("power = ", r2.power)
print("poles = ", r2.poles)
print("attrs = ", r2.attrs)
power = <BinnedStatistic: dims: (k: 16, mu: 5), variables: ('k', 'mu', 'power', 'modes')>
poles = <BinnedStatistic: dims: (k: 16), variables: 5 total>
attrs = {'Nmesh': array([32, 32, 32]), 'BoxSize': array([1., 1., 1.]), 'Lx': 1.0, 'Ly': 1.0, 'Lz': 1.0, 'volume': 1.0, 'mode': '2d', 'los': [0, 0, 1], 'Nmu': 5, 'poles': [0, 2, 4], 'dk': 6.283185307179586, 'kmin': 0.0, 'N1': 96, 'N2': 96, 'shotnoise': 0.010416666666666666}
Common Pitfalls¶
The default configuration of nbodykit should lead to reasonable results
when using the FFTPower
algorithm. When performing custom, more complex
analyses, some of the more common pitfalls are:
- When the results of FFTPower do not seem to make sense, the most common culprit is usually the configuration of the mesh, and whether or not the mesh is “compensated”. In the language of nbodykit, “compensated” refers to whether the effects of the interpolation window used to paint the density field have been de-convolved in Fourier space. See the Converting a CatalogSource to a Mesh section for detailed notes on this procedure.
- Be wary of normalization issues when painting weighted density fields. See Painting Catalogs to a Mesh for further details on painting meshes and Applying Functions to the Mesh for notes on applying arbitrary functions to the mesh while painting. See this cookbook recipe for examples of more advanced painting uses.
Power Spectrum Multipoles of Survey Data (ConvolvedFFTPower
)¶
The ConvolvedFFTPower
class computes the power spectrum multipoles
\(P_\ell(k)\) for data from an observational survey that includes non-trivial
selection effects. The input data is expected to be in the form of
angular coordinates (right ascension and declination) and redshift.
The measured power spectrum multipoles represent the true multipoles convolved
with the window function. Here, the window function refers to the Fourier
transform of the survey volume. In this section, we provide an overview of the
FFT-based algorithm used to compute the multipoles
and detail important things to know for the user to get up and running quickly.
Note
To jump right into the ConvolvedFFTPower
algorithm, see this
cookbook recipe for a detailed
walk-through of the algorithm.
Some Background¶
We begin by defining the estimator for the multipole power spectrum, often referred to as the “Yamamoto estimator” in the literature (Yamamoto et al 2006):

\[P_\ell(k) = \frac{2\ell+1}{A} \int \frac{d\Omega_k}{4\pi} \left[ \int d\vr_1 \, F(\vr_1) e^{i \vk \cdot \vr_1} \int d\vr_2 \, F(\vr_2) e^{-i \vk \cdot \vr_2} \, \mathcal{L}_\ell(\hat{\vk} \cdot \hat{\vr}_2) - P_\ell^{\rm noise}(\vk) \right], \tag{1}\]

where \(\Omega_k\) represents the solid angle in Fourier space, and \(\mathcal{L}_\ell\) is the Legendre polynomial of order \(\ell\). The weighted density field \(F(\vr)\) is defined as

\[F(\vr) = w_\mathrm{FKP}(\vr) \left[ n_g'(\vr) - \alpha' \, n_s'(\vr) \right], \tag{2}\]
where \(n_g'\) and \(n_s'\) are the number densities of the galaxy catalog and the synthetic catalog of random objects, respectively, and \(\alpha'\) is the ratio of the number of real galaxies to random galaxies. Often, the catalog of random objects has a much higher number density than the galaxy catalog, and the factor of \(\alpha'\) re-normalizes to the proper number density. The normalization \(A\) is given by

\[A = \int d\vr \, \left[ n_g'(\vr) \, w_\mathrm{FKP}(\vr) \right]^2. \tag{3}\]
The shot noise \(P_\ell^{\rm noise}\) in equation (1) is
The FKP weights, first derived in Feldman, Kaiser, and Peacock 1994, minimize the variance of the estimator at a desired power spectrum value. Denoted as \(w_\mathrm{FKP}\), these weights are given by

\[w_\mathrm{FKP}(\vr) = \frac{1}{1 + n_g'(\vr) P_0}, \tag{5}\]

where \(P_0\) is the power spectrum amplitude in units of \(h^{-3} \mathrm{Mpc}^3\) where the estimator is optimized. For typical galaxy survey analyses, a value of order \(P_0 = 10^{4} \ h^{-3} \mathrm{Mpc}^3\) is usually assumed.
In our notation, quantities marked with a prime (\('\)) include completeness weights. These weights, denoted as \(w_c\), help account for systematic variations in the number density fields. Typically the weights for the random catalog are unity, but for full generality we allow for the possibility of non-unity weights for \(n_s\) as well. For example, \(\alpha' = N'_\mathrm{gal} / N'_\mathrm{ran}\), where \(N'_\mathrm{gal} = \sum_i^{N_\mathrm{gal}} w_c^i\).
The Algorithm¶
We use the FFT-based algorithm of Hand et al 2017 to compute the power spectrum multipoles. This work improves upon the previous FFT-based estimators of \(P_\ell(k)\) presented in Bianchi et al 2015 and Scoccimarro 2015 by requiring only \(2\ell+1\) FFTs to compute a multipole of order \(\ell\). The algorithm uses the spherical harmonic addition theorem to express equation (1) as
where
The sum over \(m\) in equation (7) contains \(2 \ell + 1\)
terms, each of which can be computed using a FFT. Thus, the multipole moments
can be expressed as a sum of Fourier transforms of the weighted density field,
with weights given by the appropriate spherical harmonic.
We evaluate equation (7) using the real-to-complex FFT functionality
of the pmesh
package and
use the real form of the spherical harmonics \(\Ylm\).
We use the symbolic manipulation functionality available in the SymPy Python package to compute the spherical harmonic expressions in terms of Cartesian vectors. This allows the user to specify the desired multipoles at runtime, enabling the code to be used to compute multipoles of arbitrary \(\ell\).
Getting Started¶
Here, we outline the necessary steps for users to get started using the
ConvolvedFFTPower
algorithm to compute the power spectrum multipoles
from an input data catalog.
FKPCatalog Class¶
The ConvolvedFFTPower algorithm requires a galaxy catalog and a
synthetic catalog of random objects without any clustering signal,
which we refer to as the “data” and “randoms” catalogs, respectively.
The CatalogSource
object responsible for handling these two types
of catalogs is the FKPCatalog
class.
The FKPCatalog
determines the size of
the Cartesian box that the “data” and “randoms” are placed in, which is then
also used during the FFT operation. By default, the box size is determined
automatically from the maximum extent of the “randoms” positions.
In this automatic case, the size of the box can be artificially extended and
padded with zeros via the BoxPad
keyword.
Users can also specify a desired box size by passing in the BoxSize
keyword.
The FKPCatalog
object can be converted to a mesh object,
FKPCatalogMesh
, via the
to_mesh()
function. This
mesh object knows how to paint the FKP weighted density field,
given by equation (2), to the mesh using the “data” and “randoms” catalogs.
With the FKP density field painted to the mesh, the ConvolvedFFTPower
algorithm uses equations (6) and (7) to compute the
multipoles specified by the user.
The Position
column, holding the Cartesian coordinates,
is required for both the “data” and “randoms” catalogs.
We provide the function nbodykit.transform.SkyToCartesian()
for converting
sky coordinates, in the form of right ascension, declination, and redshift,
to Cartesian coordinates. The conversion from redshift to comoving distance
requires a cosmology instance, which can be specified via the
Cosmology
class.
For more details, see Converting from Sky to Cartesian Coordinates.
The number density of the “data” catalog as a function of redshift, in units of \(h^{3} \mathrm{Mpc}^{-3}\), is required to properly normalize the power spectrum using equation (3) and to compute the shot noise via equation (4). The “data” and “randoms” catalog should contain a column that gives this quantity, evaluated at the redshift of each object in the catalogs.
When converting from a FKPCatalog
to a FKPCatalogMesh
, the name
of the \(n'_g(z)\) column should be passed as the nbar
keyword to the
to_mesh()
function. The
\(n'_g(z)\) column should have the same name in the “data” and “randoms”
catalogs.
By default, the name of the nbar
column is set to NZ
. If this
column is missing from the “data” or “randoms” catalog, an exception
will be raised.
Note that the RedshiftHistogram
algorithm
can compute a weighted \(n(z)\) for an input catalog and may be useful
if the user needs to compute \(n_g(z)\).
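As a sketch of this pattern (the sky fraction FSKY and the 'Z' column name are assumptions the user must supply):
from nbodykit.lab import RedshiftHistogram, cosmology
from scipy.interpolate import InterpolatedUnivariateSpline

cosmo = cosmology.Planck15
FSKY = 0.15  # assumed fraction of the sky covered by the survey

# the weighted n(z) of the randoms, on a redshift grid
zhist = RedshiftHistogram(randoms, FSKY, cosmo, redshift='Z')

# re-scale the randoms' n(z) to the number density of the data
alpha = 1.0 * data.csize / randoms.csize
nofz = InterpolatedUnivariateSpline(zhist.bin_centers, alpha * zhist.nbar)

# evaluate n_g(z) at each object's redshift and store it in the default 'NZ' column
data['NZ'] = nofz(data['Z'])
randoms['NZ'] = nofz(randoms['Z'])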
Important
By construction, the objects in the “randoms” catalog should follow the
same redshift distribution as the “data” catalog. Often, the “randoms”
will have an overall number density that is 10, 50, or 100 times the number
density of the “data” catalog. Even if the “randoms” catalog has a higher
number density, the nbar
column in both the “data” and “randoms”
catalogs should hold number density values on the same scale, corresponding
to the value of \(n_g(z)\) at the redshift of the objects in the
catalogs.
The ConvolvedFFTPower
algorithm supports the use of completeness
weights to account for systematic variations in the number density of the
“data” and “randoms” objects. In our notation in the earlier
background section, quantities marked with a
prime (\('\)) include completeness weights. The weighted and
unweighted number densities of the “data” and “randoms” fields are
then related by:
Typically the weights for the random catalog are unity, but for full generality, we allow for the possibility of non-unity weights for \(n_s\) as well. For example, \(\alpha' = N'_\mathrm{gal} / N'_\mathrm{ran}\), where \(N'_\mathrm{gal} = \sum_i^{N_\mathrm{gal}} w_{c,g}^i\) and \(N'_\mathrm{ran} = \sum_i^{N_\mathrm{ran}} w_{c,s}^i\).
When converting from a FKPCatalog
to a FKPCatalogMesh
, the name
of the completeness weight column should be passed as the comp_weight
keyword to the to_mesh()
function.
By default, the name of the comp_weight
column is set to the
default column Weight
, which has a value of unity for all objects.
If specifying a different name, the column should have the same name in both
the “data” and “randoms” catalogs.
Users can also specify the name of a column in the “data” and “randoms” catalogs that represents a FKP weight for each object, as given by equation (5). The FKP weights do not weight the individual number density fields as the completeness weights do, but rather they weight the combined field, \(n'_g(\vr) - \alpha' n'_s(\vr)\) (see equation (2)).
When converting from a FKPCatalog
to a FKPCatalogMesh
, the name
of the FKP weight column should be passed as the fkp_weight
keyword to the to_mesh()
function.
By default, the name of the fkp_weight
column is set to FKPWeight
,
which has a value of unity for all objects. If specifying a different name,
the column should have the same name in both the “data” and “randoms”
catalogs.
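For example, FKP weights corresponding to equation (5) can be added by hand; the \(P_0 = 10^4 \ h^{-3}\mathrm{Mpc}^3\) value here is only an illustrative choice:
P0 = 1e4  # illustrative P0 value, in (Mpc/h)^3

# w_FKP = 1 / (1 + n'_g(z) * P0), stored in the default FKPWeight column
data['FKPWeight'] = 1.0 / (1.0 + data['NZ'] * P0)
randoms['FKPWeight'] = 1.0 / (1.0 + randoms['NZ'] * P0)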
The ConvolvedFFTPower
algorithm can also automatically generate
and use FKP weights from the input nbar
column if the user specifies
the use_fkp_weights
keyword of the algorithm to be True
. In this case,
the user must also specify the P0_FKP
keyword, which gives the desired
\(P_0\) value to use in equation (5).
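Alternatively, continuing the sketch above, the weights can be generated internally:
# generate FKP weights automatically from the 'NZ' column, with P0 = 1e4 (Mpc/h)^3
result = ConvolvedFFTPower(mesh, poles=[0, 2, 4], use_fkp_weights=True, P0_FKP=1e4)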
The Results¶
The multipole results are stored as the poles
attribute of the initialized ConvolvedFFTPower
object.
This attribute is a BinnedStatistic
object,
which behaves like a structured numpy array and stores
the measured results on a coordinate grid defined by the wavenumber bins
specified by the user. See Analyzing your Results for a full tutorial on using
the BinnedStatistic
class.
The poles
attribute stores the following variables:
- k : the mean value for each k bin
- power_L : complex array storing the real and imaginary components for the \(\ell=L\) multipole
- modes : the number of Fourier modes averaged together in each bin
Note that measured power results for bins where modes
is zero (no data points
to average over) are set to NaN
.
Note
The shot noise is not subtracted from any measured results. Users can
access the shot noise value for the monopole, computed according to
equation (4) in the meta-data attrs
dictionary. Shot
noise for multipoles with \(\ell>0\) is assumed to be zero.
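For example, continuing the earlier sketch in which the \(\ell = 0, 2, 4\) multipoles were requested:
poles = result.poles

# wavenumber bin centers and the shot-noise-subtracted monopole
k = poles['k']
P0 = poles['power_0'].real - poles.attrs['shotnoise']

# higher multipoles (no shot noise is subtracted for ell > 0)
P2 = poles['power_2'].real
P4 = poles['power_4'].real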
Several important meta-data calculations are also performed during the
algorithm’s execution. This meta-data is stored in both the
ConvolvedFFTPower.attrs
attribute and the ConvolvedFFTPower.poles.attrs
attribute.
- data.N, randoms.N :
the unweighted number of data and randoms objects
- data.W, randoms.W :
the weighted number of data and randoms objects, using the column specified as the completeness weights. This is given by:
\[ \begin{align}\begin{aligned}W_\mathrm{data} &= \sum_\mathrm{data} w_c\\W_\mathrm{ran} &= \sum_\mathrm{randoms} w_c\end{aligned}\end{align} \]
- alpha :
the ratio of
data.W
torandoms.W
- data.norm, randoms.norm :
the normalization \(A\) of the power spectrum, computed from either the “data” or “randoms” catalog (they should be similar). They are given by:
(8)¶\[ \begin{align}\begin{aligned}A_\mathrm{data} &= \sum_\mathrm{data} n'_g w_c \wfkp^2\\A_\mathrm{ran} &= \alpha \sum_\mathrm{randoms} n'_g w_c \wfkp^2\end{aligned}\end{align} \]
- data.shotnoise, randoms.shotnoise :
the contributions to the monopole shot noise from the “data” and “random” catalogs. These values are given by:
\[ \begin{align}\begin{aligned}P^\mathrm{shot}_\mathrm{data} &= A_\mathrm{ran}^{-1} \sum_\mathrm{data} (w_c \wfkp)^2\\P^\mathrm{shot}_\mathrm{ran} &= \alpha^2 A_\mathrm{ran}^{-1} \sum_\mathrm{randoms} (w_c \wfkp)^2\end{aligned}\end{align} \]
- shotnoise :
the total shot noise for the monopole power spectrum, which should be subtracted from the monopole measurement in
ConvolvedFFTPower.poles
. This is computed as:\[P^\mathrm{shot} = P^\mathrm{shot}_\mathrm{data} + P^\mathrm{shot}_\mathrm{ran}\]
- BoxSize :
the size of the Cartesian box used to grid the “data” and “randoms” objects on the Cartesian mesh.
The ConvolvedFFTPower.to_pkmu()
function allows users to rotate the measured
multipoles, stored as the ConvolvedFFTPower.poles
attribute, into
\(P(k,\mu)\) wedges (bins in \(\mu\)). The function returns a
BinnedStatistic
holding the
binned \(P(k,\mu)\) data.
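For example, a sketch of rotating the \(\ell = 0, 2, 4\) multipoles into five \(\mu\) wedges, assuming the call takes the \(\mu\) bin edges and the maximum \(\ell\) to include:
import numpy

# edges of 5 equally-spaced mu bins on [0, 1], using multipoles up to ell = 4
pkmu = result.to_pkmu(numpy.linspace(0, 1, 6), 4)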
Saving and Loading¶
Results can easily be saved and loaded from disk in a reproducible manner
using the ConvolvedFFTPower.save()
and ConvolvedFFTPower.load()
functions. The save
function stores the state of
the algorithm, including the meta-data in the ConvolvedFFTPower.attrs
dictionary, in a JSON plaintext format.
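For example:
# save the algorithm state, including the attrs meta-data, to a JSON file
result.save('nbkit_convpower.json')

# ... and re-load it later
result = ConvolvedFFTPower.load('nbkit_convpower.json')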
Common Pitfalls¶
Some of the more common issues users run into are:
- The Position column is missing from the input “data” and “randoms” catalogs. The input positions for this algorithm are assumed to be in terms of right ascension, declination, and redshift, and users must explicitly add the Position column holding the Cartesian coordinates to both the “data” and “randoms” catalogs.
- Normalization issues may occur if the number density columns in the “data” and “randoms” catalogs are on different scales. Similar issues may arise if the FKP weight column uses number density values on different scales. In all cases, the number density to be used should be that of the data, denoted as \(n'_g(z)\). The algorithm will compute the power spectrum normalization from both the “data” and “randoms” catalogs (as given by equation (8)). If the values do not agree, there is likely an issue with varying number density scales, and the algorithm will raise an exception.
Power Spectrum of a Projected Field (ProjectedFFTPower)¶
Correlation Function Algorithms¶
Correlation Functions via FFT (FFTCorr)¶
Pair Counts for Data in a Simulation Box (SimulationBoxPairCount)¶
Pair Counts for Survey Data (SurveyDataPairCount)¶
This includes angular clustering as well.
Correlation Function for Data in a Simulation Box (SimulationBox2PCF)¶
Correlation Function for Survey Data (SurveyData2PCF)¶
This includes angular clustering as well.
3-pt Correlation Function for Data in a Simulation Box (SimulationBox3PCF)¶
3-pt Correlation Function for Survey Data (SurveyData3PCF)¶
Grouping Methods¶
Cylindrical Groups (CylindricalGroups)¶
Fiber Assignment and Collisions (FiberCollisions)¶
Miscellaneous¶
Computing the \(n(z)\) of Data (RedshiftHistogram)¶
Parallel Computation with nbodykit¶
nbodykit is fully parallelized and is built on top of the Message Passing Interface
(MPI) using the Python bindings provided by the mpi4py
library.
In this section, we discuss some of the crucial ways in which nbodykit
interacts with MPI. For those unfamiliar with MPI, a good place to start is the
documentation for mpi4py.
Running nbodykit in parallel¶
Users can execute Python scripts using nbodykit in parallel
using the standardized mpiexec
MPI executable. If installing nbodykit
through Anaconda (as described here), the
mpiexec
command will be available for use in your conda environment.
As an example, we can run
this example script
in parallel. The example script can be downloaded manually from GitHub, or
using the wget
utility:
# install wget via conda, if not installed
$ conda install wget
# download the example script
$ wget https://raw.githubusercontent.com/bccp/nbodykit/master/nersc/example.py
We can execute the example script with 4 MPI cores in parallel using:
# run with 4 MPI workers
$ mpiexec -n 4 python example.py
This example script generates a set of simulated objects and computes the
power spectrum via the FFTPower
algorithm, saving the result to a JSON file nbkit_example_power.json
.
The MPI calling sequence typically differs for supercomputing environments and depends on the task scheduler being used. For example, when using the Slurm manager (as on NERSC), the equivalent command to execute the example script is:
$ srun -n 4 python example.py
On a managed computing facility, it is also usually necessary to include directives that reserve the computing resources used to run the script. Refer to the ‘Submitting a Job’ section of the facility's user guide, e.g., for NERSC Cori, NERSC Edison, or the NCSA. Be aware, however, that these user guides are not always accurate, and it is best to check with someone who uses the facility regularly.
A Primer on MPI Communication¶
MPI stands for Message Passing Interface, and unsurprisingly, one of its key elements is communication between processes running in parallel. The MPI standard allows parallel processes, each of which owns its own memory, to exchange messages, so that the independent results computed by individual processes can be combined into a single result.
The MPI communicator object is responsible for managing the communication
of data and messages between parallel processes. In nbodykit, we manage the
current MPI communicator using the nbodykit.CurrentMPIComm
class.
By default, all nbodykit objects, i.e., catalog and mesh objects, use
the communicator defined by this class to exchange results. For
catalog and mesh objects, the communicator is stored as the comm
attribute.
Users can access the current communicator by calling the
get()
function of the
CurrentMPIComm
object. For example,
from nbodykit import CurrentMPIComm
comm = CurrentMPIComm.get()
The communicator object carries a rank
attribute, which provides
a unique numbering of the available processes controlled
by the communicator, and a size
attribute giving the total number of
processes within the communicator. Often, the rank
attribute
is used to reduce the number of messages printed to the terminal, e.g.
if comm.rank == 0:
print("I am Groot.")
In this case, we get only one print statement, whereas we would get as many as
comm.size
messages without the condition. The
tutorials
provided in the mpi4py
documentation provide
more examples illustrating the power of MPI communicators.
For more advanced users, the current communicator can be set using
the set()
function of the
CurrentMPIComm
object. This can be useful if the
default communicator (which includes all processes) is split into
sub-communicators. This framework is how we implement the task-based
parallelism provided by the TaskManager
object
(see the Task-based parallelism section below). Setting the current
MPI communicator in this manner only affects the creation of objects
after the comm has been set.
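A minimal sketch of this pattern, splitting MPI.COMM_WORLD into two sub-communicators (the splitting criterion here is arbitrary and only illustrative):
from mpi4py import MPI
from nbodykit import CurrentMPIComm

# split the world communicator into two halves based on rank parity
world = MPI.COMM_WORLD
subcomm = world.Split(color=world.rank % 2, key=world.rank)

# nbodykit objects created after this call will use the sub-communicator
CurrentMPIComm.set(subcomm)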
Data-based parallelism¶
When writing code with nbodykit, it’s important to keep in mind that memory is not shared across different CPUs when using MPI. This is particularly relevant when interacting with data in nbodykit. Both the catalog and mesh objects are distributed containers, meaning that the data is spread out evenly across the available processes within an MPI communicator. A single process does not have access to the full catalog or 3D mesh but merely the portion stored in its memory. A crucial benefit of the distributed nature of data in nbodykit is that we can quickly load large data sets that would otherwise not fit into the memory of a single process.
When working with several processes, a simple way to gather the full data set
onto every process is to combine the numpy.concatenate()
function
with the allgather()
method of the current MPI communicator.
For example, given a catalog object, we can gather the full “Position” column
to all ranks using
data = numpy.concatenate(catalog.comm.allgather(catalog['Position'].compute()), axis=0)
Important
Beware of such allgather()
operations. Each process gets a full copy
of the data, and the computer will quickly run out of memory if the catalog is large.
Task-based parallelism¶
Often, large-scale structure data analysis involves hundreds to thousands
of iterations of a single, less computationally expensive task.
Examples include parameter sampling and minimization, as well as the calculation of
clustering statistics from thousands of simulations to determine
covariance properties. nbodykit includes the
TaskManager
class to allow
users to easily iterate over multiple tasks while using nbodykit.
Users can specify the desired number of MPI ranks per task, and tasks will
run in parallel ensuring that all MPI ranks are being used.
We attempt to hide most of the MPI complexities from the user by
implementing the TaskManager
utility as a Python
context manager. This allows users to simply paste the workflow of a single
task into the context of a TaskManager
to iterate
through tasks in parallel.
For example, we can compute the power spectrum of a simulated catalog of particles with several different bias values using:
from nbodykit.lab import *
# the bias values to iterate over
biases = [1.0, 2.0, 3.0, 4.0]
# initialize the task manager to run the tasks
with TaskManager(cpus_per_task=2, use_all_cpus=True) as tm:
    # set up the linear power spectrum
    redshift = 0.55
    cosmo = cosmology.Planck15
    Plin = cosmology.LinearPower(cosmo, redshift, transfer='EisensteinHu')

    # iterate through the bias values
    for bias in tm.iterate(biases):

        # initialize the catalog for this bias
        cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=bias)

        # compute the power spectrum
        r = FFTPower(cat, mode="2d")

        # and save
        r.save("power-" + str(bias) + ".json")
Here, we rely on the iterate()
function to
iterate through our tasks using a for loop. We could have also used the
map()
function to apply a function to each
bias value (it behaves identically to the built-in map()).
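For example, the same tasks could be expressed with map() inside the same TaskManager context as above (the helper function name here is illustrative):
def compute_power(bias):
    # same body as the loop above: build the catalog and measure its power spectrum
    cat = LogNormalCatalog(Plin=Plin, nbar=3e-3, BoxSize=1380., Nmesh=256, bias=bias)
    FFTPower(cat, mode="2d").save("power-" + str(bias) + ".json")

results = tm.map(compute_power, biases)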
The TaskManager utility works by first splitting the available CPUs into subgroups, where the size of the subgroups is determined by the cpus_per_task argument passed to TaskManager. With these subgroups ready to execute the tasks, the manager uses one root process to distribute tasks to the subgroups of workers as they become available, stopping when all of the tasks have finished. Internally, the TaskManager class splits the global MPI communicator into smaller groups, so that the processes in each subgroup communicate only among themselves.
Important
Users should execute all of the nbodykit-related code from within
the context of the TaskManager
. In particular,
users should take care to ensure that all of the catalog and mesh objects
are created from within the manager’s context. Otherwise, there will be
issues with the MPI communication and the code is likely to stall.
The above code is available in the nbodykit source code on GitHub
here.
We encourage users to download the script and experiment with different
cpus_per_task
values and total MPI processes. For example, if we
run with 3 total processes and use a single process per task, then two tasks
will always be executed simultaneously (one process is reserved to
distribute the tasks to the other ranks). When running with this configuration,
users will see the following output:
$ mpiexec -n 3 python example-batch.py 1
rank 2: computing for bias = 1.0
rank 1: computing for bias = 2.0
rank 2: computing for bias = 3.0
rank 1: computing for bias = 4.0
Analyzing your Results¶
Several nbodykit algorithms compute binned clustering statistics
and store the results as a BinnedStatistic
object (see a list
of these algorithms here). In this section,
we provide an overview of some of the functionality of this class to help
users better analyze their algorithm results.
The BinnedStatistic
class is designed to hold data variables at
fixed coordinates, i.e., a grid of \((r, \mu)\) or \((k, \mu)\) bins
and is modeled after the syntax of the xarray.Dataset
object.
Loading and Saving Results¶
A BinnedStatistic
object is serialized to disk using a
JSON format. The to_json()
and
from_json()
functions can be used to save and load
BinnedStatistic
objects, respectively.
To start, we read two BinnedStatistic
results from JSON files,
one holding a 1D power measurement \(P(k)\) and one holding a 2D power
measurement \(P(k,\mu)\).
[2]:
import os
from nbodykit.binned_statistic import BinnedStatistic
data_dir = os.path.join(os.path.abspath('.'), 'data')
power_1d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_1d.json'))
power_2d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_2d.json'))
Coordinate Grid¶
The clustering statistics are measured for fixed bins, and the
BinnedStatistic
class has several attributes to access the
coordinate grid defined by these bins:
- shape : the shape of the coordinate grid
- dims : the names of each dimension of the coordinate grid
- coords : a dictionary that gives the center bin values for each dimension of the grid
- edges : a dictionary giving the edges of the bins for each coordinate dimension
[3]:
print("1D shape = ", power_1d.shape)
print("2D shape = ", power_2d.shape)
1D shape = (64,)
2D shape = (64, 5)
[4]:
print("1D dims = ", power_1d.dims)
print("2D dims = ", power_2d.dims)
1D dims = ['k']
2D dims = ['k', 'mu']
[5]:
print("2D edges = ", power_2d.coords)
2D edges = {'k': array([0.00613593, 0.01840777, 0.03067962, 0.04295147, 0.05522331,
0.06749516, 0.079767 , 0.09203884, 0.10431069, 0.11658255,
0.1288544 , 0.14112625, 0.1533981 , 0.1656699 , 0.17794175,
0.1902136 , 0.20248545, 0.2147573 , 0.22702915, 0.239301 ,
0.25157285, 0.2638447 , 0.27611655, 0.2883884 , 0.30066025,
0.3129321 , 0.32520395, 0.3374758 , 0.3497476 , 0.36201945,
0.3742913 , 0.38656315, 0.398835 , 0.41110685, 0.4233787 ,
0.43565055, 0.4479224 , 0.46019425, 0.4724661 , 0.48473795,
0.4970098 , 0.5092816 , 0.52155345, 0.5338253 , 0.54609715,
0.558369 , 0.57064085, 0.5829127 , 0.59518455, 0.6074564 ,
0.61972825, 0.6320001 , 0.64427195, 0.6565438 , 0.6688156 ,
0.68108745, 0.6933593 , 0.70563115, 0.717903 , 0.73017485,
0.7424467 , 0.75471855, 0.7669904 , 0.77926225]), 'mu': array([0.1, 0.3, 0.5, 0.7, 0.9])}
[6]:
print("2D edges = ", power_2d.edges)
2D edges = {'k': array([0. , 0.01227185, 0.02454369, 0.03681554, 0.04908739,
0.06135923, 0.07363108, 0.08590292, 0.09817477, 0.1104466 ,
0.1227185 , 0.1349903 , 0.1472622 , 0.159534 , 0.1718058 ,
0.1840777 , 0.1963495 , 0.2086214 , 0.2208932 , 0.2331651 ,
0.2454369 , 0.2577088 , 0.2699806 , 0.2822525 , 0.2945243 ,
0.3067962 , 0.319068 , 0.3313399 , 0.3436117 , 0.3558835 ,
0.3681554 , 0.3804272 , 0.3926991 , 0.4049709 , 0.4172428 ,
0.4295146 , 0.4417865 , 0.4540583 , 0.4663302 , 0.478602 ,
0.4908739 , 0.5031457 , 0.5154175 , 0.5276894 , 0.5399612 ,
0.5522331 , 0.5645049 , 0.5767768 , 0.5890486 , 0.6013205 ,
0.6135923 , 0.6258642 , 0.638136 , 0.6504079 , 0.6626797 ,
0.6749515 , 0.6872234 , 0.6994952 , 0.7117671 , 0.7240389 ,
0.7363108 , 0.7485826 , 0.7608545 , 0.7731263 , 0.7853982 ]), 'mu': array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])}
Accessing the Data¶
The names of data variables stored in a BinnedStatistic
are stored in
the variables
attribute, and the data
attribute stores the
arrays for each of these names in a structured array. The data for a given
variable can be accessed in a dict-like fashion:
[7]:
print("1D variables = ", power_1d.variables)
print("2D variables = ", power_2d.variables)
1D variables = ['power', 'k', 'mu', 'modes']
2D variables = ['power', 'k', 'mu', 'modes']
[8]:
# the real component of the 1D power
Pk = power_1d['power'].real
print(type(Pk), Pk.shape, Pk.dtype)
# complex power array
Pkmu = power_2d['power']
print(type(Pkmu), Pkmu.shape, Pkmu.dtype)
<class 'numpy.ndarray'> (64,) float64
<class 'numpy.ndarray'> (64, 5) complex128
In some cases, the variable value for a given bin will be missing or invalid,
which is indicated by a numpy.nan
value in the data
array for
the given bin. The BinnedStatistic
class carries a mask
attribute that defines which elements of the data array have
numpy.nan
values.
Meta-data¶
An OrderedDict
of meta-data for a
BinnedStatistic
class is stored in the attrs
attribute.
Typically in nbodykit, the attrs
dictionary stores
information about box size, number of objects, etc:
[9]:
print("attrs = ", power_2d.attrs)
attrs = {'N1': 4033, 'Lx': 512.0, 'Lz': 512.0, 'N2': 4033, 'Ly': 512.0, 'volume': 134217728.0}
To attach additional meta-data to a BinnedStatistic
class, the user
can add additional keywords to the attrs
dictionary.
Slicing¶
Slices of the coordinate grid of a BinnedStatistic
can be achieved
using array-like indexing of the main BinnedStatistic
object, which
will return a new BinnedStatistic
holding the sliced data:
[10]:
# select the first mu bin
print(power_2d[:,0])
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
[11]:
# select the first and last mu bins
print(power_2d[:, [0, -1]])
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
[12]:
# select the first 5 k bins
print(power_1d[:5])
<BinnedStatistic: dims: (k: 5), variables: ('power', 'k', 'mu', 'modes')>
A typical usage of array-like indexing is to loop over the mu
dimension
of a 2D BinnedStatistic
, such as when plotting:
[13]:
from matplotlib import pyplot as plt
# the shot noise is volume / number of objects
shot_noise = power_2d.attrs['volume'] / power_2d.attrs['N1']
# plot each mu bin separately
for i in range(power_2d.shape[1]):
    pk = power_2d[:,i]
    label = r"$\mu = %.1f$" % power_2d.coords['mu'][i]
    plt.loglog(pk['k'], pk['power'].real - shot_noise, label=label)

plt.legend()
plt.xlabel(r"$k$ [$h$/Mpc]", fontsize=14)
plt.ylabel(r"$P(k,\mu)$ $[\mathrm{Mpc}/h]^3$", fontsize=14)
plt.show()

The coordinate grid can also be sliced using label-based indexing, similar to
the syntax of xarray.Dataset.sel()
. The method
keyword of
sel()
determines if exact
coordinate matching is required (method=None
, the default) or if the
nearest grid coordinate should be selected automatically (method='nearest'
).
For example, we can slice power spectrum results based on the
k
and mu
coordinate values:
[14]:
# get all mu bins for the k bin closest to k=0.1
print(power_2d.sel(k=0.1, method='nearest'))
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>
[15]:
# slice from k=0.01-0.1 for mu = 0.5
print(power_2d.sel(k=slice(0.01, 0.1), mu=0.5, method='nearest'))
<BinnedStatistic: dims: (k: 8), variables: ('power', 'k', 'mu', 'modes')>
We also provide a squeeze()
function which behaves
similarly to the numpy.squeeze()
function:
[16]:
# get all mu bins for the k bin closest to k=0.1, but keep k dimension
sliced = power_2d.sel(k=[0.1], method='nearest')
print(sliced)
<BinnedStatistic: dims: (k: 1, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
[17]:
# and then squeeze to remove the k dimension
print(sliced.squeeze())
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>
Note that, by default, array-based or label-based indexing will automatically “squeeze” sliced objects that have a dimension of length one, unless a list of indexers is used, as is done above.
Reindexing¶
It is possible to reindex a specific dimension of the coordinate grid using
reindex()
. The new bin spacing must be an integral
multiple of the original spacing, and the variable values will be averaged together
on the new coordinate grid.
[18]:
# re-index into wider k bins
print(power_2d.reindex('k', 0.02))
<BinnedStatistic: dims: (k: 32, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
ndarray = operation(ndarray, axis=-1*(i+1))
[19]:
# re-index into wider mu bins
print(power_2d.reindex('mu', 0.4))
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
ndarray = operation(ndarray, axis=-1*(i+1))
Note
Any variable names passed to reindex()
via the
fields_to_sum
keyword will have their values summed, instead of averaged,
when reindexing.
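For example, a sketch that rebins in k while summing, rather than averaging, the mode counts:
# rebin to dk = 0.02, summing the 'modes' variable instead of averaging it
rebinned = power_2d.reindex('k', 0.02, fields_to_sum=['modes'])
print(rebinned)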
Averaging¶
The average of a specific dimension can be taken using
the average()
function. A common use case is averaging
over the mu
dimension of a 2D BinnedStatistic
, which is accomplished
by:
[20]:
# compute P(k) from P(k,mu)
print(power_2d.average('mu'))
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
ndarray = operation(ndarray, axis=-1*(i+1))
Saving your Results¶
API Reference¶
The full list of nbodykit modules is available here. We summarize the most important aspects of the API below.
The nbodykit lab¶
To make things easier for users, we import all of the classes and modules needed to do cool science into a single module:
The nbodykit lab, containing all of the necessary ingredients to use nbodykit. |
Cosmology (nbodykit.cosmology)¶
The main cosmology object relies on the functionality of the classylss
package, which provides a binding of the CLASS CMB Boltzmann code.
The syntax largely follows that used by CLASS. Below, we list the main cosmology class,
as well as its attributes and methods:
|
A cosmology calculator based on the CLASS binding in |
Attributes
Methods
There are several transfer functions available to the user:
|
The linear matter transfer function using the CLASS Boltzmann code. |
|
The linear matter transfer function using the Eisenstein & Hu (1998) fitting formula with BAO wiggles. |
|
Linear power spectrum using the Eisenstein & Hu (1998) fitting formula without BAO wiggles. |
There are several power spectrum classes:
|
An object to compute the linear power spectrum and related quantities, using a transfer function from the CLASS code or the analytic Eisenstein & Hu approximation. |
|
Nonlinear power spectrum computed using HaloFit via CLASS. |
|
The matter power spectrum in the Zel’dovich approximation. |
And a correlation function class and functions for transforming between power spectra and correlation functions
|
Evaluate the correlation function by Fourier transforming a power spectrum object, with automatic re-scaling with redshift and sigma8. |
|
Return a callable function returning the power spectrum multipole of degree \(\ell\), as computed from the Fourier transform of the input \(r\) and \(\xi_\ell(r)\) arrays. |
|
Return a callable function returning the correlation function multipole of degree \(\ell\), as computed from the Fourier transform of the input \(k\) and \(P_\ell(k)\) arrays. |
We also have a class for computing LPT background calculations:
|
The built-in cosmologies are:
Name | Source | \(H_0\) | \(\Omega_{m,0}\) | Flat
---|---|---|---|---
WMAP5 | Komatsu et al. 2009 | 70.2 | 0.277 | Yes
WMAP7 | Komatsu et al. 2011 | 70.4 | 0.272 | Yes
WMAP9 | Hinshaw et al. 2013 | 69.3 | 0.287 | Yes
Planck13 | Planck Collab 2013, Paper XVI | 67.8 | 0.307 | Yes
Planck15 | Planck Collab 2015, Paper XIII | 67.7 | 0.307 | Yes
Transforming Catalog Data (nbodykit.transform)¶
|
Concatenate CatalogSource objects together, optionally including only certain columns in the returned source. |
|
Stack the input dask arrays vertically, column by column. |
|
Return a dask array of the specified |
|
Convert sky coordinates ( |
|
Convert sky coordinates ( |
|
Return halo concentration from halo mass, based on the analytic fitting formulas presented in Dutton and Maccio 2014. |
|
Return proper halo radius from halo mass, based on the specified mass definition. |
Data Sources¶
Discrete Objects¶
Base class:
|
An abstract base class representing a catalog of discrete particles. |
And subclasses:
|
A CatalogSource that uses |
|
A CatalogSource that uses |
|
A CatalogSource that uses |
|
A CatalogSource that uses |
|
A CatalogSource that uses |
|
A CatalogSource that uses |
|
A CatalogSource that uses |
|
A CatalogSource initialized from an in-memory |
|
A CatalogSource of objects that represent halos, which can be populated using analytic models from |
|
A CatalogSource containing biased particles that have been Poisson-sampled from a log-normal density field. |
|
A CatalogSource that has uniformly-distributed |
|
A CatalogSource that can have columns added via a collective random number generator. |
|
An interface for simultaneous modeling of a |
|
A CatalogSource interface for handling multiples species of particles. |
|
Create a demo catalog of halos using one of the built-in |
Interpolating Objects to a Mesh¶
For simple object catalogs:
|
A mesh generated by resampling a Catalog with the given parameters. |
And Multiple Species
|
A subclass of |
And Survey Catalogs with a window / mask:
|
A subclass of |
Data Directly on a Mesh¶
Base class:
|
Base class for a source in the form of data on an input grid. |
And subclasses:
|
A MeshSource object that reads a mesh from disk using |
|
A MeshSource object that generates a |
|
A MeshSource initialized from an in-memory Field object, either a |
|
A MeshSource initialized from an in-memory numpy array. |
Algorithms (nbodykit.algorithms)¶
Clustering Statistics¶
|
Algorithm to compute the 1d or 2d power spectrum and/or multipoles in a periodic box, using a Fast Fourier Transform (FFT). |
|
The power spectrum of a field in a periodic box, projected over certain axes. |
|
Algorithm to compute power spectrum multipoles using FFTs for a data survey with non-trivial geometry. |
|
Algorithm to compute the 1d or 2d correlation and/or multipoles in a periodic box, using a Fast Fourier Transform (FFT). |
|
Count (weighted) pairs of objects in a simulation box as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using the |
|
Count (weighted) pairs of objects from a survey data catalog as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using the |
|
Compute the two-point correlation function for data in a simulation box as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using pair counting. |
|
Compute the two-point correlation function for observational survey data as a function of \(r\), \((r,\mu)\), \((r_p, \pi)\), or \(\theta\) using pair counting. |
|
Compute the multipoles of the isotropic, three-point correlation function in configuration space for data in a simulation box. |
|
Compute the multipoles of the isotropic, three-point correlation function in configuration space for observational survey data. |
Grouping Methods¶
|
A friends-of-friends halo finder that computes the label for each particle, denoting which halo it belongs to. |
|
Compute groups of objects using a cylindrical grouping method. |
|
Run an angular FOF algorithm to determine fiber collision groups from an input catalog, and then assign fibers such that the maximum number of objects receive a fiber. |
Miscellaneous¶
|
Estimate a proxy density based on the distance to the nearest neighbor. |
|
Compute the mean number density as a function of redshift \(n(z)\) from an input CatalogSource of particles. |
Managing Multiple Tasks (TaskManager)¶
|
An MPI task manager that distributes tasks over a set of MPI processes, using a specified number of independent workers to compute each task. |
|
A generator that iterates through a series of tasks in parallel. |
|
Like the built-in |
Analyzing Results (BinnedStatistic)¶
|
Lightweight class to hold statistics binned at fixed coordinates. |
|
Initialize a BinnedStatistic from a JSON file. |
|
Write a BinnedStatistic to a JSON file. |
|
Returns a copy of the BinnedStatistic, optionally change the type to cls. |
|
Rename a variable in |
|
Compute the average of each variable over the specified dimension. |
|
Reindex the dimension |
|
Return a new BinnedStatistic indexed by coordinate values along the specified dimension(s). |
|
Squeeze the BinnedStatistic along the specified dimension, which removes that dimension from the BinnedStatistic. |
The IO Library (nbodykit.io)¶
Base class:
An abstract base class representing a file object. |
Subclasses available from the nbodykit.io
module:
|
A file object to handle the reading of columns of data from a |
|
A file object to handle the reading of columns of data from a binary file. |
|
A file object to handle the reading of columns of data from a CSV file. |
|
A file object to handle the reading of FITS data using the |
|
A file object to handle the reading of columns of data from a |
|
A file object that offers a continuous view of a stack of subclasses of |
|
Read snapshot binary files from Martin White’s TPM simulations. |
|
Read snapshot binary files from Volker Springel's Gadget 1/2/3 simulations. |
Internal Nuts and Bolts¶
MPI Utilities¶
A class to facilitate getting and setting the current MPI communicator. |
|
Decorator to attach the current MPI communicator to the input keyword arguments of |
|
Get the default current MPI communicator. |
|
Set the current MPI communicator to the input value. |
|
|
Gather the input data array from all ranks to the specified |
|
Scatter the input data array across all ranks, assuming data is initially only on root (and None on other ranks). |
General Utilities¶
|
Turn on logging, with the specified level. |
|
Set global configuration options. |
|
A subclass of |
|
A subclass of |
Generating Mock Data¶
Make a Gaussian realization of an overdensity field, \(\delta(x)\). |
|
Make a Gaussian realization of an overdensity field in real space, \(\delta(x)\). |
|
Poisson sample the linear delta and displacement fields to points. |
Contact and Support¶
Users can submit questions, feedback, bug reports, and feature requests by opening up an issue on the GitHub Issue Tracker.
If submitting a bug report or feature request, be sure to read our contributing guidelines.
Contributing Guidelines¶
We welcome user contributions to nbodykit, but be sure to read this guide first to make the process as smooth as possible!
Requesting a Feature¶
You can use GitHub issues to request features you would like to see in nbodykit.
Provide a clear and detailed explanation of the feature you want to add and its use case in large-scale structure data analysis. Keep in mind that the goal is to include features that are useful for a wide set of users.
If you are able, start writing some code and attempt a pull request. Be sure to read the submission guidelines. There are often several features competing for time commitments, and user pull requests are greatly appreciated!
Bug Reporting¶
If you think you’ve found a bug in nbodykit, follow these steps to submit a report:
First, double check that your bug isn’t already fixed. Follow our instructions for setting up a local development environment and make sure you’ve installed the tip of the master branch.
Search for similar issues on GitHub. Make sure to delete is:open on the issue search to find solved tickets as well. It is possible that this bug has been encountered before.
Please include the versions of Python, nbodykit, and other dependency libraries if they are applicable (numpy, scipy, etc.).
Provide us with the logging output of the script from when the bug was encountered, including a traceback if appropriate. If at all possible, provide us with a standalone script that will reproduce the issue. Issues have a much higher chance of being resolved quickly if we can easily reproduce the bug.
Take a stab at fixing the bug yourself! The runtests module used by nbodykit supports on-line debugging via the PDB interface. It also supports drop-in replacements for PDB, such as PDB++. A common debugging route is to add a regression unit test to the nbodykit test suite that fails due to the bug and then run the test in debugging mode:
$ python run-tests.py nbodykit/path/to/your/test --pdb
Pull requests for bug fixes are always welcome!
We strongly recommend following the above steps and providing as much information as possible when bugs are encountered. This will help us resolve issues faster – and get everyone back to doing more accurate science!
Setting up for Local Development¶
Fork nbodykit on GitHub:
$ git clone https://github.com/bccp/nbodykit.git
$ cd nbodykit
Install the dependencies. We recommend using the Anaconda environment manager. To create a new environment for nbodykit development, use:
# create and activate a new conda environment
$ conda create -n nbodykit-dev python=3
$ source activate nbodykit-dev

# install requirements
$ conda install -c bccp --file requirements.txt --file requirements-extras.txt
Install nbodykit in develop mode using pip:
# install in develop mode
$ pip install -e .
Opening a Pull Request¶
Write the code implementing your bug fix or feature, making sure to use detailed commit messages.
Ensure that any new code is properly documented, with docstrings following the NumPy/Scipy documentation style guide.
Write tests of the new code, ensuring that it has full unit test coverage. This is a crucial step as new pull requests will not be merged if they reduce the overall test coverage of nbodykit.
Run the test suite locally. From the main nbodykit directory, run:
$ python run-tests.py --with-coverage --html-cov
This will also output the test coverage statistics to build/coverage/index.html.
Make sure all of the tests have passed and that the coverage statistics indicate that any new code is fully covered by the test suite.
Be sure to update the changelog to indicate what was added/modified.
Submit your pull request to nbodykit:master. The Travis CI build must be passing before your pull request can be merged. Additionally, the overall coverage of the test suite must not decrease for the pull request to be merged.
Contributing to the Cookbook¶
If you have an application of nbodykit that is concise and interesting, please consider adding it to our cookbook of recipes. We also welcome feedback and improvements for these recipes. Users can submit issues or open a pull request on the nbodykit cookbook repo on GitHub.
Cookbook recipes should be in the form of Jupyter notebooks. See the existing recipes for examples. The recipes are designed to illustrate interesting uses of nbodykit for other users to learn from.
We appreciate any and all contributions!
Changelog¶
0.3.10 (2019-02-07)¶
#557: Use dask’s gufunc for sky transforms
#558: Note the unit of HaloRadius (proper, not comoving)
#561: Allow setting the position column in 2PCF.
#563: RedshiftHistogram extrapolates to zero rather than any number.
#564: Fix missing compensations with Multi-species meshes.
#565: add keyword header to HDFCatalog for reading additional meta-data
#566: fix a bug sorting on float32, and add a persist method to Catalog.
#567: suppress redundant output in convpower.
0.3.9 (2019-01-07)¶
0.3.8 (2018-12-29)¶
#543: Further performance improvements on catalog slicing.
#542: The IO module shall make sure buffer is c-contiguous before reshaping
#541: Allow setting the Cartesian / spherical transformation reference frame
#540: Allow not saving the header in Catalog.save
#539: Allow non-uniform redshifts in halo property transformations.
#538: Stop gathering catalog to a single rank in HaloCatalog
#537: Use numpy.sum for summing of integers.
#536: Fix boxsize mismatch comparison in pair counters.
#535: Improve working with a dask cluster.
#532: Improve speed of slicing of a catalog.
#531: Additional throttling during painting.
#530: Use setuptools (need to change conda-build-bccp recipe)
#529: Add kmax(rmax) to FFTPower, FFTCorr, ConvPower.
#528: Add dataset= to Catalog.save, deprecate datasets=[]
0.3.7 (2018-10-17)¶
0.3.5 (2018-08-23)¶
0.3.4 (2018-06-29)¶
#495: Improve scaling of LogNormal catalog
#497: take method to BinnedStatistic
#498: add compute method to Catalog interface; CatalogMesh no longer a Catalog
#500: unique binning in FFTPower and FFTCorr
#503: redistributing a catalog spatially
#504: Catalog.copy hangs
#505: update docrep to 0.2.3
#506: compatible with dask 0.18.1.
0.3.3 (2018-05-30)¶
0.3.2 (2018-05-14)¶
#475: proper normalization of the Landy-Szalay estimator, included R1R2 option and to_xil function
#487: Linear theory counterpart of an N-body simulation (three-fluid model)
#486: overdecomposition in FOF
#483: switching to a new type in BinnedStatistics.copy()
#482: Fix a crash when two datasets passed into corrfunc are of different dtypes.
#480: BigFileCatalog shall look for header relative to the root of file.
#479: GatherArray allows root=Ellipsis (for allgather)
#476: Fix MeshSource.apply if MeshSource.action is overridden
#471: Decompose of surveydata to the correct bounds.
0.3.1 (2018-04-10)¶
0.3.0 (2017-12-18)¶
#439: added updated pair counter algorithms, SurveyDataPairCount and SimulationBoxPairCount.
#439: added correlation function algorithms, SurveyData2PCF and SimulationBox2PCF
#441: add a DemoHaloCatalog for tutorials that downloads small halo catalogs using Halotools
#441: add hod module with wrapper classes for Halotools models and create HOD catalog by calling the populate() method of a HaloCatalog
#445: add a global cache with fixed size for dask calculations
#446: fixes future warning generated by pandas
#447: adds PCS sampling windows
0.2.9 (2017-11-15)¶
#442: bug fix: fixes MemoryError when data is larger than memory in paint(); adds paint_chunk_size default option
#440: Selection, Value, Weight specified as “default” columns; default columns are not saved to disk
#437: bug fix: make sure to copy attributes of catalog when copy() is called
#436: FFT-based correlation function algorithm, FFTCorr, added
#435: binder badge added to README and documentation for cookbook recipes
#433: by default, the header file will be found automatically in Bigfile
#430: fix bug in FOF due to stricter numpy casting rules in numpy 1.13.3
#428: fixes bug in painting normalization when interlacing is used
#422: proper list of attributes/methods added to documentation of Cosmology class
#425: latex requirement removed from notebook.mplstyle style file
#423: support for Gadget 1 file format
0.2.8 (2017-10-06)¶
#398: AngularPairCount algorithm added to compute pair counts for survey data as a function of angular separation
#364: fix load balancing for survey pair counting algorithms
#415: fix sympy pickling issue
#409: fix periodic boundary condition issues with FOF for low number of ranks
#420: fix bug introduced in 0.2.7 causing selection of CatalogSources to sometimes hang
#420: remove dask selection optimizations, which can cause the code to crash in uncontrollable ways
#421: better error messaging when using deprecated __init__ syntax for Cosmology class
#406: add global sort and slice operations to CatalogSource objects
0.2.7 (2017-09-25)¶
#384: fix packaging bug causing notebook.mplstyle to be missing from the conda build
rename test driver from runtests.py to run-tests.py
set_options context manager added to set global configuration variables
#392, #403: add optimized slicing via dask when applying a boolean selection index to a CatalogSource
#393: CatalogMesh is implemented as a view of a CatalogSource – column set/gets operate on the underlying CatalogSource
ConvolvedFFTPower supports cross-correlations of 2 mesh objects originating from the same data/randoms, allowing users to apply different weighting schemes to the two meshes
transform.SkyToCartesion deprecated in favor of transform.SkyToCartesian
#386: bug fixes related to behavior of Cosmology.clone
0.2.6 (2017-08-29)¶
#379: updated Cosmology class built on classylss, a Python binding of the CLASS Boltzmann code
#379: LinearPower object added with CLASS or Eisenstein-Hu transfer
#379: ZeldovichPower object added to compute Zel’dovich power spectrum
#379:HalofitPower object added to compute nonlinear power
#379: CorrelationFunction object added to FFT power spectra to compute theoretical correlation functions
#379: EHPower and NoWiggleEHPower deprecated in favor of LinearPower object
0.2.5 (2017-08-25)¶
#359: CSVFile and CSVCatalog no longer fail to read the last line of data when the file does not end in a newline
#361: add CylindricalGroups algorithm for computing groups of objects using the cylindrical grouping method of arXiv:1611.04165
#355: SimulationBoxPairCount and SurveyDataPairCount classes added to perform pair counting of objects in either simulation boxes or from survey data catalogs (using the Corrfunc code)
#370: large addition of documentation for version 0.2.x; still partially completed
DataSet has been renamed to BinnedStatistic
calculation of dk fixed in ProjectedFFTPower
paint() supports a Nmesh parameter, for easier re-sampling of mesh objects
#368: addition of Value column for painting mesh objects; this represents the value of the field painted, i.e., unity to paint density, or velocity to paint momentum (number-weighted velocity)
addition of style module with matplotlib style sheet to make nice plots in our doc tutorials; this makes the docs reproducible by users
transform.vstack deprecated in favor of transform.StackColumns
transform.concatenate deprecated in favor of transform.ConcatenateSources
when painting catalogs to a mesh, users can specify the position column to use via the position keyword
#142: MultipleSpeciesCatalog object added to support painting multiple species of particles to the same mesh, i.e., baryons and dark matter particles in hydro simulations
CatalogMeshSource renamed to CatalogMesh internally
can now delete a column from a CatalogSource
can now slice a CatalogSource using a list of column names
#373: fix bug in ConstantArray when length is 1
0.2.4 (2017-06-18)¶
#339: transform.StackColumns renamed to vstack
#339: transform.concatenate function added, which takes a list of source objects, and returns a new Source that has the concatenation of all data
#345: fix compatibility with halotools version 0.5
#346: ability to resample a MemoryMesh object
#344: bug fixes related to calculation of growth rate in cosmology module
#347: ArrayCatalog can now be initialized from a dictionary or structured array
#348: add a ProjectedFFTPower algorithm, that computes the FFT Power, but can project over certain axes, i.e., projected axes have their power averaged over
#353: FITSCatalog added to the io module, for reading FITS files
#352: KDDensity to quickly estimate local density in density region.
#352: FOF also identifies Peak position and velocity.
0.2.3 (2017-05-19)¶
use of resampler keyword in the paint function for compatibility with pmesh versions >= 0.1.24
bug fixes and code cleanup
0.2.2 (2017-04-27)¶
package maintenance updates only
0.2.1 (2017-04-26)¶
base dependencies + extras (halotools, h5py); install all dependencies via pip nbodykit[extras]
meta-data calculations in FKPCatalog now account for Source selection properly
support for numpy int/float meta-data in JSON output files
Cosmology instances no longer return attributes as Quantity instances, assuming a default set of units
renaming of various classes/module related to the nbodykit.Source syntax
no more nbodykit.Source in nbodykit.lab
nbodykit.source.particle has been renamed to nbodykit.source.catalog
source objects are now catalogs – their class names have “Catalog” appended to their names
added individual catalogs for different file types in nbodykit.io, i.e., CSVCatalog, HDFCatalog, etc
the .apply operation is no longer in-place for sources; it returns a view with the list of actions extended
galaxy type (central vs satellite) stored as integers in HODCatalog