Analyzing your Results¶

Several nbodykit algorithms compute binned clustering statistics and store the results as a BinnedStatistic object (see a list of these algorithms here. In this section, we provide an overview of some of the functionality of this class to help users better analyze their algorithm results.

The BinnedStatistic class is designed to hold data variables at fixed coordinates, i.e., a grid of \((r, \mu)\) or \((k, \mu)\) bins and is modeled after the syntax of the xarray.Dataset object.

Loading and Saving Results¶

A BinnedStatistic object is serialized to disk using a JSON format. The to_json() and from_json() functions can be used to save and load BinnedStatistic objects. respectively.

To start, we read two BinnedStatistic results from JSON files, one holding 1D power measurement \(P(k)\) and one holding a 2D power measurement \(P(k,\mu)\).

In [1]: from nbodykit.binned_statistic import BinnedStatistic

In [2]: power_1d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_1d.json'))

In [3]: power_2d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_2d.json'))

Coordinate Grid¶

The clustering statistics are measured for fixed bins, and the BinnedStatistic class has several attributes to access the coordinate grid defined by these bins:

shape: the shape of the coordinate grid
dims: the names of each dimension of the coordinate grid
coords: a dictionary that gives the center bin values for each dimension of the grid
edges: a dictionary giving the edges of the bins for each coordinate dimension

In [4]: print(power_1d.shape, power_2d.shape)
(64,) (64, 5)

In [5]: print(power_1d.dims, power_2d.dims)
['k'] ['k', 'mu']

In [6]: power_2d.coords
Out[6]: 
{'k': array([ 0.00613593,  0.01840777,  0.03067962,  0.04295147,  0.05522331,
         0.06749516,  0.079767  ,  0.09203884,  0.10431069,  0.11658255,
         0.1288544 ,  0.14112625,  0.1533981 ,  0.1656699 ,  0.17794175,
         0.1902136 ,  0.20248545,  0.2147573 ,  0.22702915,  0.239301  ,
         0.25157285,  0.2638447 ,  0.27611655,  0.2883884 ,  0.30066025,
         0.3129321 ,  0.32520395,  0.3374758 ,  0.3497476 ,  0.36201945,
         0.3742913 ,  0.38656315,  0.398835  ,  0.41110685,  0.4233787 ,
         0.43565055,  0.4479224 ,  0.46019425,  0.4724661 ,  0.48473795,
         0.4970098 ,  0.5092816 ,  0.52155345,  0.5338253 ,  0.54609715,
         0.558369  ,  0.57064085,  0.5829127 ,  0.59518455,  0.6074564 ,
         0.61972825,  0.6320001 ,  0.64427195,  0.6565438 ,  0.6688156 ,
         0.68108745,  0.6933593 ,  0.70563115,  0.717903  ,  0.73017485,
         0.7424467 ,  0.75471855,  0.7669904 ,  0.77926225]),
 'mu': array([ 0.1,  0.3,  0.5,  0.7,  0.9])}

In [7]: power_2d.edges
Out[7]: 
{'k': array([ 0.        ,  0.01227185,  0.02454369,  0.03681554,  0.04908739,
         0.06135923,  0.07363108,  0.08590292,  0.09817477,  0.1104466 ,
         0.1227185 ,  0.1349903 ,  0.1472622 ,  0.159534  ,  0.1718058 ,
         0.1840777 ,  0.1963495 ,  0.2086214 ,  0.2208932 ,  0.2331651 ,
         0.2454369 ,  0.2577088 ,  0.2699806 ,  0.2822525 ,  0.2945243 ,
         0.3067962 ,  0.319068  ,  0.3313399 ,  0.3436117 ,  0.3558835 ,
         0.3681554 ,  0.3804272 ,  0.3926991 ,  0.4049709 ,  0.4172428 ,
         0.4295146 ,  0.4417865 ,  0.4540583 ,  0.4663302 ,  0.478602  ,
         0.4908739 ,  0.5031457 ,  0.5154175 ,  0.5276894 ,  0.5399612 ,
         0.5522331 ,  0.5645049 ,  0.5767768 ,  0.5890486 ,  0.6013205 ,
         0.6135923 ,  0.6258642 ,  0.638136  ,  0.6504079 ,  0.6626797 ,
         0.6749515 ,  0.6872234 ,  0.6994952 ,  0.7117671 ,  0.7240389 ,
         0.7363108 ,  0.7485826 ,  0.7608545 ,  0.7731263 ,  0.7853982 ]),
 'mu': array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])}

Accessing the Data¶

The names of data variables stored in a BinnedStatistic are stored in the variables attribute, and the data attribute stores the arrays for each of these names in a structured array. The data for a given variable can be accessed in a dict-like fashion:

In [8]: power_1d.variables
Out[8]: ['power', 'k', 'mu', 'modes']

In [9]: power_2d.variables
Out[9]: ['power', 'k', 'mu', 'modes']

# the real component of the 1D power
In [10]: Pk = power_1d['power'].real

In [11]: print(type(Pk), Pk.shape, Pk.dtype)
<class 'numpy.ndarray'> (64,) float64

# complex power array
In [12]: Pkmu = power_2d['power']

In [13]: print(type(Pkmu), Pkmu.shape, Pkmu.dtype)
<class 'numpy.ndarray'> (64, 5) complex128

In some cases, the variable value for a given bin will be missing or invalid, which is indicated by a numpy.nan value in the data array for the given bin. The BinnedStatistic class carries a mask attribute that defines which elements of the data array have numpy.nan values.

Meta-data¶

An OrderedDict of meta-data for a BinnedStatistic class is stored in the attrs attribute. Typically in nbodykit, the attrs dictionary stores information about box size, number of objects, etc:

In [14]: power_2d.attrs
Out[14]: 
{'Lx': 512.0,
 'Ly': 512.0,
 'Lz': 512.0,
 'N1': 4033,
 'N2': 4033,
 'volume': 134217728.0}

To attach additional meta-data to a BinnedStatistic class, the user can add additional keywords to the attrs dictionary.

Slicing¶

Slices of the coordinate grid of a BinnedStatistic can be achieved using array-like indexing of the main BinnedStatistic object, which will return a new BinnedStatistic holding the sliced data:

# select the first mu bin
In [15]: power_2d[:,0]
Out[15]: <BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>

# select the first and last mu bins
In [16]: power_2d[:, [0, -1]]
Out[16]: <BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>

# select the first 5 k bins
In [17]: power_1d[:5]
Out[17]: <BinnedStatistic: dims: (k: 5), variables: ('power', 'k', 'mu', 'modes')>

A typical usage of array-like indexing is to loop over the mu dimension of a 2D BinnedStatistic, such as when plotting:

In [18]: from matplotlib import pyplot as plt

# the shot noise is volume / number of objects
In [19]: shot_noise = power_2d.attrs['volume'] / power_2d.attrs['N1']

# plot each mu bin separately
In [20]: for i in range(power_2d.shape[1]):
   ....:     pk = power_2d[:,i]
   ....:     label = r"$\mu = %.1f$" % power_2d.coords['mu'][i]
   ....:     plt.loglog(pk['k'], pk['power'].real - shot_noise, label=label)
   ....: 

In [21]: plt.legend()
Out[21]: <matplotlib.legend.Legend at 0x7f2fb98e1860>

In [22]: plt.xlabel(r"$k$ [$h$/Mpc]", fontsize=14)
Out[22]: <matplotlib.text.Text at 0x7f2fb98e6e80>

In [23]: plt.ylabel(r"$P(k,\mu)$ $[\mathrm{Mpc}/h]^3$", fontsize=14)
Out[23]: <matplotlib.text.Text at 0x7f2fb9900d68>

In [24]: plt.show()

The coordinate grid can also be sliced using label-based indexing, similar to the syntax of xarray.Dataset.sel(). The method keyword of sel() determines if exact coordinate matching is required (method=None, the default) or if the nearest grid coordinate should be selected automatically (method='nearest').

For example, we can slice power spectrum results based on the k and mu coordinate values:

# get all mu bins for the k bin closest to k=0.1
In [25]: power_2d.sel(k=0.1, method='nearest')
Out[25]: <BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>

# slice from k=0.01-0.1 for mu = 0.5
In [26]: power_2d.sel(k=slice(0.01, 0.1), mu=0.5, method='nearest')
Out[26]: <BinnedStatistic: dims: (k: 8), variables: ('power', 'k', 'mu', 'modes')>

We also provide a squeeze() function which behaves similar to the numpy.squeeze() function:

# get all mu bins for the k bin closest to k=0.1, but keep k dimension
In [27]: sliced = power_2d.sel(k=[0.1], method='nearest')

In [28]: sliced
Out[28]: <BinnedStatistic: dims: (k: 1, mu: 5), variables: ('power', 'k', 'mu', 'modes')>

# and then squeeze to remove the k dimension
In [29]: sliced.squeeze()
Out[29]: <BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>

Note that, by default, array-based or label-based indexing will automatically “squeeze” sliced objects that have a dimension of length one, unless a list of indexers is used, as is done above.

Reindexing¶

It is possible to reindex a specific dimension of the coordinate grid using reindex(). The new bin spacing must be an integral multiple of the original spacing, and the variable values will be averaged together on the new coordinate grid.

In [30]: power_2d.reindex('k', 0.02)
Out[30]: <BinnedStatistic: dims: (k: 32, mu: 5), variables: ('power', 'k', 'mu', 'modes')>

In [31]: power_2d.reindex('mu', 0.4)
Out[31]: <BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>

Note

Any variable names passed to reindex() via the fields_to_sum keyword will have their values summed, instead of averaged, when reindexing.

Averaging¶

The average of a specific dimension can be taken using the average() function. A common use case is averaging over the mu dimension of a 2D BinnedStatistic, which is accomplished by:

# compute P(k) from P(k,mu)
In [32]: power_2d.average('mu')
Out[32]: <BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>