Analyzing your Results

Several nbodykit algorithms compute binned clustering statistics and store the results as a BinnedStatistic object (see a list of these algorithms here. In this section, we provide an overview of some of the functionality of this class to help users better analyze their algorithm results.

The BinnedStatistic class is designed to hold data variables at fixed coordinates, i.e., a grid of \((r, \mu)\) or \((k, \mu)\) bins and is modeled after the syntax of the xarray.Dataset object.

Loading and Saving Results

A BinnedStatistic object is serialized to disk using a JSON format. The to_json() and from_json() functions can be used to save and load BinnedStatistic objects. respectively.

To start, we read two BinnedStatistic results from JSON files, one holding 1D power measurement \(P(k)\) and one holding a 2D power measurement \(P(k,\mu)\).

In [2]:
from nbodykit.binned_statistic import BinnedStatistic

data_dir = os.path.join(os.path.abspath('.'), 'data')
power_1d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_1d.json'))
power_2d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_2d.json'))

Coordinate Grid

The clustering statistics are measured for fixed bins, and the BinnedStatistic class has several attributes to access the coordinate grid defined by these bins:

  • shape: the shape of the coordinate grid
  • dims: the names of each dimension of the coordinate grid
  • coords: a dictionary that gives the center bin values for each dimension of the grid
  • edges: a dictionary giving the edges of the bins for each coordinate dimension
In [3]:
print("1D shape = ", power_1d.shape)
print("2D shape = ", power_2d.shape)
1D shape =  (64,)
2D shape =  (64, 5)
In [4]:
print("1D dims = ", power_1d.dims)
print("2D dims = ", power_2d.dims)
1D dims =  ['k']
2D dims =  ['k', 'mu']
In [5]:
print("2D edges = ", power_2d.coords)
2D edges =  {'k': array([ 0.00613593,  0.01840777,  0.03067962,  0.04295147,  0.05522331,
        0.06749516,  0.079767  ,  0.09203884,  0.10431069,  0.11658255,
        0.1288544 ,  0.14112625,  0.1533981 ,  0.1656699 ,  0.17794175,
        0.1902136 ,  0.20248545,  0.2147573 ,  0.22702915,  0.239301  ,
        0.25157285,  0.2638447 ,  0.27611655,  0.2883884 ,  0.30066025,
        0.3129321 ,  0.32520395,  0.3374758 ,  0.3497476 ,  0.36201945,
        0.3742913 ,  0.38656315,  0.398835  ,  0.41110685,  0.4233787 ,
        0.43565055,  0.4479224 ,  0.46019425,  0.4724661 ,  0.48473795,
        0.4970098 ,  0.5092816 ,  0.52155345,  0.5338253 ,  0.54609715,
        0.558369  ,  0.57064085,  0.5829127 ,  0.59518455,  0.6074564 ,
        0.61972825,  0.6320001 ,  0.64427195,  0.6565438 ,  0.6688156 ,
        0.68108745,  0.6933593 ,  0.70563115,  0.717903  ,  0.73017485,
        0.7424467 ,  0.75471855,  0.7669904 ,  0.77926225]), 'mu': array([ 0.1,  0.3,  0.5,  0.7,  0.9])}
In [6]:
print("2D edges = ", power_2d.edges)
2D edges =  {'k': array([ 0.        ,  0.01227185,  0.02454369,  0.03681554,  0.04908739,
        0.06135923,  0.07363108,  0.08590292,  0.09817477,  0.1104466 ,
        0.1227185 ,  0.1349903 ,  0.1472622 ,  0.159534  ,  0.1718058 ,
        0.1840777 ,  0.1963495 ,  0.2086214 ,  0.2208932 ,  0.2331651 ,
        0.2454369 ,  0.2577088 ,  0.2699806 ,  0.2822525 ,  0.2945243 ,
        0.3067962 ,  0.319068  ,  0.3313399 ,  0.3436117 ,  0.3558835 ,
        0.3681554 ,  0.3804272 ,  0.3926991 ,  0.4049709 ,  0.4172428 ,
        0.4295146 ,  0.4417865 ,  0.4540583 ,  0.4663302 ,  0.478602  ,
        0.4908739 ,  0.5031457 ,  0.5154175 ,  0.5276894 ,  0.5399612 ,
        0.5522331 ,  0.5645049 ,  0.5767768 ,  0.5890486 ,  0.6013205 ,
        0.6135923 ,  0.6258642 ,  0.638136  ,  0.6504079 ,  0.6626797 ,
        0.6749515 ,  0.6872234 ,  0.6994952 ,  0.7117671 ,  0.7240389 ,
        0.7363108 ,  0.7485826 ,  0.7608545 ,  0.7731263 ,  0.7853982 ]), 'mu': array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])}

Accessing the Data

The names of data variables stored in a BinnedStatistic are stored in the variables attribute, and the data attribute stores the arrays for each of these names in a structured array. The data for a given variable can be accessed in a dict-like fashion:

In [7]:
print("1D variables = ", power_1d.variables)
print("2D variables = ", power_2d.variables)
1D variables =  ['power', 'k', 'mu', 'modes']
2D variables =  ['power', 'k', 'mu', 'modes']
In [8]:
# the real component of the 1D power
Pk = power_1d['power'].real
print(type(Pk), Pk.shape, Pk.dtype)

# complex power array
Pkmu = power_2d['power']
print(type(Pkmu), Pkmu.shape, Pkmu.dtype)
<class 'numpy.ndarray'> (64,) float64
<class 'numpy.ndarray'> (64, 5) complex128

In some cases, the variable value for a given bin will be missing or invalid, which is indicated by a numpy.nan value in the data array for the given bin. The BinnedStatistic class carries a mask attribute that defines which elements of the data array have numpy.nan values.

Meta-data

An OrderedDict of meta-data for a BinnedStatistic class is stored in the attrs attribute. Typically in nbodykit, the attrs dictionary stores information about box size, number of objects, etc:

In [9]:
print("attrs = ", power_2d.attrs)
attrs =  {'N1': 4033, 'Lx': 512.0, 'Lz': 512.0, 'N2': 4033, 'Ly': 512.0, 'volume': 134217728.0}

To attach additional meta-data to a BinnedStatistic class, the user can add additional keywords to the attrs dictionary.

Slicing

Slices of the coordinate grid of a BinnedStatistic can be achieved using array-like indexing of the main BinnedStatistic object, which will return a new BinnedStatistic holding the sliced data:

In [10]:
# select the first mu bin
print(power_2d[:,0])
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
In [11]:
# select the first and last mu bins
print(power_2d[:, [0, -1]])
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
In [12]:
# select the first 5 k bins
print(power_1d[:5])
<BinnedStatistic: dims: (k: 5), variables: ('power', 'k', 'mu', 'modes')>

A typical usage of array-like indexing is to loop over the mu dimension of a 2D BinnedStatistic, such as when plotting:

In [13]:
from matplotlib import pyplot as plt

# the shot noise is volume / number of objects
shot_noise = power_2d.attrs['volume'] / power_2d.attrs['N1']

# plot each mu bin separately
for i in range(power_2d.shape[1]):
    pk = power_2d[:,i]
    label = r"$\mu = %.1f$" % power_2d.coords['mu'][i]
    plt.loglog(pk['k'], pk['power'].real - shot_noise, label=label)

plt.legend()
plt.xlabel(r"$k$ [$h$/Mpc]", fontsize=14)
plt.ylabel(r"$P(k,\mu)$ $[\mathrm{Mpc}/h]^3$", fontsize=14)
plt.show()
_images/analyzing-results_18_0.png

The coordinate grid can also be sliced using label-based indexing, similar to the syntax of xarray.Dataset.sel(). The method keyword of sel() determines if exact coordinate matching is required (method=None, the default) or if the nearest grid coordinate should be selected automatically (method='nearest').

For example, we can slice power spectrum results based on the k and mu coordinate values:

In [14]:
# get all mu bins for the k bin closest to k=0.1
print(power_2d.sel(k=0.1, method='nearest'))
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>
In [15]:
# slice from k=0.01-0.1 for mu = 0.5
print(power_2d.sel(k=slice(0.01, 0.1), mu=0.5, method='nearest'))
<BinnedStatistic: dims: (k: 8), variables: ('power', 'k', 'mu', 'modes')>

We also provide a squeeze() function which behaves similar to the numpy.squeeze() function:

In [16]:
# get all mu bins for the k bin closest to k=0.1, but keep k dimension
sliced = power_2d.sel(k=[0.1], method='nearest')
print(sliced)
<BinnedStatistic: dims: (k: 1, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
In [17]:
# and then squeeze to remove the k dimension
print(sliced.squeeze())
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>

Note that, by default, array-based or label-based indexing will automatically “squeeze” sliced objects that have a dimension of length one, unless a list of indexers is used, as is done above.

Reindexing

It is possible to reindex a specific dimension of the coordinate grid using reindex(). The new bin spacing must be an integral multiple of the original spacing, and the variable values will be averaged together on the new coordinate grid.

In [18]:
# re-index into wider k bins
print(power_2d.reindex('k', 0.02))
<BinnedStatistic: dims: (k: 32, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
/Users/nhand/Research/Programs/nbodykit/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
  ndarray = operation(ndarray, axis=-1*(i+1))
In [19]:
# re-index into wider mu bins
print(power_2d.reindex('mu', 0.4))
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
/Users/nhand/Research/Programs/nbodykit/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
  ndarray = operation(ndarray, axis=-1*(i+1))

Note

Any variable names passed to reindex() via the fields_to_sum keyword will have their values summed, instead of averaged, when reindexing.

Averaging

The average of a specific dimension can be taken using the average() function. A common use case is averaging over the mu dimension of a 2D BinnedStatistic, which is accomplished by:

In [20]:
# compute P(k) from P(k,mu)
print(power_2d.average('mu'))
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
/Users/nhand/Research/Programs/nbodykit/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
  ndarray = operation(ndarray, axis=-1*(i+1))