Analyzing your Results¶
Several nbodykit algorithms compute binned clustering statistics
and store the results as a BinnedStatistic
object (see a list
of these algorithms here. In this section,
we provide an overview of some of the functionality of this class to help
users better analyze their algorithm results.
The BinnedStatistic
class is designed to hold data variables at
fixed coordinates, i.e., a grid of \((r, \mu)\) or \((k, \mu)\) bins
and is modeled after the syntax of the xarray.Dataset
object.
Loading and Saving Results¶
A BinnedStatistic
object is serialized to disk using a
JSON format. The to_json()
and
from_json()
functions can be used to save and load
BinnedStatistic
objects. respectively.
To start, we read two BinnedStatistic
results from JSON files,
one holding 1D power measurement \(P(k)\) and one holding a 2D power
measurement \(P(k,\mu)\).
[2]:
from nbodykit.binned_statistic import BinnedStatistic
data_dir = os.path.join(os.path.abspath('.'), 'data')
power_1d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_1d.json'))
power_2d = BinnedStatistic.from_json(os.path.join(data_dir, 'dataset_2d.json'))
Coordinate Grid¶
The clustering statistics are measured for fixed bins, and the
BinnedStatistic
class has several attributes to access the
coordinate grid defined by these bins:
shape
: the shape of the coordinate griddims
: the names of each dimension of the coordinate gridcoords
: a dictionary that gives the center bin values for each dimension of the gridedges
: a dictionary giving the edges of the bins for each coordinate dimension
[3]:
print("1D shape = ", power_1d.shape)
print("2D shape = ", power_2d.shape)
1D shape = (64,)
2D shape = (64, 5)
[4]:
print("1D dims = ", power_1d.dims)
print("2D dims = ", power_2d.dims)
1D dims = ['k']
2D dims = ['k', 'mu']
[5]:
print("2D edges = ", power_2d.coords)
2D edges = {'k': array([0.00613593, 0.01840777, 0.03067962, 0.04295147, 0.05522331,
0.06749516, 0.079767 , 0.09203884, 0.10431069, 0.11658255,
0.1288544 , 0.14112625, 0.1533981 , 0.1656699 , 0.17794175,
0.1902136 , 0.20248545, 0.2147573 , 0.22702915, 0.239301 ,
0.25157285, 0.2638447 , 0.27611655, 0.2883884 , 0.30066025,
0.3129321 , 0.32520395, 0.3374758 , 0.3497476 , 0.36201945,
0.3742913 , 0.38656315, 0.398835 , 0.41110685, 0.4233787 ,
0.43565055, 0.4479224 , 0.46019425, 0.4724661 , 0.48473795,
0.4970098 , 0.5092816 , 0.52155345, 0.5338253 , 0.54609715,
0.558369 , 0.57064085, 0.5829127 , 0.59518455, 0.6074564 ,
0.61972825, 0.6320001 , 0.64427195, 0.6565438 , 0.6688156 ,
0.68108745, 0.6933593 , 0.70563115, 0.717903 , 0.73017485,
0.7424467 , 0.75471855, 0.7669904 , 0.77926225]), 'mu': array([0.1, 0.3, 0.5, 0.7, 0.9])}
[6]:
print("2D edges = ", power_2d.edges)
2D edges = {'k': array([0. , 0.01227185, 0.02454369, 0.03681554, 0.04908739,
0.06135923, 0.07363108, 0.08590292, 0.09817477, 0.1104466 ,
0.1227185 , 0.1349903 , 0.1472622 , 0.159534 , 0.1718058 ,
0.1840777 , 0.1963495 , 0.2086214 , 0.2208932 , 0.2331651 ,
0.2454369 , 0.2577088 , 0.2699806 , 0.2822525 , 0.2945243 ,
0.3067962 , 0.319068 , 0.3313399 , 0.3436117 , 0.3558835 ,
0.3681554 , 0.3804272 , 0.3926991 , 0.4049709 , 0.4172428 ,
0.4295146 , 0.4417865 , 0.4540583 , 0.4663302 , 0.478602 ,
0.4908739 , 0.5031457 , 0.5154175 , 0.5276894 , 0.5399612 ,
0.5522331 , 0.5645049 , 0.5767768 , 0.5890486 , 0.6013205 ,
0.6135923 , 0.6258642 , 0.638136 , 0.6504079 , 0.6626797 ,
0.6749515 , 0.6872234 , 0.6994952 , 0.7117671 , 0.7240389 ,
0.7363108 , 0.7485826 , 0.7608545 , 0.7731263 , 0.7853982 ]), 'mu': array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])}
Accessing the Data¶
The names of data variables stored in a BinnedStatistic
are stored in
the variables
attribute, and the data
attribute stores the
arrays for each of these names in a structured array. The data for a given
variable can be accessed in a dict-like fashion:
[7]:
print("1D variables = ", power_1d.variables)
print("2D variables = ", power_2d.variables)
1D variables = ['power', 'k', 'mu', 'modes']
2D variables = ['power', 'k', 'mu', 'modes']
[8]:
# the real component of the 1D power
Pk = power_1d['power'].real
print(type(Pk), Pk.shape, Pk.dtype)
# complex power array
Pkmu = power_2d['power']
print(type(Pkmu), Pkmu.shape, Pkmu.dtype)
<class 'numpy.ndarray'> (64,) float64
<class 'numpy.ndarray'> (64, 5) complex128
In some cases, the variable value for a given bin will be missing or invalid,
which is indicated by a numpy.nan
value in the data
array for
the given bin. The BinnedStatistic
class carries a mask
attribute that defines which elements of the data array have
numpy.nan
values.
Meta-data¶
An OrderedDict
of meta-data for a
BinnedStatistic
class is stored in the attrs
attribute.
Typically in nbodykit, the attrs
dictionary stores
information about box size, number of objects, etc:
[9]:
print("attrs = ", power_2d.attrs)
attrs = {'N1': 4033, 'Lx': 512.0, 'Lz': 512.0, 'N2': 4033, 'Ly': 512.0, 'volume': 134217728.0}
To attach additional meta-data to a BinnedStatistic
class, the user
can add additional keywords to the attrs
dictionary.
Slicing¶
Slices of the coordinate grid of a BinnedStatistic
can be achieved
using array-like indexing of the main BinnedStatistic
object, which
will return a new BinnedStatistic
holding the sliced data:
[10]:
# select the first mu bin
print(power_2d[:,0])
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
[11]:
# select the first and last mu bins
print(power_2d[:, [0, -1]])
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
[12]:
# select the first 5 k bins
print(power_1d[:5])
<BinnedStatistic: dims: (k: 5), variables: ('power', 'k', 'mu', 'modes')>
A typical usage of array-like indexing is to loop over the mu
dimension
of a 2D BinnedStatistic
, such as when plotting:
[13]:
from matplotlib import pyplot as plt
# the shot noise is volume / number of objects
shot_noise = power_2d.attrs['volume'] / power_2d.attrs['N1']
# plot each mu bin separately
for i in range(power_2d.shape[1]):
pk = power_2d[:,i]
label = r"$\mu = %.1f$" % power_2d.coords['mu'][i]
plt.loglog(pk['k'], pk['power'].real - shot_noise, label=label)
plt.legend()
plt.xlabel(r"$k$ [$h$/Mpc]", fontsize=14)
plt.ylabel(r"$P(k,\mu)$ $[\mathrm{Mpc}/h]^3$", fontsize=14)
plt.show()

The coordinate grid can also be sliced using label-based indexing, similar to
the syntax of xarray.Dataset.sel()
. The method
keyword of
sel()
determines if exact
coordinate matching is required (method=None
, the default) or if the
nearest grid coordinate should be selected automatically (method='nearest'
).
For example, we can slice power spectrum results based on the
k
and mu
coordinate values:
[14]:
# get all mu bins for the k bin closest to k=0.1
print(power_2d.sel(k=0.1, method='nearest'))
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>
[15]:
# slice from k=0.01-0.1 for mu = 0.5
print(power_2d.sel(k=slice(0.01, 0.1), mu=0.5, method='nearest'))
<BinnedStatistic: dims: (k: 8), variables: ('power', 'k', 'mu', 'modes')>
We also provide a squeeze()
function which behaves
similar to the numpy.squeeze()
function:
[16]:
# get all mu bins for the k bin closest to k=0.1, but keep k dimension
sliced = power_2d.sel(k=[0.1], method='nearest')
print(sliced)
<BinnedStatistic: dims: (k: 1, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
[17]:
# and then squeeze to remove the k dimension
print(sliced.squeeze())
<BinnedStatistic: dims: (mu: 5), variables: ('power', 'k', 'mu', 'modes')>
Note that, by default, array-based or label-based indexing will automatically “squeeze” sliced objects that have a dimension of length one, unless a list of indexers is used, as is done above.
Reindexing¶
It is possible to reindex a specific dimension of the coordinate grid using
reindex()
. The new bin spacing must be an integral
multiple of the original spacing, and the variable values will be averaged together
on the new coordinate grid.
[18]:
# re-index into wider k bins
print(power_2d.reindex('k', 0.02))
<BinnedStatistic: dims: (k: 32, mu: 5), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
ndarray = operation(ndarray, axis=-1*(i+1))
[19]:
# re-index into wider mu bins
print(power_2d.reindex('mu', 0.4))
<BinnedStatistic: dims: (k: 64, mu: 2), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
ndarray = operation(ndarray, axis=-1*(i+1))
Note
Any variable names passed to reindex()
via the
fields_to_sum
keyword will have their values summed, instead of averaged,
when reindexing.
Averaging¶
The average of a specific dimension can be taken using
the average()
function. A common use case is averaging
over the mu
dimension of a 2D BinnedStatistic
, which is accomplished
by:
[20]:
# compute P(k) from P(k,mu)
print(power_2d.average('mu'))
<BinnedStatistic: dims: (k: 64), variables: ('power', 'k', 'mu', 'modes')>
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/nbodykit/binned_statistic.py:56: RuntimeWarning: Mean of empty slice
ndarray = operation(ndarray, axis=-1*(i+1))