Common Data Operations

Here, we detail some of the most common operations when dealing with data in the form of a CatalogSource. The native format for data columns in a CatalogSource object is the dask array. Be sure to read the previous section for an introduction to dask arrays before proceeding.

The dask array format allows users to easily manipulate columns in their input data and feed any transformed data into one of the nbodykit algorithms. This provides a fast and easy way to transform the data while hiding the implementation details needed to compute these transformations internally. In this section, we’ll provide examples of some of these data transformations to get users acclimated to dask arrays quickly.

To help illustrate these operations, we’ll initialize the nbodykit “lab” and load a catalog of uniformly distributed objects.

In [1]: from nbodykit.lab import *

In [2]: cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

Accessing Data Columns

Specific columns can be accessed by indexing the catalog object using the column name, and a dask.array.Array object is returned (see What is a dask array? for more details on dask arrays).

In [3]: position = cat['Position']

In [4]: velocity = cat['Velocity']

In [5]: print(position)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)>

In [6]: print(velocity)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)>

While in the format of the dask array, data columns can easily be manipulated by the user. For example, here we normalize the position coordinates to the range 0 to 1 by dividing by the box size:

# normalize the position
In [7]: normed_position = position / cat.attrs['BoxSize']

In [8]: print(normed_position)
dask.array<truediv, shape=(96, 3), dtype=float64, chunksize=(96, 3)>

Note that the normalized position array is also a dask array and that the actual normalization operation is yet to occur. This makes these kinds of data transformations very fast for the user.

Computing Data Columns

Columns can be converted from dask.array.Array objects to numpy arrays using the compute() function (see Evaluating a dask array for further details on computing dask arrays).

In [9]: position, velocity = cat.compute(cat['Position'], cat['Velocity'])

In [10]: print(type(position))
<class 'numpy.ndarray'>

In [11]: print(type(velocity))
<class 'numpy.ndarray'>

We can also compute the max of the normalized position coordinates from the previous section:

In [12]: maxpos = normed_position.max(axis=0)

In [13]: print(maxpos)
dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>

In [14]: print(cat.compute(maxpos))
[ 0.9927406   0.99610592  0.99925086]

Adding New Columns

New columns can be easily added to a CatalogSource object by directly setting them:

# no "Mass" column originally
In [15]: print("contains 'Mass'? :", 'Mass' in cat)
contains 'Mass'? : False

# add a random array as the "Mass" column
In [16]: cat['Mass'] = numpy.random.random(size=len(cat))

# "Mass" exists!
In [17]: print("contains 'Mass'? :", 'Mass' in cat)
contains 'Mass'? : True

# can also add scalar values -- converted to correct length
In [18]: cat['Type'] = b"central"

In [19]: print(cat['Mass'])
dask.array<array, shape=(96,), dtype=float64, chunksize=(96,)>

In [20]: print(cat['Type'])
dask.array<array, shape=(96,), dtype=|S7, chunksize=(96,)>

Here, we have added two new columns to the catalog, Mass and Type. Internally, nbodykit stores the new columns as dask arrays and will automatically convert them to the correct type if they are not already.

Caveats

  • New columns must be either be a scalar value, or an array with the same length as the catalog. Scalar values will automatically be broadcast to the correct length.
  • Setting a column of the wrong length will raise an exception.

Overwriting Columns

The same syntax used for adding new columns can also be used to overwrite columns that already exist in a CatalogSource. This procedure works as one would expect – the most up-to-date values of columns are always used in operations.

In the example below we overwrite both the Position and Velocity columns, and each time the columns are accessed, the most up-to-date values are used, as expected.

# some fake data
In [21]: data = numpy.ones(5, dtype=[
   ....:         ('Position', ('f4', 3)),
   ....:         ('Velocity', ('f4', 3))]
   ....:         )
   ....: 

# initialize a catalog directly from the structured array
In [22]: src = ArrayCatalog(data)

# overwrite the Velocity column
In [23]: src['Velocity'] = src['Position'] + src['Velocity'] # 1 + 1 = 2

# overwrite the Position column
In [24]: src['Position'] = src['Position'] + src['Velocity'] # 1 + 2 = 3

In [25]: print("Velocity = ", src.compute(src['Velocity'])) # all equal to 2
Velocity =  [[ 2.  2.  2.]
 [ 2.  2.  2.]
 [ 2.  2.  2.]
 [ 2.  2.  2.]
 [ 2.  2.  2.]]

In [26]: print("Position = ", src.compute(src['Position'])) # all equal to 3
Position =  [[ 3.  3.  3.]
 [ 3.  3.  3.]
 [ 3.  3.  3.]
 [ 3.  3.  3.]
 [ 3.  3.  3.]]

Adding Redshift-space Distortions

A useful operation in large-scale structure is the mapping of positions in simulations from real space to redshift space, referred to as redshift space distortions (RSD). This operation can be easily performed by combining the Position and Velocity columns to overwrite the Position column. As first found by Kaiser 1987, the mapping from real to redshift space is:

\[s = r + \frac{\vv \cdot \nhat}}{a H},\]

where \(r\) is the line-of-sight position in real space, \(s\) is the line-of-sight position in redshift space, \(\vv\) is the velocity vector, \(\vnhat\) is the line-of-sight unit vector, \(a\) is the scale factor, and \(H\) is the Hubble parameter at \(a\).

As an example, below we add RSD along the z axis of a simulation box:

# apply RSD along the z axis
In [27]: line_of_sight = [0,0,1]

# redshift and cosmology
In [28]: redshift =  0.55; cosmo = cosmology.Cosmology(Om0=0.31, H0=70)

# the RSD normalization factor
In [29]: rsd_factor = (1+redshift) / (100 * cosmo.efunc(redshift))

# update Position, applying RSD
In [30]: src['Position'] = src['Position'] + rsd_factor * src['Velocity'] * line_of_sight

The RSD factor is known as the conformal Hubble parameter \(\mathcal{H} = a H(a)\). This calculation requires a cosmology, which can be specified via the Cosmology class. We use the efunc() function which returns \(E(z)\), where the Hubble parameter is defined as \(H(z) = 100h\ E(z)\) in units of km/s/Mpc. Note that the operation above assumes the Position column is in units of \(\mathrm{Mpc}/h\).

For catalogs in nbodykit that generate mock data, such as the log-normal catalogs or HOD catalogs, there is an additional column, VelocityOffset, available to facilitate RSD operations. This column has units of \(\mathrm{Mpc}/h\) and includes the rsd_factor above. Thus, this allows users to add RSD simply by using:

src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight

Selecting a Subset

A subset of a CatalogSource object can be selected using slice notation. There are two ways to select a subset:

  1. use a boolean array, which specifies which rows of the catalog to select
  2. use a slice object specifying which rows of the catalog to select

For example,

# boolean selection array
In [31]: select = cat['Mass'] < 0.5

In [32]: print("number of True entries = ", cat.compute(select.sum()))
number of True entries =  53

# select only entries where select = True
In [33]: subcat = cat[select]

In [34]: print("size of subcat = ", subcat.size)
size of subcat =  53

# select the first ten rows
In [35]: subcat = cat[:10]

In [36]: print("size of subcat = ", subcat.size)
size of subcat =  10

# select first and last row
In [37]: subcat = cat[[0, -1]]

In [38]: print("size of subcat = ", subcat.size)
size of subcat =  2

Caveats

  • When indexing with a boolean array, the array must have the same length as the size attribute, or an exception will be raised.
  • Selecting a single object by indexing with an integer is not supported. If the user wishes to select a single row, a list of length one can be used to select the specific row.

Selecting a Subset of Columns from a CatalogSource

A subset of columns can be selected from a CatalogSource object by indexing the catalog with a list of the names of the desired columns. For example,

In [39]: print("columns in catalog = ", cat.columns)
columns in catalog =  ['Mass', 'Position', 'Selection', 'Type', 'Value', 'Velocity', 'Weight']

# select Position + Mass
In [40]: subcat = cat[['Position', 'Mass']]

# the selected columns + default columns
In [41]: print("columns in subset = ", subcat.columns)
columns in subset =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']

Caveats

  • When selecting a subset of columns, note that in addition to the desired columns, the sub-catalog will also contain the default columns (Weight, Value, and Selection).

The nbodykit.transform module

The nbodykit.transform module includes several commonly used functions for convenience. We describe a few of the most common use cases in the sub-sections below.

Note

The transform module is available to users when from nbodykit.lab import * is executed.

Concatenating CatalogSource Objects

When CatalogSource objects have the same columns, they can be concatenated together into a single object using the nbodykit.transform.ConcatenateSources() function. For example,

In [42]: cat1 = UniformCatalog(nbar=50, BoxSize=1.0, seed=42)

In [43]: cat2 = UniformCatalog(nbar=150, BoxSize=1.0, seed=42)

In [44]: combined = transform.ConcatenateSources(cat1, cat2)

In [45]: print("total size = %d + %d = %d" %(cat1.size, cat2.size, combined.size))
total size = 47 + 145 = 192

Stacking Columns Together

Another common use case is when users need to combine separate data columns vertically, as the columns of a new array. For example, often the Cartesian position coordinates x, y, and z are stored as separate columns, and the Position column must be added to a catalog from these individual columns. We provide the nbodykit.transform.StackColumns() function for this exact purpose. For example,

# fake position data
In [46]: data = numpy.random.random(size=(5,3))

# save to a plaintext file
In [47]: numpy.savetxt('csv-example.dat', data, fmt='%.7e')

# the cartesian coordinates
In [48]: names =['x', 'y', 'z']

# read the data
In [49]: f = CSVCatalog('csv-example.dat', names)

# make the "Position" column
In [50]: f['Position'] =  transform.StackColumns(f['x'], f['y'], f['z'])

In [51]: print(f['Position'])
dask.array<transpose, shape=(5, 3), dtype=float64, chunksize=(5, 1)>

In [52]: print(f.compute(f['Position']))
[[ 0.73436277  0.39534468  0.567177  ]
 [ 0.86954257  0.99191751  0.7192412 ]
 [ 0.54870289  0.43708508  0.97382372]
 [ 0.22716903  0.35559318  0.666647  ]
 [ 0.43245037  0.11340447  0.06631885]]

Converting from Sky to Cartesian Coordinates

We provide the function nbodykit.transform.SkyToCartesion() for converting sky coordinates, in the form of right ascension, declination, and redshift, to Cartesian coordinates. The conversion from redshift to comoving distance requires a cosmology instance, which can be specified via the Cosmology class.

Below, we initialize a catalog holding random right ascension, declination, and redshift coordinates, and then add the Cartesian position as the Position column.

In [53]: src = RandomCatalog(100, seed=42)

# add random (ra, dec, z) coordinates
In [54]: src['z'] = src.rng.normal(loc=0.5, scale=0.1, size=src.size)

In [55]: src['ra'] = src.rng.uniform(low=0, high=360, size=src.size)

In [56]: src['dec'] = src.rng.uniform(low=-180, high=180., size=src.size)

# initialize a set of cosmology parameters
In [57]: cosmo = cosmology.Cosmology(Om0=0.31, H0=70)

# add the position
In [58]: src['Position'] = transform.SkyToCartesion(src['ra'], src['dec'], src['z'], degrees=True, cosmo=cosmo)

Caveats

  • Whether the right ascension and declination arrays are in degrees (as opposed to radians) should be specified via the degrees keyword.
  • The units of the returned Position column are \(\mathrm{Mpc}/h\).
  • By default, the redshift to comoving distance calculation uses interpolated results for significant speed-ups. Set interpolate_cdist to False to compute exact results.

Using the dask.array module

For more general column transformations, users should take advantage of the dask.array module, which implements most functions in the numpy package in a manner optimized for dask arrays. The module can be accessed from the nbodykit.transform module as nbodykit.transform.da.

Important

For a full list of functions available in the dask.array module, please see the dask array documentation. We strongly recommend that new users read through this documentation and familiarize themselves with the functionality provided by the dask.array module.

As a simple illustration, below we convert an array holding right ascension values from degrees to radians, compute the sine of the array, and find the min and max values using functions available in the dask.array module.

In [59]: ra = transform.da.deg2rad(src['ra']) # from degrees to radians

In [60]: sin_ra = transform.da.sin(ra) # compute the sine

In [61]: print("min(sin(ra)) = ", src.compute(sin_ra.min()))
min(sin(ra)) =  -0.999061041207

In [62]: print("max(sin(ra)) = ", src.compute(sin_ra.max()))
max(sin(ra)) =  0.999599541358