Here, we detail some of the most common operations when dealing with
data in the form of a `CatalogSource`

. The native format for data columns
in a `CatalogSource`

object is the dask array. Be sure to read
the previous section for an introduction to dask arrays
before proceeding.

The dask array format allows users to easily manipulate columns in their input data and feed any transformed data into one of the nbodykit algorithms. This provides a fast and easy way to transform the data while hiding the implementation details needed to compute these transformations internally. In this section, we’ll provide examples of some of these data transformations to get users acclimated to dask arrays quickly.

To help illustrate these operations, we’ll initialize the nbodykit “lab” and load a catalog of uniformly distributed objects.

```
In [1]: from nbodykit.lab import *
In [2]: cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
```

Specific columns can be accessed by indexing the catalog object using the
column name, and a `dask.array.Array`

object is returned (see
What is a dask array? for more details on dask arrays).

```
In [3]: position = cat['Position']
In [4]: velocity = cat['Velocity']
In [5]: print(position)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)>
In [6]: print(velocity)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)>
```

While in the format of the dask array, data columns can easily be manipulated by the user. For example, here we normalize the position coordinates to the range 0 to 1 by dividing by the box size:

```
# normalize the position
In [7]: normed_position = position / cat.attrs['BoxSize']
In [8]: print(normed_position)
dask.array<truediv, shape=(96, 3), dtype=float64, chunksize=(96, 3)>
```

Note that the normalized position array is also a dask array and that the actual normalization operation is yet to occur. This makes these kinds of data transformations very fast for the user.

Columns can be converted from `dask.array.Array`

objects to
numpy arrays using the `compute()`

function (see
Evaluating a dask array for further details on computing dask arrays).

```
In [9]: position, velocity = cat.compute(cat['Position'], cat['Velocity'])
In [10]: print(type(position))
<class 'numpy.ndarray'>
In [11]: print(type(velocity))
<class 'numpy.ndarray'>
```

We can also compute the max of the normalized position coordinates from the previous section:

```
In [12]: maxpos = normed_position.max(axis=0)
In [13]: print(maxpos)
dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
In [14]: print(cat.compute(maxpos))
[ 0.9927406 0.99610592 0.99925086]
```

New columns can be easily added to a `CatalogSource`

object by
directly setting them:

```
# no "Mass" column originally
In [15]: print("contains 'Mass'? :", 'Mass' in cat)
contains 'Mass'? : False
# add a random array as the "Mass" column
In [16]: cat['Mass'] = numpy.random.random(size=len(cat))
# "Mass" exists!
In [17]: print("contains 'Mass'? :", 'Mass' in cat)
contains 'Mass'? : True
# can also add scalar values -- converted to correct length
In [18]: cat['Type'] = b"central"
In [19]: print(cat['Mass'])
dask.array<array, shape=(96,), dtype=float64, chunksize=(96,)>
In [20]: print(cat['Type'])
dask.array<array, shape=(96,), dtype=|S7, chunksize=(96,)>
```

Here, we have added two new columns to the catalog, `Mass`

and `Type`

.
Internally, nbodykit stores the new columns as dask arrays and
will automatically convert them to the correct type if they are not already.

**Caveats**

- New columns must be either be a scalar value, or an array with the same length as the catalog. Scalar values will automatically be broadcast to the correct length.
- Setting a column of the wrong length will raise an exception.

The same syntax used for adding new columns can also be used to overwrite
columns that already exist in a `CatalogSource`

. This procedure
works as one would expect – the most up-to-date values of columns are
always used in operations.

In the example below we overwrite both the `Position`

and `Velocity`

columns, and each time the columns are accessed, the most up-to-date values
are used, as expected.

```
# some fake data
In [21]: data = numpy.ones(5, dtype=[
....: ('Position', ('f4', 3)),
....: ('Velocity', ('f4', 3))]
....: )
....:
# initialize a catalog directly from the structured array
In [22]: src = ArrayCatalog(data)
# overwrite the Velocity column
In [23]: src['Velocity'] = src['Position'] + src['Velocity'] # 1 + 1 = 2
# overwrite the Position column
In [24]: src['Position'] = src['Position'] + src['Velocity'] # 1 + 2 = 3
In [25]: print("Velocity = ", src.compute(src['Velocity'])) # all equal to 2
Velocity = [[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]]
In [26]: print("Position = ", src.compute(src['Position'])) # all equal to 3
Position = [[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]]
```

A useful operation in large-scale structure is the mapping of positions
in simulations from real space to redshift space, referred to
as redshift space distortions (RSD).
This operation can be easily performed by combining the `Position`

and
`Velocity`

columns to overwrite the `Position`

column. As first
found by Kaiser 1987,
the mapping from real to redshift space is:

\[s = r + \frac{\vv \cdot \nhat}}{a H},\]

where \(r\) is the line-of-sight position in real space, \(s\) is the line-of-sight position in redshift space, \(\vv\) is the velocity vector, \(\vnhat\) is the line-of-sight unit vector, \(a\) is the scale factor, and \(H\) is the Hubble parameter at \(a\).

As an example, below we add RSD along the `z`

axis of a simulation box:

```
# apply RSD along the z axis
In [27]: line_of_sight = [0,0,1]
# redshift and cosmology
In [28]: redshift = 0.55; cosmo = cosmology.Cosmology(Om0=0.31, H0=70)
# the RSD normalization factor
In [29]: rsd_factor = (1+redshift) / (100 * cosmo.efunc(redshift))
# update Position, applying RSD
In [30]: src['Position'] = src['Position'] + rsd_factor * src['Velocity'] * line_of_sight
```

The RSD factor is known as the conformal Hubble parameter
\(\mathcal{H} = a H(a)\). This calculation requires a cosmology,
which can be specified via the
`Cosmology`

class. We use the
`efunc()`

function which returns
\(E(z)\), where the Hubble parameter is defined as \(H(z) = 100h\ E(z)\)
in units of km/s/Mpc. Note that the operation above assumes the `Position`

column is in units of \(\mathrm{Mpc}/h\).

For catalogs in nbodykit that generate mock data, such as the
log-normal catalogs or HOD catalogs,
there is an additional column, `VelocityOffset`

, available to facilitate
RSD operations. This column has units of \(\mathrm{Mpc}/h\) and
includes the `rsd_factor`

above. Thus, this allows users to add RSD
simply by using:

```
src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight
```

A subset of a `CatalogSource`

object can be selected using slice notation.
There are two ways to select a subset:

- use a boolean array, which specifies which rows of the catalog to select
- use a slice object specifying which rows of the catalog to select

For example,

```
# boolean selection array
In [31]: select = cat['Mass'] < 0.5
In [32]: print("number of True entries = ", cat.compute(select.sum()))
number of True entries = 53
# select only entries where select = True
In [33]: subcat = cat[select]
In [34]: print("size of subcat = ", subcat.size)
size of subcat = 53
# select the first ten rows
In [35]: subcat = cat[:10]
In [36]: print("size of subcat = ", subcat.size)
size of subcat = 10
# select first and last row
In [37]: subcat = cat[[0, -1]]
In [38]: print("size of subcat = ", subcat.size)
size of subcat = 2
```

**Caveats**

- When indexing with a boolean array, the array must have the same length as
the
`size`

attribute, or an exception will be raised. - Selecting a single object by indexing with an integer is not supported. If the user wishes to select a single row, a list of length one can be used to select the specific row.

`CatalogSource`

¶A subset of columns can be selected from a `CatalogSource`

object by
indexing the catalog with a list of the names of the desired columns.
For example,

```
In [39]: print("columns in catalog = ", cat.columns)
columns in catalog = ['Mass', 'Position', 'Selection', 'Type', 'Value', 'Velocity', 'Weight']
# select Position + Mass
In [40]: subcat = cat[['Position', 'Mass']]
# the selected columns + default columns
In [41]: print("columns in subset = ", subcat.columns)
columns in subset = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
```

**Caveats**

- When selecting a subset of columns, note that in addition to the desired columns,
the sub-catalog will also contain the
default columns (
`Weight`

,`Value`

, and`Selection`

).

`nbodykit.transform`

module¶The `nbodykit.transform`

module includes several commonly used functions
for convenience. We describe a few of the most common use cases in the sub-sections
below.

Note

The `transform`

module is available to users when
`from nbodykit.lab import *`

is executed.

`CatalogSource`

Objects¶When `CatalogSource`

objects have the same columns, they can be
concatenated together into a single object using the
`nbodykit.transform.ConcatenateSources()`

function. For example,

```
In [42]: cat1 = UniformCatalog(nbar=50, BoxSize=1.0, seed=42)
In [43]: cat2 = UniformCatalog(nbar=150, BoxSize=1.0, seed=42)
In [44]: combined = transform.ConcatenateSources(cat1, cat2)
In [45]: print("total size = %d + %d = %d" %(cat1.size, cat2.size, combined.size))
total size = 47 + 145 = 192
```

Another common use case is when users need to combine separate
data columns vertically, as the columns of a new array. For example, often the
Cartesian position coordinates `x`

, `y`

, and `z`

are stored as separate
columns, and the `Position`

column must be added to a catalog from these
individual columns. We provide the `nbodykit.transform.StackColumns()`

function for this exact purpose. For example,

```
# fake position data
In [46]: data = numpy.random.random(size=(5,3))
# save to a plaintext file
In [47]: numpy.savetxt('csv-example.dat', data, fmt='%.7e')
# the cartesian coordinates
In [48]: names =['x', 'y', 'z']
# read the data
In [49]: f = CSVCatalog('csv-example.dat', names)
# make the "Position" column
In [50]: f['Position'] = transform.StackColumns(f['x'], f['y'], f['z'])
In [51]: print(f['Position'])
dask.array<transpose, shape=(5, 3), dtype=float64, chunksize=(5, 1)>
In [52]: print(f.compute(f['Position']))
[[ 0.73436277 0.39534468 0.567177 ]
[ 0.86954257 0.99191751 0.7192412 ]
[ 0.54870289 0.43708508 0.97382372]
[ 0.22716903 0.35559318 0.666647 ]
[ 0.43245037 0.11340447 0.06631885]]
```

We provide the function `nbodykit.transform.SkyToCartesion()`

for converting
sky coordinates, in the form of right ascension, declination, and redshift,
to Cartesian coordinates. The conversion from redshift to comoving distance
requires a cosmology instance, which can be specified via the
`Cosmology`

class.

Below, we initialize a catalog holding random right ascension,
declination, and redshift coordinates, and then add the Cartesian position
as the `Position`

column.

```
In [53]: src = RandomCatalog(100, seed=42)
# add random (ra, dec, z) coordinates
In [54]: src['z'] = src.rng.normal(loc=0.5, scale=0.1, size=src.size)
In [55]: src['ra'] = src.rng.uniform(low=0, high=360, size=src.size)
In [56]: src['dec'] = src.rng.uniform(low=-180, high=180., size=src.size)
# initialize a set of cosmology parameters
In [57]: cosmo = cosmology.Cosmology(Om0=0.31, H0=70)
# add the position
In [58]: src['Position'] = transform.SkyToCartesion(src['ra'], src['dec'], src['z'], degrees=True, cosmo=cosmo)
```

**Caveats**

- Whether the right ascension and declination arrays are in degrees
(as opposed to radians) should be specified via the
`degrees`

keyword. - The units of the returned
`Position`

column are \(\mathrm{Mpc}/h\). - By default, the redshift to comoving distance calculation uses interpolated
results for significant speed-ups. Set
`interpolate_cdist`

to`False`

to compute exact results.

`dask.array`

module¶For more general column transformations, users should take advantage of the
`dask.array`

module, which implements most functions in the `numpy`

package in a manner optimized for dask arrays. The module can be accessed from the
`nbodykit.transform`

module as `nbodykit.transform.da`

.

Important

For a full list of functions available in the `dask.array`

module,
please see the dask array documentation.
We strongly recommend that new users read through this documentation
and familiarize themselves with the functionality provided by
the `dask.array`

module.

As a simple illustration, below we convert an array holding right ascension values from
degrees to radians, compute the sine of the array, and find the min and
max values using functions available in the `dask.array`

module.

```
In [59]: ra = transform.da.deg2rad(src['ra']) # from degrees to radians
In [60]: sin_ra = transform.da.sin(ra) # compute the sine
In [61]: print("min(sin(ra)) = ", src.compute(sin_ra.min()))
min(sin(ra)) = -0.999061041207
In [62]: print("max(sin(ra)) = ", src.compute(sin_ra.max()))
max(sin(ra)) = 0.999599541358
```