Here, we detail some of the most common operations when dealing with data in the form of a CatalogSource. The native format for data columns in a CatalogSource object is the dask array. Be sure to read the previous section for an introduction to dask arrays before proceeding.
The dask array format allows users to easily manipulate columns in their input data and feed any transformed data into one of the nbodykit algorithms. This provides a fast and easy way to transform the data while hiding the implementation details needed to compute these transformations internally. In this section, we’ll provide examples of some of these data transformations to get users acclimated to dask arrays quickly.
To help illustrate these operations, we’ll initialize the nbodykit “lab” and load a catalog of uniformly distributed objects.
In [2]:
from nbodykit.lab import *
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
Specific columns can be accessed by indexing the catalog object with the column name, which returns a dask.array.Array object (see What is a dask array? for more details on dask arrays).
In [3]:
position = cat['Position']
velocity = cat['Velocity']
print(position)
print(velocity)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [ 0.45470105 0.83263203 0.06905134] last: [ 0.62474599 0.15388738 0.84302209]
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [ 0.00454701 0.00832632 0.00069051] last: [ 0.00624746 0.00153887 0.00843022]
While in the format of the dask array, data columns can easily be manipulated by the user. For example, here we normalize the position coordinates to the range 0 to 1 by dividing by the box size:
In [4]:
# normalize the position
normed_position = position / cat.attrs['BoxSize']
print(normed_position)
dask.array<truediv, shape=(96, 3), dtype=float64, chunksize=(96, 3)>
Note that the normalized position array is also a dask array and that the actual normalization operation has not yet been evaluated; dask builds the operation lazily, which makes these kinds of data transformations very fast for the user.
Columns can be converted from dask.array.Array objects to numpy arrays using the compute() function (see Evaluating a dask array for further details on computing dask arrays).
In [5]:
position, velocity = cat.compute(cat['Position'], cat['Velocity'])
print(type(position))
print(type(velocity))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
We can also compute the maximum of the normalized position coordinates computed above:
In [6]:
maxpos = normed_position.max(axis=0)
print(maxpos)
print(cat.compute(maxpos))
dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
[ 0.9927406 0.99610592 0.99925086]
New columns can be easily added to a CatalogSource object by directly setting them:
In [7]:
# no "Mass" column originally
print("contains 'Mass'? :", 'Mass' in cat)
# add a random array as the "Mass" column
cat['Mass'] = numpy.random.random(size=len(cat))
# "Mass" exists!
print("contains 'Mass'? :", 'Mass' in cat)
# can also add scalar values -- converted to correct length
cat['Type'] = b"central"
print(cat['Mass'])
print(cat['Type'])
contains 'Mass'? : False
contains 'Mass'? : True
dask.array<array, shape=(96,), dtype=float64, chunksize=(96,)> first: 0.374540118847 last: 0.493795596364
dask.array<array, shape=(96,), dtype=|S7, chunksize=(96,)> first: b'central' last: b'central'
Here, we have added two new columns to the catalog, Mass and Type. Internally, nbodykit stores the new columns as dask arrays and will automatically convert them to the correct type if they are not already.
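The scalar broadcast above can be sketched with plain numpy: setting a scalar column amounts to tiling the value to one entry per row. The helper below is an illustration of that behavior, not nbodykit's internal code.

```python
import numpy as np

def set_scalar_column(n_rows, value):
    """Illustration: broadcast a scalar column value to one entry per row."""
    return np.full(n_rows, value)

# a bytestring scalar becomes a fixed-width |S array of the catalog's length
col = set_scalar_column(96, b"central")
print(col.dtype, col.shape)   # |S7 (96,)
```

Note that numpy infers the fixed-width string dtype (`|S7`) from the scalar value itself, which matches the dtype shown in the Type column output above.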
Caveats
The same syntax used for adding new columns can also be used to overwrite columns that already exist in a CatalogSource. This procedure works as one would expect: the most up-to-date values of the columns are always used in operations.
In the example below, we overwrite both the Position and Velocity columns; each time the columns are accessed, the most up-to-date values are used, as expected.
In [8]:
# some fake data
data = numpy.ones(5, dtype=[
('Position', ('f4', 3)),
('Velocity', ('f4', 3))]
)
# initialize a catalog directly from the structured array
src = ArrayCatalog(data)
# overwrite the Velocity column
src['Velocity'] = src['Position'] + src['Velocity'] # 1 + 1 = 2
# overwrite the Position column
src['Position'] = src['Position'] + src['Velocity'] # 1 + 2 = 3
print("Velocity = ", src.compute(src['Velocity'])) # all equal to 2
print("Position = ", src.compute(src['Position'])) # all equal to 3
Velocity = [[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]
[ 2. 2. 2.]]
Position = [[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]
[ 3. 3. 3.]]
A useful operation in large-scale structure is the mapping of positions in simulations from real space to redshift space, referred to as redshift-space distortions (RSD). This operation can be easily performed by combining the Position and Velocity columns to overwrite the Position column. As first derived by Kaiser 1987, the mapping from real to redshift space is:

\[ s = r + \frac{\vec{v} \cdot \hat{n}}{a H(a)}, \]

where \(r\) is the line-of-sight position in real space, \(s\) is the line-of-sight position in redshift space, \(\vec{v}\) is the velocity vector, \(\hat{n}\) is the line-of-sight unit vector, \(a\) is the scale factor, and \(H(a)\) is the Hubble parameter at \(a\).
As an example, below we add RSD along the z axis of a simulation box:
In [9]:
# apply RSD along the z axis
line_of_sight = [0,0,1]
# redshift and cosmology
redshift = 0.55; cosmo = cosmology.Cosmology(h=0.7).match(Omega0_m=0.31)
# the RSD normalization factor
rsd_factor = (1+redshift) / (100 * cosmo.efunc(redshift))
# update Position, applying RSD
src['Position'] = src['Position'] + rsd_factor * src['Velocity'] * line_of_sight
The RSD factor is the inverse of the conformal Hubble parameter \(\mathcal{H} = a H(a)\). This calculation requires a cosmology, which can be specified via the Cosmology class. We use the efunc() function, which returns \(E(z)\), where the Hubble parameter is defined as \(H(z) = 100h\ E(z)\) in units of km/s/Mpc. Note that the operation above assumes the Position column is in units of \(\mathrm{Mpc}/h\).
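As a sanity check on the units: with \(a = 1/(1+z)\) and \(H(z) = 100h\ E(z)\), the factors of \(h\) cancel when positions carry \(\mathrm{Mpc}/h\), so rsd_factor is exactly \(1/\mathcal{H}\). A minimal sketch with a hand-rolled flat ΛCDM \(E(z)\) (the Omega value below is illustrative, not read from the Cosmology object above):

```python
import numpy as np

def efunc(z, Om0=0.31):
    """E(z) = H(z)/H0 for a flat LCDM cosmology (illustrative parameters)."""
    return np.sqrt(Om0 * (1 + z)**3 + (1 - Om0))

z = 0.55
rsd_factor = (1 + z) / (100 * efunc(z))

# equivalently 1/(aH), with a = 1/(1+z) and H(z) = 100 E(z) in h km/s/Mpc units
a = 1 / (1 + z)
conformal_H = a * 100 * efunc(z)
print(np.isclose(rsd_factor, 1 / conformal_H))   # True
```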
For catalogs in nbodykit that generate mock data, such as the log-normal or HOD catalogs, an additional column, VelocityOffset, is available to facilitate RSD operations. This column has units of \(\mathrm{Mpc}/h\) and already includes the rsd_factor above. Thus, users can add RSD simply by using:
src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight
A subset of a CatalogSource object can be selected using slice notation. There are two ways to select a subset:

1. a boolean array with the same length as the catalog, specifying which rows to select, or
2. slice syntax or a list of integer indices, specifying which rows to select.
For example,
In [10]:
# boolean selection array
select = cat['Mass'] < 0.5
print("number of True entries = ", cat.compute(select.sum()))
# select only entries where select = True
subcat = cat[select]
print("size of subcat = ", subcat.size)
# select the first ten rows
subcat = cat[:10]
print("size of subcat = ", subcat.size)
# select first and last row
subcat = cat[[0, -1]]
print("size of subcat = ", subcat.size)
number of True entries = 50
size of subcat = 50
size of subcat = 10
size of subcat = 2
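The two selection styles mirror standard numpy indexing; a rough sketch of what happens to each column under the hood (plain numpy here, whereas nbodykit applies the same indexing to dask arrays):

```python
import numpy as np

mass = np.random.default_rng(42).random(96)

# boolean selection: keeps only the rows where the mask is True
mask = mass < 0.5
subset = mass[mask]
print(len(subset) == mask.sum())   # True

# slice selection: the first ten rows
print(mass[:10].shape)             # (10,)

# list-of-indices selection: the first and last rows
print(mass[[0, -1]].shape)         # (2,)
```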
Caveats
When selecting with a boolean array, the array must have the same length as the catalog's size attribute, or an exception will be raised.

Selecting a Subset of Columns from a CatalogSource

A subset of columns can be selected from a CatalogSource object by indexing the catalog with a list of the names of the desired columns.
For example,
In [11]:
print("columns in catalog = ", cat.columns)
# select Position + Mass
subcat = cat[['Position', 'Mass']]
# the selected columns + default columns
print("columns in subset = ", subcat.columns)
columns in catalog = ['Mass', 'Position', 'Selection', 'Type', 'Value', 'Velocity', 'Weight']
columns in subset = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
Caveats
In addition to the selected columns, the returned catalog always retains the default columns (Weight, Value, and Selection).

The nbodykit.transform module

The nbodykit.transform module includes several commonly used functions for convenience. We describe a few of the most common use cases in the sub-sections below.
Note
The transform module is available to users when from nbodykit.lab import * is executed.
Concatenating CatalogSource Objects

When CatalogSource objects have the same columns, they can be concatenated together into a single object using the nbodykit.transform.ConcatenateSources() function. For example,
In [12]:
cat1 = UniformCatalog(nbar=50, BoxSize=1.0, seed=42)
cat2 = UniformCatalog(nbar=150, BoxSize=1.0, seed=42)
combined = transform.ConcatenateSources(cat1, cat2)
print("total size = %d + %d = %d" %(cat1.size, cat2.size, combined.size))
total size = 47 + 145 = 192
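Under the hood, concatenation is row-wise stacking of each shared column; a sketch with numpy structured arrays (ConcatenateSources additionally checks that the catalogs expose the same columns):

```python
import numpy as np

# two "catalogs" with identical columns, as structured arrays
dtype = [('Position', ('f8', 3)), ('Velocity', ('f8', 3))]
cat1 = np.ones(47, dtype=dtype)
cat2 = np.ones(145, dtype=dtype)

# row-wise concatenation preserves the columns and adds the sizes
combined = np.concatenate([cat1, cat2])
print(len(combined))   # 192
```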
Another common use case is when users need to combine separate data columns vertically, as the columns of a new array. For example, the Cartesian position coordinates x, y, and z are often stored as separate columns, and the Position column must be added to a catalog from these individual columns. We provide the nbodykit.transform.StackColumns() function for this exact purpose. For example,
In [13]:
# fake position data
data = numpy.random.random(size=(5,3))
# save to a plaintext file
numpy.savetxt('csv-example.dat', data, fmt='%.7e')
# the cartesian coordinates
names =['x', 'y', 'z']
# read the data
f = CSVCatalog('csv-example.dat', names)
# make the "Position" column
f['Position'] = transform.StackColumns(f['x'], f['y'], f['z'])
print(f['Position'])
print(f.compute(f['Position']))
dask.array<transpose, shape=(5, 3), dtype=float64, chunksize=(5, 1)> first: [ 0.52273283 0.42754102 0.02541913] last: [ 0.22879817 0.07697991 0.28975145]
[[ 0.52273283 0.42754102 0.02541913]
[ 0.10789143 0.03142919 0.63641041]
[ 0.31435598 0.50857069 0.90756647]
[ 0.24929223 0.41038292 0.75555114]
[ 0.22879817 0.07697991 0.28975145]]
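The stacking itself is equivalent to numpy's column_stack: three length-N columns become one (N, 3) array. A plain numpy sketch:

```python
import numpy as np

# three separate coordinate columns of length N
x = np.array([0.1, 0.2, 0.3])
y = np.array([0.4, 0.5, 0.6])
z = np.array([0.7, 0.8, 0.9])

# stack them as the columns of a single (N, 3) array
position = np.column_stack([x, y, z])
print(position.shape)   # (3, 3)
print(position[0])      # [0.1 0.4 0.7]
```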
We provide the function nbodykit.transform.SkyToCartesian() for converting sky coordinates, in the form of right ascension, declination, and redshift, to Cartesian coordinates. The conversion from redshift to comoving distance requires a cosmology instance, which can be specified via the Cosmology class. Below, we initialize a catalog holding random right ascension, declination, and redshift coordinates, and then add the Cartesian position as the Position column.
In [14]:
src = RandomCatalog(100, seed=42)
# add random (ra, dec, z) coordinates
src['z'] = src.rng.normal(loc=0.5, scale=0.1, size=src.size)
src['ra'] = src.rng.uniform(low=0, high=360, size=src.size)
src['dec'] = src.rng.uniform(low=-90, high=90., size=src.size)
# initialize a set of cosmology parameters
cosmo = cosmology.Cosmology(h=0.7)
# add the position
src['Position'] = transform.SkyToCartesian(src['ra'], src['dec'], src['z'], degrees=True, cosmo=cosmo)
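The underlying conversion is the usual spherical-to-Cartesian mapping, with the radius given by the comoving distance to each redshift. A minimal numpy sketch, with the comoving distance passed in directly (nbodykit computes it from the Cosmology object):

```python
import numpy as np

def sky_to_cartesian(ra_deg, dec_deg, r):
    """Illustration: (ra, dec, comoving distance) -> Cartesian (x, y, z)."""
    ra, dec = np.deg2rad(ra_deg), np.deg2rad(dec_deg)
    x = r * np.cos(dec) * np.cos(ra)
    y = r * np.cos(dec) * np.sin(ra)
    z = r * np.sin(dec)
    return np.stack([x, y, z], axis=-1)

# ra = dec = 0 points along the x axis, at the comoving distance
pos = sky_to_cartesian(np.array([0.0]), np.array([0.0]), np.array([100.0]))
print(pos)   # a single point on the x axis at distance 100
```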
Caveats
By default, the right ascension and declination inputs are assumed to be in degrees; this can be controlled via the degrees keyword.
The units of the returned Position column are \(\mathrm{Mpc}/h\).

The dask.array module

For more general column transformations, users should take advantage of the dask.array module, which implements most functions in the numpy package in a manner optimized for dask arrays. The module can be accessed from the nbodykit.transform module as nbodykit.transform.da.
Important
For a full list of functions available in the dask.array module, please see the dask array documentation. We strongly recommend that new users read through this documentation and familiarize themselves with the functionality provided by the dask.array module.
As a simple illustration, below we convert an array holding right ascension values from degrees to radians, compute the sine of the array, and find the min and max values, using functions available in the dask.array module.
In [15]:
ra = transform.da.deg2rad(src['ra']) # from degrees to radians
sin_ra = transform.da.sin(ra) # compute the sine
print("min(sin(ra)) = ", src.compute(sin_ra.min()))
print("max(sin(ra)) = ", src.compute(sin_ra.max()))
min(sin(ra)) = -0.999061041207
max(sin(ra)) = 0.999599541358
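Because dask.array mirrors the numpy API, the same pipeline can be checked eagerly with plain numpy. A sketch on standalone random data (the values below are freshly generated, not the src catalog above):

```python
import numpy as np

# random right ascension values in degrees
ra = np.random.default_rng(42).uniform(0, 360, size=100)

# same deg2rad -> sin pipeline, evaluated eagerly instead of lazily
sin_ra = np.sin(np.deg2rad(ra))
print(-1 <= sin_ra.min() <= sin_ra.max() <= 1)   # True
```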