Common Data Operations¶
Here, we detail some of the most common operations when dealing with
data in the form of a CatalogSource
. The native format for data columns
in a CatalogSource
object is the dask array. Be sure to read
the previous section for an introduction to dask arrays
before proceeding.
The dask array format allows users to easily manipulate columns in their input data and feed any transformed data into one of the nbodykit algorithms. This provides a fast and easy way to transform the data while hiding the implementation details needed to compute these transformations internally. In this section, we’ll provide examples of some of these data transformations to get users acclimated to dask arrays quickly.
To help illustrate these operations, we’ll initialize the nbodykit “lab” and load a catalog of uniformly distributed objects.
[2]:
from nbodykit.lab import *
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Accessing Data Columns¶
Specific columns can be accessed by indexing the catalog object using the
column name, and a dask.array.Array
object is returned (see
What is a dask array? for more details on dask arrays).
[3]:
position = cat['Position']
velocity = cat['Velocity']
print(position)
print(velocity)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.45470105 0.83263203 0.06905134] last: [0.62474599 0.15388738 0.84302209]
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3)> first: [0.0006346 0.00675438 0.00704942] last: [0.00375581 0.00046149 0.00819726]
While in the format of the dask array, data columns can easily be manipulated by the user. For example, here we normalize the position coordinates to the range 0 to 1 by dividing by the box size:
[4]:
# normalize the position
normed_position = position / cat.attrs['BoxSize']
print(normed_position)
dask.array<truediv, shape=(96, 3), dtype=float64, chunksize=(96, 3)>
Note that the normalized position array is also a dask array and that the actual normalization operation is yet to occur. This makes these kinds of data transformations very fast for the user.
Computing Data Columns¶
Columns can be converted from dask.array.Array
objects to
numpy arrays using the compute()
function (see
Evaluating a dask array for further details on computing dask arrays).
[5]:
position, velocity = cat.compute(cat['Position'], cat['Velocity'])
print(type(position))
print(type(velocity))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
We can also compute the max of the normalized position coordinates from the previous section:
[6]:
maxpos = normed_position.max(axis=0)
print(maxpos)
print(cat.compute(maxpos))
dask.array<amax-aggregate, shape=(3,), dtype=float64, chunksize=(3,)>
[0.9927406 0.99610592 0.99925086]
Adding New Columns¶
New columns can be easily added to a CatalogSource
object by
directly setting them:
[7]:
# no "Mass" column originally
print("contains 'Mass'? :", 'Mass' in cat)
# add a random array as the "Mass" column
cat['Mass'] = numpy.random.random(size=len(cat))
# "Mass" exists!
print("contains 'Mass'? :", 'Mass' in cat)
# can also add scalar values -- converted to correct length
cat['Type'] = b"central"
print(cat['Mass'])
print(cat['Type'])
contains 'Mass'? : False
contains 'Mass'? : True
dask.array<array, shape=(96,), dtype=float64, chunksize=(96,)> first: 0.3745401188473625 last: 0.49379559636439074
dask.array<array, shape=(96,), dtype=|S7, chunksize=(96,)> first: b'central' last: b'central'
Here, we have added two new columns to the catalog, Mass
and Type
.
Internally, nbodykit stores the new columns as dask arrays and
will automatically convert them to the correct type if they are not already.
Caveats
New columns must be either be a scalar value, or an array with the same length as the catalog. Scalar values will automatically be broadcast to the correct length.
Setting a column of the wrong length will raise an exception.
Overwriting Columns¶
The same syntax used for adding new columns can also be used to overwrite
columns that already exist in a CatalogSource
. This procedure
works as one would expect – the most up-to-date values of columns are
always used in operations.
In the example below we overwrite both the Position
and Velocity
columns, and each time the columns are accessed, the most up-to-date values
are used, as expected.
[8]:
# some fake data
data = numpy.ones(5, dtype=[
('Position', ('f4', 3)),
('Velocity', ('f4', 3))]
)
# initialize a catalog directly from the structured array
src = ArrayCatalog(data)
# overwrite the Velocity column
src['Velocity'] = src['Position'] + src['Velocity'] # 1 + 1 = 2
# overwrite the Position column
src['Position'] = src['Position'] + src['Velocity'] # 1 + 2 = 3
print("Velocity = ", src.compute(src['Velocity'])) # all equal to 2
print("Position = ", src.compute(src['Position'])) # all equal to 3
Velocity = [[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]]
Position = [[3. 3. 3.]
[3. 3. 3.]
[3. 3. 3.]
[3. 3. 3.]
[3. 3. 3.]]
Adding Redshift-space Distortions¶
A useful operation in large-scale structure is the mapping of positions
in simulations from real space to redshift space, referred to
as redshift space distortions (RSD).
This operation can be easily performed by combining the Position
and
Velocity
columns to overwrite the Position
column. As first
found by Kaiser 1987,
the mapping from real to redshift space is:
where \(r\) is the line-of-sight position in real space, \(s\) is the line-of-sight position in redshift space, \(\vv\) is the velocity vector, \(\vnhat\) is the line-of-sight unit vector, \(a\) is the scale factor, and \(H\) is the Hubble parameter at \(a\).
As an example, below we add RSD along the z
axis of a simulation box:
[9]:
# apply RSD along the z axis
line_of_sight = [0,0,1]
# redshift and cosmology
redshift = 0.55; cosmo = cosmology.Cosmology(h=0.7).match(Omega0_m=0.31)
# the RSD normalization factor
rsd_factor = (1+redshift) / (100 * cosmo.efunc(redshift))
# update Position, applying RSD
src['Position'] = src['Position'] + rsd_factor * src['Velocity'] * line_of_sight
The RSD factor is known as the conformal Hubble parameter
\(\mathcal{H} = a H(a)\). This calculation requires a cosmology,
which can be specified via the
Cosmology
class. We use the
efunc()
function which returns
\(E(z)\), where the Hubble parameter is defined as \(H(z) = 100h\ E(z)\)
in units of km/s/Mpc. Note that the operation above assumes the Position
column is in units of \(\mathrm{Mpc}/h\).
For catalogs in nbodykit that generate mock data, such as the
log-normal catalogs or HOD catalogs,
there is an additional column, VelocityOffset
, available to facilitate
RSD operations. This column has units of \(\mathrm{Mpc}/h\) and
includes the rsd_factor
above. Thus, this allows users to add RSD
simply by using:
src['Position'] = src['Position'] + src['VelocityOffset'] * line_of_sight
Selecting a Subset¶
A subset of a CatalogSource
object can be selected using slice notation.
There are two ways to select a subset:
use a boolean array, which specifies which rows of the catalog to select
use a slice object specifying which rows of the catalog to select
For example,
[10]:
# boolean selection array
select = cat['Mass'] < 0.5
print("number of True entries = ", cat.compute(select.sum()))
# select only entries where select = True
subcat = cat[select]
print("size of subcat = ", subcat.size)
# select the first ten rows
subcat = cat[:10]
print("size of subcat = ", subcat.size)
# select first and last row
subcat = cat[[0, -1]]
print("size of subcat = ", subcat.size)
number of True entries = 50
size of subcat = 50
size of subcat = 10
size of subcat = 2
Caveats
When indexing with a boolean array, the array must have the same length as the
size
attribute, or an exception will be raised.Selecting a single object by indexing with an integer is not supported. If the user wishes to select a single row, a list of length one can be used to select the specific row.
Selecting a Subset of Columns from a CatalogSource
¶
A subset of columns can be selected from a CatalogSource
object by
indexing the catalog with a list of the names of the desired columns.
For example,
[11]:
print("columns in catalog = ", cat.columns)
# select Position + Mass
subcat = cat[['Position', 'Mass']]
# the selected columns + default columns
print("columns in subset = ", subcat.columns)
columns in catalog = ['Mass', 'Position', 'Selection', 'Type', 'Value', 'Velocity', 'Weight']
columns in subset = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
Caveats
When selecting a subset of columns, note that in addition to the desired columns, the sub-catalog will also contain the default columns (
Weight
,Value
, andSelection
).
The nbodykit.transform
module¶
The nbodykit.transform
module includes several commonly used functions
for convenience. We describe a few of the most common use cases in the sub-sections
below.
Note
The transform
module is available to users when
from nbodykit.lab import *
is executed.
Concatenating CatalogSource
Objects¶
When CatalogSource
objects have the same columns, they can be
concatenated together into a single object using the
nbodykit.transform.ConcatenateSources()
function. For example,
[12]:
cat1 = UniformCatalog(nbar=50, BoxSize=1.0, seed=42)
cat2 = UniformCatalog(nbar=150, BoxSize=1.0, seed=42)
combined = transform.ConcatenateSources(cat1, cat2)
print("total size = %d + %d = %d" %(cat1.size, cat2.size, combined.size))
total size = 47 + 145 = 192
Stacking Columns Together¶
Another common use case is when users need to combine separate
data columns vertically, as the columns of a new array. For example, often the
Cartesian position coordinates x
, y
, and z
are stored as separate
columns, and the Position
column must be added to a catalog from these
individual columns. We provide the nbodykit.transform.StackColumns()
function for this exact purpose. For example,
[13]:
# fake position data
data = numpy.random.random(size=(5,3))
# save to a plaintext file
numpy.savetxt('csv-example.dat', data, fmt='%.7e')
# the cartesian coordinates
names =['x', 'y', 'z']
# read the data
f = CSVCatalog('csv-example.dat', names)
# make the "Position" column
f['Position'] = transform.StackColumns(f['x'], f['y'], f['z'])
print(f['Position'])
print(f.compute(f['Position']))
dask.array<transpose, shape=(5, 3), dtype=float64, chunksize=(5, 1)> first: [0.52273283 0.42754102 0.02541913] last: [0.22879817 0.07697991 0.28975145]
[[0.52273283 0.42754102 0.02541913]
[0.10789143 0.03142919 0.63641041]
[0.31435598 0.50857069 0.90756647]
[0.24929223 0.41038292 0.75555114]
[0.22879817 0.07697991 0.28975145]]
Converting from Sky to Cartesian Coordinates¶
We provide the function nbodykit.transform.SkyToCartesian()
for converting
sky coordinates, in the form of right ascension, declination, and redshift,
to Cartesian coordinates. The conversion from redshift to comoving distance
requires a cosmology instance, which can be specified via the
Cosmology
class.
Below, we initialize a catalog holding random right ascension,
declination, and redshift coordinates, and then add the Cartesian position
as the Position
column.
The random number generator provided by a RandomCatalog always generate the correct number of items for the catalog. Use the ? to see its docstring.
[14]:
src = RandomCatalog(100, seed=42)
# add random (ra, dec, z) coordinates
src['z'] = src.rng.normal(loc=0.5, scale=0.1)
src['ra'] = src.rng.uniform(low=0, high=360)
src['dec'] = src.rng.uniform(low=-180, high=180.)
# initialize a set of cosmology parameters
cosmo = cosmology.Cosmology(h=0.7)
# add the position
src['Position'] = transform.SkyToCartesian(src['ra'], src['dec'], src['z'], degrees=True, cosmo=cosmo)
Caveats
Whether the right ascension and declination arrays are in degrees (as opposed to radians) should be specified via the
degrees
keyword.The units of the returned
Position
column are \(\mathrm{Mpc}/h\).
Using the dask.array
module¶
For more general column transformations, users should take advantage of the
dask.array
module, which implements most functions in the numpy
package in a manner optimized for dask arrays. The module can be accessed from the
nbodykit.transform
module as nbodykit.transform.da
.
Important
For a full list of functions available in the dask.array
module,
please see the dask array documentation.
We strongly recommend that new users read through this documentation
and familiarize themselves with the functionality provided by
the dask.array
module.
As a simple illustration, below we convert an array holding right ascension values from
degrees to radians, compute the sine of the array, and find the min and
max values using functions available in the dask.array
module.
[15]:
ra = transform.da.deg2rad(src['ra']) # from degrees to radians
sin_ra = transform.da.sin(ra) # compute the sine
print("min(sin(ra)) = ", src.compute(sin_ra.min()))
print("max(sin(ra)) = ", src.compute(sin_ra.max()))
min(sin(ra)) = -0.999907640154132
max(sin(ra)) = 0.9988053754673173