Reading Catalogs from Disk¶
Supported Data Formats¶
nbodykit provides support for initializing
CatalogSource
objects by reading tabular data
stored on disk in a variety of formats: plaintext (CSV), binary, HDF5, bigfile, and FITS.
In this section, we provide short examples illustrating how to read data stored in each of these formats. If your data format is not currently supported, you can either read the data in yourself and adapt it into a catalog (see Adapting in memory data to a Catalog), or write your own catalog class (see Reading a Custom Data Format).
Plaintext Data¶
Reading data stored as columns in plaintext files is supported via the
CSVCatalog
class. This class partitions the CSV file into chunks, and
data is read only from the relevant chunks of the file, using
the pandas.read_csv()
function. The class accepts any configuration
keywords that this function does. The partitioning step provides a significant
speed-up when reading from the end of the file, since the data preceding
that point does not need to be read first.
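The effect of chunked reading can be sketched with pandas directly. This is only an illustration of the idea, not the CSVCatalog implementation: the skiprows and nrows keywords let a reader jump straight to a range of rows near the end of a file.

```python
import io
import pandas as pd

# a small space-separated "file" with 10 rows and 3 columns
text = "\n".join("%d %d %d" % (i, 10 * i, 100 * i) for i in range(10))

# jump straight to rows 6-9: skiprows drops the first 6 lines and
# nrows stops after 4, so the earlier rows are never parsed
chunk = pd.read_csv(io.StringIO(text), sep=r'\s+',
                    names=['x', 'y', 'z'], skiprows=6, nrows=4)
print(chunk['x'].tolist())
```

A partitioned reader applies the same trick per chunk, so only the chunks overlapping the requested slice are parsed.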
Caveats

- By default, the class reads space-separated columns; this can be changed by setting delim_whitespace=False and specifying the delimiter keyword.
- A pandas index column is not supported – all columns should represent data columns to read.
- Commented lines in the file are not supported – please remove all comments from the file before loading into nbodykit.
- There should not be a header line in the file – column names should be passed to CSVCatalog via the names argument.
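Since the configuration keywords are forwarded to pandas.read_csv(), the delimiter behavior can be previewed with pandas alone. A minimal sketch, reading a comma-delimited string in place of a file:

```python
import io
import pandas as pd

# a comma-delimited "file" with no header line and no index column,
# matching the caveats above; column names come from ``names``
text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# the same keywords would be forwarded by CSVCatalog to
# pandas.read_csv(); here delimiter=',' replaces the default
# whitespace-separated behavior
df = pd.read_csv(io.StringIO(text), names=['x', 'y', 'z'], delimiter=',')
print(df['x'].tolist())
```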
As an example, below we generate 5 columns for 100 fake objects and write to a plaintext file:
[2]:
import numpy
from nbodykit.source.catalog import CSVCatalog
# generate some fake ASCII data
data = numpy.random.random(size=(100,5))
# save to a plaintext file
numpy.savetxt('csv-example.txt', data, fmt='%.7e')
# name each of the 5 input columns
names = ['x', 'y', 'z', 'w', 'v']
# read the data
f = CSVCatalog('csv-example.txt', names)
# combine x, y, z to Position, and add boxsize
f['Position'] = f['x'][:, None] * [1, 0, 0] + f['y'][:, None] * [0, 1, 0] + f['z'][:, None] * [0, 0, 1]
f.attrs['BoxSize'] = 1.0
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example.txt, dataset=*, ncolumns=5, shape=(100,)>, ... 1 files))
columns = ['Position', 'Selection', 'Value', 'Weight', 'v', 'w', 'x', 'y', 'z']
total size = 100
Binary Data¶
The BinaryCatalog
object reads binary data that is stored
on disk in a column-major format. The class can read any numpy data type
and can handle arbitrary byte offsets between columns.
Caveats
Columns must be stored in consecutive order in the binary file (column-major format).
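The consecutive layout and the byte offsets it implies can be illustrated with plain numpy; the file name below is just for the example:

```python
import numpy

# two "columns" written back to back: all of Position first, then all
# of Velocity (the consecutive, column-major layout described above)
pos = numpy.arange(12, dtype='f8').reshape(4, 3)
vel = numpy.arange(12, dtype='f8').reshape(4, 3) + 100.0
with open('layout-example.dat', 'wb') as ff:
    pos.tofile(ff)
    vel.tofile(ff)

# the Velocity column begins at a byte offset equal to the full size
# of the Position column: 4 rows x 3 components x 8 bytes = 96 bytes
offset = pos.nbytes
with open('layout-example.dat', 'rb') as ff:
    ff.seek(offset)
    vel2 = numpy.fromfile(ff, dtype='f8', count=12).reshape(4, 3)
print(numpy.allclose(vel, vel2))
```

This is the same arithmetic BinaryCatalog performs internally when it is told each column's dtype and shape.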
For example, below we save Position
and Velocity
columns to a binary
file and load them into a BinaryCatalog
:
[3]:
from nbodykit.source.catalog import BinaryCatalog
# generate some fake data and save to a binary file
with open('binary-example.dat', 'wb') as ff:
    pos = numpy.random.random(size=(1024, 3)) # fake Position column
    vel = numpy.random.random(size=(1024, 3)) # fake Velocity column
    pos.tofile(ff); vel.tofile(ff); ff.seek(0)
# create the binary catalog
f = BinaryCatalog(ff.name, [('Position', ('f8', 3)), ('Velocity', ('f8', 3))], size=1024)
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BinaryCatalog(size=1024, FileStack(BinaryFile(path=/tmp/tmpedljiijj/binary-example.dat, dataset=*, ncolumns=2, shape=(1024,)>, ... 1 files))
columns = ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size = 1024
HDF Data¶
The HDFCatalog
object uses the h5py
module to read
HDF5 files. The class supports reading columns stored in h5py.Dataset
objects and in h5py.Group
objects, assuming that all arrays are of the
same length since catalog objects must have a fixed size. Columns stored in
different datasets or groups can be accessed via their full path in the
HDF5 file.
Caveats
HDFCatalog
attempts to load all possible datasets or groups from the HDF5 file. This can present problems if the data has different lengths. Use the exclude
keyword to explicitly exclude data that has the wrong size.
In the example below, we load fake data from both the dataset “Data1” and
from the group “Data2” in an example HDF5 file. “Data1” is a single structured
numpy array with Position
and Velocity
columns, while “Data2” is a
group storing the Position
and Velocity
columns separately. nbodykit
is able to load both types of data from HDF5 files, and the corresponding
column names are the full paths of the data in the file.
[4]:
import h5py
from nbodykit.source.catalog import HDFCatalog
# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)
# write to a HDF5 file
with h5py.File('hdf-example.hdf5', 'w') as ff:
    ff.create_dataset('Data1', data=dset)
    grp = ff.create_group('Data2')
    grp.create_dataset('Position', data=dset['Position']) # column as dataset
    grp.create_dataset('Mass', data=dset['Mass']) # column as dataset
# initialize the catalog
f = HDFCatalog('hdf-example.hdf5')
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
HDFCatalog(size=1024, FileStack(HDFFile(path=/tmp/tmpedljiijj/hdf-example.hdf5, dataset=/, ncolumns=4, shape=(1024,)>, ... 1 files))
columns = ['Data1/Mass', 'Data1/Position', 'Data2/Mass', 'Data2/Position', 'Selection', 'Value', 'Weight']
total size = 1024
Bigfile Data¶
The bigfile package is a massively
parallel IO library for large, hierarchical datasets, and nbodykit supports
reading data stored in this format using BigFileCatalog
.
Caveats
Similar to the
HDFCatalog
class, datasets of the wrong size stored in the bigfile format should be explicitly excluded using the exclude
keyword.
Below, we load Position
and Velocity
columns, stored in the
bigfile
format:
[5]:
import bigfile
from nbodykit.source.catalog import BigFileCatalog
# generate some fake data
data = numpy.empty(512, dtype=[('Position', ('f8', 3)), ('Velocity', ('f8',3))])
data['Position'] = numpy.random.random(size=(512, 3))
data['Velocity'] = numpy.random.random(size=(512,3))
# save fake data to a BigFile
with bigfile.BigFile('bigfile-example', create=True) as tmpff:
    with tmpff.create("Position", dtype=('f4', 3), size=512) as bb:
        bb.write(0, data['Position'])
    with tmpff.create("Velocity", dtype=('f4', 3), size=512) as bb:
        bb.write(0, data['Velocity'])
    with tmpff.create("Header") as bb:
        bb.attrs['Size'] = 512.
# initialize the catalog
f = BigFileCatalog('bigfile-example', header='Header')
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BigFileCatalog(size=512, FileStack(BigFile(path=/tmp/tmpedljiijj/bigfile-example, dataset=./, ncolumns=2, shape=(512,)>, ... 1 files))
columns = ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size = 512
FITS Data¶
The FITS data format is supported via the
FITSCatalog
object. nbodykit relies on the
fitsio package to perform the read
operation.
Caveats
- The FITS file must contain a readable binary table of data.
- Specific extensions to read can be passed via the ext keyword. By default, data is read from the first HDU that has readable data.
For example, below we load Position
and Velocity
data from a FITS file:
[6]:
import fitsio
from nbodykit.source.catalog import FITSCatalog
# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)
# write to a FITS file using fitsio
fitsio.write('fits-example.fits', dset, extname='Data')
# initialize the catalog
f = FITSCatalog('fits-example.fits', ext='Data')
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
FITSCatalog(size=1024, FileStack(FITSFile(path=/tmp/tmpedljiijj/fits-example.fits, dataset=Data, ncolumns=2, shape=(1024,)>, ... 1 files))
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
Reading Multiple Data Files at Once¶
CatalogSource
objects support reading
multiple files at once, providing a continuous view of each individual catalog
stacked together. Each file read must contain the same data types, otherwise
the data cannot be combined into a single catalog.
This becomes particularly useful when the user has data
split into multiple files in a single directory, as is often the case when
processing large amounts of data. For example, output binary snapshots from
N-body simulations, often totaling 10GB - 100GB in size, can be read into a
single BinaryCatalog
with nbodykit.
When specifying multiple files to load, the user can pass either an explicit
list of file names or an asterisk glob pattern to match files.
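The glob expansion itself can be previewed with Python's built-in glob module; the file names below are hypothetical:

```python
import glob
import numpy

# write two files sharing a common prefix (hypothetical file names)
numpy.savetxt('glob-example-1.txt', numpy.zeros((40, 5)))
numpy.savetxt('glob-example-2.txt', numpy.zeros((60, 5)))

# the same pattern string given to CSVCatalog expands like this;
# each matched file becomes one entry in the stacked catalog
paths = sorted(glob.glob('glob-example-*.txt'))
print(paths)
```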
As an example, below, we read data from two plaintext files into a single
CSVCatalog
:
[7]:
# generate data
data = numpy.random.random(size=(100,5))
# save first 40 rows of data to file
numpy.savetxt('csv-example-1.txt', data[:40], fmt='%.7e')
# save the remaining 60 rows to another file
numpy.savetxt('csv-example-2.txt', data[40:], fmt='%.7e')
Using a glob pattern¶
[8]:
# the names of the columns in both files
names = ['a', 'b', 'c', 'd', 'e']
# read with a glob pattern
f = CSVCatalog('csv-example-*', names)
print(f)
# combined catalog size is 40+60=100
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size = 100
Using a list of file names¶
[9]:
# the names of the columns in both files
names = ['a', 'b', 'c', 'd', 'e']
# read with a list of the file names
f = CSVCatalog(['csv-example-1.txt', 'csv-example-2.txt'], names)
print(f)
# combined catalog size is 40+60=100
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size = 100
Adapting in memory data to a Catalog¶
A lightweight way of reading in data that nbodykit does not understand is to
read the data with existing tools, then adapt it into a catalog via
nbodykit.lab.ArrayCatalog
.
Here is an example that reads a file with numpy's load() function, then makes it into a catalog.
[10]:
from nbodykit.source.catalog import ArrayCatalog
# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)
# save to a npy file
numpy.save("npy-example.npy", data)
data = numpy.load("npy-example.npy")
# initialize the catalog
f = ArrayCatalog(data)
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
f = ArrayCatalog({'Position' : data['Position'], 'Mass' : data['Mass'] })
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
ArrayCatalog(size=1024)
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
ArrayCatalog(size=1024)
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
Reading a Custom Data Format¶
Users can implement their own subclasses of CatalogSource
for reading
custom data formats with a few easy steps. The core functionality of the
CatalogSource
classes described in this section use the
nbodykit.io
module for reading data from disk. This module implements the
nbodykit.io.base.FileType
base class, which is an abstract
class that behaves like a file
-like object. For the built-in
file formats discussed in this section, we have implemented the following
subclasses of FileType
in the nbodykit.io
module: CSVFile
, BinaryFile
,
BigFile
, HDFFile
, and FITSFile
.
To make a valid subclass of FileType
, users must:

1. Implement the read() function that reads a range of the data from disk.
2. Set the size in the __init__() function, specifying the total size of the data on disk.
3. Set the dtype in the __init__() function, specifying the type of data stored on disk.
Once we have the custom subclass implemented, the
nbodykit.source.catalog.file.FileCatalogFactory()
function can
be used to automatically create a custom CatalogSource
object
from the subclass.
As a toy example, we will illustrate how this is done for data saved
using the numpy .npy
format. First, we will implement our
subclass of the FileType
class:
[11]:
from nbodykit.io.base import FileType
class NPYFile(FileType):
    """
    A file-like object to read numpy ``.npy`` files
    """
    def __init__(self, path):
        self.path = path
        self.attrs = {}
        # load the data and set size and dtype
        self._data = numpy.load(self.path)
        self.size = len(self._data) # total size
        self.dtype = self._data.dtype # data dtype

    def read(self, columns, start, stop, step=1):
        """
        Read the specified column(s) over the given range
        """
        return self._data[start:stop:step]
And now generate the subclass of CatalogSource
:
[12]:
from nbodykit.source.catalog.file import FileCatalogFactory
NPYCatalog = FileCatalogFactory('NPYCatalog', NPYFile)
And finally, we generate some fake data, save it to a .npy
file,
and then load it with our new NPYCatalog
class:
[13]:
# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)
# save to a npy file
numpy.save("npy-example.npy", data)
# and now load the data
f = NPYCatalog("npy-example.npy")
print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
NPYCatalog(size=1024, FileStack(NPYFile(path=/tmp/tmpedljiijj/npy-example.npy, dataset=None, ncolumns=2, shape=(1024,)>, ... 1 files))
columns = ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size = 1024
This toy example illustrates how custom data formats can be incorporated
into nbodykit, but users should take care to optimize their storage
solutions for more complex applications. In particular, storage formats
that store data in column-major order and allow slices to be read from
arbitrary locations should be favored, since this enables large speed-ups
when reading data in parallel. By contrast, our simple toy class
NPYFile
reads the entirety of the data before returning
the requested slice in its read()
function; in general, this should be
avoided if at all possible.
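For comparison, numpy's own memory-mapping machinery shows what slice-friendly storage buys you; the file name below is again just for illustration:

```python
import numpy

# save one large column to disk (hypothetical file name)
data = numpy.arange(10000, dtype='f8')
data.tofile('memmap-example.dat')

# a memory map presents the file as an array but defers the actual
# I/O: slicing near the end touches only those bytes, instead of
# loading the whole file first as the toy NPYFile.read() does
mm = numpy.memmap('memmap-example.dat', dtype='f8', mode='r')
tail = numpy.array(mm[-10:])   # copy only the last 10 values
print(tail[0])
```

A read() implementation built on a memory map like this one would scale to files far larger than memory, and would let parallel readers each touch only their own slice.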