Reading Catalogs from Disk

Supported Data Formats

nbodykit provides support for initializing CatalogSource objects by reading tabular data stored on disk in a variety of formats:

  • plaintext (e.g. CSV) data, via CSVCatalog

  • binary data, via BinaryCatalog

  • HDF5 data, via HDFCatalog

  • bigfile data, via BigFileCatalog

  • FITS data, via FITSCatalog

In this section, we provide short examples illustrating how to read data stored in each of these formats. If your data format is not currently supported, you can either read the data in yourself and adapt it into a catalog (see Adapting In-Memory Data to a Catalog), or write your own Catalog class (see Reading a Custom Data Format).

Plaintext Data

Reading data stored as columns in plaintext files is supported via the CSVCatalog class. This class reads the data with the pandas.read_csv() function and accepts any of the configuration keywords that function does. The file is internally partitioned into chunks, and data is read only from the relevant chunks; this partitioning provides a significant speed-up when reading from the end of the file, since the preceding data does not have to be read first.

Caveats

  • By default, the class reads space-separated columns; this can be changed by setting delim_whitespace=False and specifying the delimiter keyword (see the sketch after this list).

  • A pandas index column is not supported – all columns should represent data columns to read.

  • Commented lines in the file are not supported – please remove all comments from the file before loading into nbodykit.

  • There should not be a header line in the file – column names should be passed to CSVCatalog via the names argument.
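
For instance, a minimal sketch of reading a comma-separated file; the file name data.csv and its column names are hypothetical, and the extra keywords are passed through to pandas.read_csv():

[ ]:
# hypothetical comma-separated file; delim_whitespace and delimiter
# are forwarded to pandas.read_csv()
f = CSVCatalog('data.csv', names=['x', 'y', 'z'],
               delim_whitespace=False, delimiter=',')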

As an example, below we generate 5 columns for 100 fake objects and write to a plaintext file:

[2]:
import numpy
from nbodykit.source.catalog import CSVCatalog

# generate some fake ASCII data
data = numpy.random.random(size=(100,5))

# save to a plaintext file
numpy.savetxt('csv-example.txt', data, fmt='%.7e')

# name each of the 5 input columns
names = ['x', 'y', 'z', 'w', 'v']

# read the data
f = CSVCatalog('csv-example.txt', names)

# combine x, y, z to Position, and add boxsize
f['Position'] = f['x'][:, None] * [1, 0, 0] + f['y'][:, None] * [0, 1, 0] + f['z'][:, None] * [0, 0, 1]
f.attrs['BoxSize'] = 1.0

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example.txt, dataset=*, ncolumns=5, shape=(100,)>, ... 1 files))
columns =  ['Position', 'Selection', 'Value', 'Weight', 'v', 'w', 'x', 'y', 'z']
total size =  100

Binary Data

The BinaryCatalog object reads binary data that is stored on disk in a column-major format. The class can read any numpy data type and can handle arbitrary byte offsets between columns (a sketch follows the caveats below).

Caveats

  • Columns must be stored in consecutive order in the binary file (column-major format).
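
For instance, a hedged sketch of handling a file header: we assume here the header_size and offsets keywords of the underlying BinaryFile class, which BinaryCatalog forwards to it.

[ ]:
# a sketch (not executed): skip a hypothetical 64-byte header before the
# first column; header_size and offsets are assumed keywords of the
# underlying BinaryFile class
f = BinaryCatalog('binary-example.dat',
                  [('Position', ('f8', 3)), ('Velocity', ('f8', 3))],
                  size=1024, header_size=64)

# explicit byte offsets per column can instead be given via the offsets
# keyword (assumed to be a dict mapping column name to byte offset)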

For example, below we save Position and Velocity columns to a binary file and load them into a BinaryCatalog:

[3]:
from nbodykit.source.catalog import BinaryCatalog

# generate some fake data and save to a binary file
with open('binary-example.dat', 'wb') as ff:
    pos = numpy.random.random(size=(1024, 3)) # fake Position column
    vel = numpy.random.random(size=(1024, 3)) # fake Velocity column
    pos.tofile(ff); vel.tofile(ff); ff.seek(0)

# create the binary catalog
f = BinaryCatalog(ff.name, [('Position', ('f8', 3)), ('Velocity', ('f8', 3))], size=1024)

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BinaryCatalog(size=1024, FileStack(BinaryFile(path=/tmp/tmpedljiijj/binary-example.dat, dataset=*, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size =  1024

HDF Data

The HDFCatalog object uses the h5py module to read HDF5 files. The class supports reading columns stored in h5py.Dataset objects and in h5py.Group objects, assuming that all arrays are of the same length since catalog objects must have a fixed size. Columns stored in different datasets or groups can be accessed via their full path in the HDF5 file.

Caveats

  • HDFCatalog attempts to load all possible datasets or groups from the HDF5 file. This can present problems if the datasets have different lengths. Use the exclude keyword to explicitly exclude data that has the wrong size, as sketched below.
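
For instance, a minimal sketch, where 'BadData' is a hypothetical dataset name standing in for data of the wrong length:

[ ]:
# exclude takes a list of dataset names to skip;
# 'BadData' is a hypothetical name
f = HDFCatalog('hdf-example.hdf5', exclude=['BadData'])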

In the example below, we load fake data from both the dataset “Data1” and from the group “Data2” in an example HDF5 file. “Data1” is a single structured numpy array with Position and Mass columns, while “Data2” is a group storing the Position and Mass columns as separate datasets. nbodykit is able to load both types of data from HDF5 files, and the corresponding column names are the full paths of the data in the file.

[4]:
import h5py
from nbodykit.source.catalog import HDFCatalog

# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)

# write to a HDF5 file
with h5py.File('hdf-example.hdf5', 'w') as ff:
    ff.create_dataset('Data1', data=dset)
    grp = ff.create_group('Data2')
    grp.create_dataset('Position', data=dset['Position']) # column as dataset
    grp.create_dataset('Mass', data=dset['Mass']) # column as dataset

# initialize the catalog
f = HDFCatalog('hdf-example.hdf5')

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
HDFCatalog(size=1024, FileStack(HDFFile(path=/tmp/tmpedljiijj/hdf-example.hdf5, dataset=/, ncolumns=4, shape=(1024,)>, ... 1 files))
columns =  ['Data1/Mass', 'Data1/Position', 'Data2/Mass', 'Data2/Position', 'Selection', 'Value', 'Weight']
total size =  1024

Bigfile Data

The bigfile package is a massively parallel IO library for large, hierarchical datasets, and nbodykit supports reading data stored in this format using BigFileCatalog.

Caveats

  • Similar to the HDFCatalog class, datasets of the wrong size stored in a bigfile format should be explicitly excluded using the exclude keyword, as sketched below.
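
For instance, a minimal sketch (again, 'BadData' is a hypothetical dataset name):

[ ]:
# as with HDFCatalog, exclude takes a list of dataset names to skip
f = BigFileCatalog('bigfile-example', header='Header', exclude=['BadData'])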

Below, we load Position and Velocity columns, stored in the bigfile format:

[5]:
import bigfile
from nbodykit.source.catalog import BigFileCatalog

# generate some fake data
data = numpy.empty(512, dtype=[('Position', ('f8', 3)), ('Velocity', ('f8', 3))])
data['Position'] = numpy.random.random(size=(512, 3))
data['Velocity'] = numpy.random.random(size=(512, 3))

# save fake data to a BigFile
with bigfile.BigFile('bigfile-example', create=True) as tmpff:
    with tmpff.create("Position", dtype=('f4', 3), size=512) as bb:
        bb.write(0, data['Position'])
    with tmpff.create("Velocity", dtype=('f4', 3), size=512) as bb:
        bb.write(0, data['Velocity'])
    with tmpff.create("Header") as bb:
        bb.attrs['Size'] = 512.

# initialize the catalog
f = BigFileCatalog('bigfile-example', header='Header')

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
BigFileCatalog(size=512, FileStack(BigFile(path=/tmp/tmpedljiijj/bigfile-example, dataset=./, ncolumns=2, shape=(512,)>, ... 1 files))
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size =  512

FITS Data

The FITS data format is supported via the FITSCatalog object. nbodykit relies on the fitsio package to perform the read operation.

Caveats

  • The FITS file must contain a readable binary table of data.

  • Specific extensions to read can be passed via the ext keyword. By default, data is read from the first HDU that has readable data.

For example, below we load Position and Velocity data from a FITS file:

[6]:
import fitsio
from nbodykit.source.catalog import FITSCatalog

# generate some fake data
dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = numpy.random.random(size=(1024, 3))
dset['Mass'] = numpy.random.random(size=1024)

# write to a FITS file using fitsio
fitsio.write('fits-example.fits', dset, extname='Data')

# initialize the catalog
f = FITSCatalog('fits-example.fits', ext='Data')

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
FITSCatalog(size=1024, FileStack(FITSFile(path=/tmp/tmpedljiijj/fits-example.fits, dataset=Data, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024

Reading Multiple Data Files at Once

CatalogSource objects support reading multiple files at once, providing a continuous view of the individual files stacked together. Every file must contain the same data types; otherwise, the data cannot be combined into a single catalog.

This is particularly useful when data is split across multiple files in a single directory, as is often the case when processing large amounts of data. For example, binary snapshots output by N-body simulations, often totaling 10–100 GB in size, can be read into a single BinaryCatalog with nbodykit.

When specifying multiple files to load, the user can pass either an explicit list of file names or a glob pattern containing an asterisk. As an example, below we read data from two plaintext files into a single CSVCatalog:

[7]:
# generate data
data = numpy.random.random(size=(100,5))

# save first 40 rows of data to file
numpy.savetxt('csv-example-1.txt', data[:40], fmt='%.7e')

# save the remaining 60 rows to another file
numpy.savetxt('csv-example-2.txt', data[40:], fmt='%.7e')

Using a glob pattern

[8]:
# the names of the columns in both files
names = ['a', 'b', 'c', 'd', 'e']

# read with a glob pattern
f = CSVCatalog('csv-example-*', names)

print(f)

# combined catalog size is 40+60=100
print("total size = ", f.csize)

CSVCatalog(size=100, FileStack(CSVFile(path=/tmp/tmpedljiijj/csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size =  100

Using a list of file names

[9]:
# the names of the columns in both files
names = ['a', 'b', 'c', 'd', 'e']

# read with a list of the file names
f = CSVCatalog(['csv-example-1.txt', 'csv-example-2.txt'], names)

print(f)

# combined catalog size is 40+60=100
print("total size = ", f.csize)
CSVCatalog(size=100, FileStack(CSVFile(path=csv-example-1.txt, dataset=*, ncolumns=5, shape=(40,)>, ... 2 files))
total size =  100

Adapting In-Memory Data to a Catalog

A lightweight way of reading in data in a format that nbodykit does not understand is to read the data with existing tools and then adapt it into a catalog via nbodykit.lab.ArrayCatalog.

Here is an example that loads a file with numpy.load(), then makes the data into a catalog.

[10]:
from nbodykit.source.catalog import ArrayCatalog

# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)

# save to a .npy file
numpy.save("npy-example.npy", data)

data = numpy.load("npy-example.npy")

# initialize the catalog
f = ArrayCatalog(data)

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)


# a catalog can also be initialized from a dict of arrays
f = ArrayCatalog({'Position' : data['Position'], 'Mass' : data['Mass']})

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
ArrayCatalog(size=1024)
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024
ArrayCatalog(size=1024)
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024

Reading a Custom Data Format

Users can implement their own subclasses of CatalogSource for reading custom data formats in a few easy steps. The core functionality of the CatalogSource classes described in this section uses the nbodykit.io module for reading data from disk. This module implements the nbodykit.io.base.FileType base class, an abstract class that behaves like a file-like object. For the built-in file formats discussed in this section, we have implemented the following subclasses of FileType in the nbodykit.io module: CSVFile, BinaryFile, BigFile, HDFFile, and FITSFile.

To make a valid subclass of FileType, users must:

  1. Implement the read() function that reads a range of the data from disk.

  2. Set the size in the __init__() function, specifying the total size of the data on disk.

  3. Set the dtype in the __init__() function, specifying the type of data stored on disk.

Once the custom subclass is implemented, the nbodykit.source.catalog.file.FileCatalogFactory() function can be used to automatically generate a custom CatalogSource class from it.

As a toy example, we will illustrate how this is done for data saved using the numpy .npy format. First, we will implement our subclass of the FileType class:

[11]:
from nbodykit.io.base import FileType

class NPYFile(FileType):
    """
    A file-like object to read numpy ``.npy`` files
    """
    def __init__(self, path):
        self.path = path
        self.attrs = {}
        # load the data and set size and dtype
        self._data = numpy.load(self.path)
        self.size = len(self._data) # total size
        self.dtype = self._data.dtype # data dtype

    def read(self, columns, start, stop, step=1):
        """
        Read the specified column(s) over the given range
        """
        # select the requested columns, then the requested range of rows
        return self._data[columns][start:stop:step]

Next, we generate the subclass of CatalogSource:

[12]:
from nbodykit.source.catalog.file import FileCatalogFactory

NPYCatalog = FileCatalogFactory('NPYCatalog', NPYFile)

And finally, we generate some fake data, save it to a .npy file, and then load it with our new NPYCatalog class:

[13]:
# generate the fake data
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)

# save to a .npy file
numpy.save("npy-example.npy", data)

# and now load the data
f = NPYCatalog("npy-example.npy")

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)
NPYCatalog(size=1024, FileStack(NPYFile(path=/tmp/tmpedljiijj/npy-example.npy, dataset=None, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024

This toy example illustrates how custom data formats can be incorporated into nbodykit, but users should take care to optimize their storage solutions for more complex applications. In particular, favor storage formats that store data in a column-major layout and allow slices to be read from arbitrary locations, since this enables large speed-ups when reading data in parallel. By contrast, our simple toy class NPYFile loads the entirety of the data in its constructor before returning a slice of it in read(). In general, this should be avoided whenever possible.
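
As a minimal sketch of one way around this, the file could instead be memory-mapped with numpy.load(..., mmap_mode='r'), so that read() only touches the requested rows. The class names below are our own, not part of nbodykit:

[ ]:
class LazyNPYFile(NPYFile):
    """
    Like NPYFile, but memory-maps the ``.npy`` file so that
    ``read()`` only reads the requested rows from disk.
    """
    def __init__(self, path):
        self.path = path
        self.attrs = {}
        # mmap_mode='r' returns a numpy.memmap; no rows are read yet
        self._data = numpy.load(self.path, mmap_mode='r')
        self.size = len(self._data) # total size
        self.dtype = self._data.dtype # data dtype

    def read(self, columns, start, stop, step=1):
        # slicing the memmap reads only the rows in start:stop from disk;
        # numpy.asarray materializes the slice as a regular in-memory array
        return numpy.asarray(self._data[columns][start:stop:step])

LazyNPYCatalog = FileCatalogFactory('LazyNPYCatalog', LazyNPYFile)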