Reading Catalogs from Disk

Supported Data Formats

nbodykit provides support for initializing CatalogSource objects by reading tabular data stored on disk in a variety of formats: plaintext (CSV), raw binary, HDF5, bigfile, and FITS.

In this section, we provide short examples illustrating how to read data stored in each of these formats. If your data format is not currently supported, please see Reading a Custom Data Format.

Plaintext Data

Reading data stored as columns in plaintext files is supported via the CSVCatalog class. This class partitions the CSV file into chunks and reads only the relevant chunks, using the pandas.read_csv() function. Any configuration keyword accepted by that function can be passed through. The partitioning step provides a significant speed-up when reading from the end of the file, since the preceding data does not need to be read first.

Caveats

  • By default, the class reads space-separated columns, but this can be changed by setting delim_whitespace=False and specifying the delimiter keyword.
  • A pandas index column is not supported – all columns should represent data columns to read.
  • Commented lines in the file are not supported – please remove all comments from the file before loading into nbodykit.
  • There should not be a header line in the file – column names should be passed to CSVCatalog via the names argument.
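Because CSVCatalog passes configuration keywords through to pandas.read_csv(), a comma-separated file can be read by overriding the whitespace defaults. The sketch below shows the equivalent pandas call on a small in-memory example (with CSVCatalog itself, you would pass the same names, delim_whitespace=False, and delimiter keywords to the constructor):

```python
import io
import pandas

# a small comma-separated "file" with no header line and no index column
text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"
names = ['a', 'b', 'c']

# the keywords that CSVCatalog would forward to pandas.read_csv();
# an explicit delimiter overrides the space-separated default
df = pandas.read_csv(io.StringIO(text), names=names, delimiter=',')

print(df['a'].tolist())  # → [1.0, 4.0]
```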

As an example, below we generate 5 columns for 100 fake objects and write to a plaintext file:

In [1]: import numpy

In [2]: from nbodykit.source.catalog import CSVCatalog

# generate some fake ASCII data
In [3]: data = numpy.random.random(size=(100,5))

# save to a plaintext file
In [4]: numpy.savetxt('csv-example.txt', data, fmt='%.7e')

# name each of the 5 input columns
In [5]: names = ['a', 'b', 'c', 'd', 'e']

# read the data
In [6]: f = CSVCatalog('csv-example.txt', names)

In [7]: print(f)
CSVCatalog(size=100, file='csv-example.txt')

In [8]: print("columns = ", f.columns) # default Value, Weight, Selection also present
columns =  ['Selection', 'Value', 'Weight', 'a', 'b', 'c', 'd', 'e']

In [9]: print("total size = ", f.csize)
total size =  100

Binary Data

The BinaryCatalog object reads binary data that is stored on disk in a column-major format. The class can read any numpy data type and can handle arbitrary byte offsets between columns.

Caveats

  • Columns must be stored in consecutive order in the binary file (column-major format).
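To make the expected layout concrete, here is a minimal sketch (plain numpy, no nbodykit required) of the column-major storage that BinaryCatalog reads: each column is written contiguously, one after another, and a later column is reached by seeking past the bytes of the earlier ones — which is also how byte offsets between columns arise. The file name here is just for illustration:

```python
import numpy

# fake Position and Velocity columns, 4 rows each
pos = numpy.arange(12, dtype='f8').reshape(4, 3)
vel = pos + 100.0

# column-major layout: all of Position first, then all of Velocity
with open('layout-example.dat', 'wb') as ff:
    pos.tofile(ff)
    vel.tofile(ff)

# reading Velocity alone only requires seeking past the Position bytes
with open('layout-example.dat', 'rb') as ff:
    ff.seek(pos.nbytes)  # byte offset where the Velocity column starts
    vel2 = numpy.fromfile(ff, dtype='f8', count=vel.size).reshape(4, 3)

print(numpy.allclose(vel, vel2))  # → True
```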

For example, below we save Position and Velocity columns to a binary file and load them into a BinaryCatalog:

In [10]: from nbodykit.source.catalog import BinaryCatalog

# generate some fake data and save to a binary file
In [11]: with open('binary-example.dat', 'wb') as ff:
   ....:     pos = numpy.random.random(size=(1024, 3)) # fake Position column
   ....:     vel = numpy.random.random(size=(1024, 3)) # fake Velocity column
   ....:     pos.tofile(ff); vel.tofile(ff); ff.seek(0)
   ....: 

# create the binary catalog
In [12]: f = BinaryCatalog(ff.name, [('Position', ('f8', 3)), ('Velocity', ('f8', 3))], size=1024)

In [13]: print(f)
BinaryCatalog(size=1024, file='binary-example.dat')

In [14]: print("columns = ", f.columns) # default Value, Weight, Selection also present
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']

In [15]: print("total size = ", f.csize)
total size =  1024

HDF Data

The HDFCatalog object uses the h5py module to read HDF5 files. The class supports reading columns stored in h5py.Dataset objects and in h5py.Group objects, assuming that all arrays are of the same length since catalog objects must have a fixed size. Columns stored in different datasets or groups can be accessed via their full path in the HDF5 file.

Caveats

  • HDFCatalog attempts to load all possible datasets or groups from the HDF5 file. This can present problems if the data has different lengths. Use the exclude keyword to explicitly exclude data that has the wrong size.
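One way to build that exclude list, sketched below with h5py directly (the file and dataset names are hypothetical), is to walk the file and collect any dataset whose length differs from the catalog size before constructing the catalog:

```python
import h5py
import numpy

# a file where one dataset has a mismatched length
with h5py.File('exclude-example.hdf5', 'w') as ff:
    ff.create_dataset('Mass', data=numpy.random.random(1024))
    ff.create_dataset('Header/Attrs', data=numpy.arange(8))  # wrong size

# collect the full paths of datasets whose length is not the catalog size
size = 1024
exclude = []
with h5py.File('exclude-example.hdf5', 'r') as ff:
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and len(obj) != size:
            exclude.append(name)
    ff.visititems(visit)

print(exclude)  # → ['Header/Attrs']
# then: f = HDFCatalog('exclude-example.hdf5', exclude=exclude)
```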

In the example below, we load fake data from both the dataset “Data1” and the group “Data2” in an example HDF5 file. “Data1” is a single structured numpy array with Position and Mass columns, while “Data2” is a group storing the Position and Mass columns separately. nbodykit is able to load both types of data from HDF5 files, and the corresponding column names are the full paths of the data in the file.

In [16]: import h5py

In [17]: from nbodykit.source.catalog import HDFCatalog

# generate some fake data
In [18]: dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])

In [19]: dset['Position'] = numpy.random.random(size=(1024, 3))

In [20]: dset['Mass'] = numpy.random.random(size=1024)

# write to a HDF5 file
In [21]: with h5py.File('hdf-example.hdf5', 'w') as ff:
   ....:     ff.create_dataset('Data1', data=dset)
   ....:     grp = ff.create_group('Data2')
   ....:     grp.create_dataset('Position', data=dset['Position']) # column as dataset
   ....:     grp.create_dataset('Mass', data=dset['Mass']) # column as dataset
   ....: 

# initialize the catalog
In [22]: f = HDFCatalog('hdf-example.hdf5')

In [23]: print(f)
HDFCatalog(size=1024, file='hdf-example.hdf5')

In [24]: print("columns = ", f.columns) # default Value, Weight, Selection also present
columns =  ['Data1/Mass', 'Data1/Position', 'Data2/Mass', 'Data2/Position', 'Selection', 'Value', 'Weight']

In [25]: print("total size = ", f.csize)
total size =  1024

Bigfile Data

The bigfile package is a massively parallel IO library for large, hierarchical datasets, and nbodykit supports reading data stored in this format using BigFileCatalog.

Caveats

  • Similar to the HDFCatalog class, datasets of the wrong size stored in a bigfile format should be explicitly excluded using the exclude keyword.

Below, we load Position and Velocity columns, stored in the bigfile format:

In [26]: import bigfile

In [27]: from nbodykit.source.catalog import BigFileCatalog

# generate some fake data
In [28]: data = numpy.empty(512, dtype=[('Position', ('f8', 3)), ('Velocity', ('f8',3))])

In [29]: data['Position'] = numpy.random.random(size=(512, 3))

In [30]: data['Velocity'] = numpy.random.random(size=(512,3))

# save fake data to a BigFile
In [31]: with bigfile.BigFile('bigfile-example', create=True) as tmpff:
   ....:     with tmpff.create("Position", dtype=('f4', 3), size=512) as bb:
   ....:         bb.write(0, data['Position'])
   ....:     with tmpff.create("Velocity", dtype=('f4', 3), size=512) as bb:
   ....:         bb.write(0, data['Velocity'])
   ....:     with tmpff.create("Header") as bb:
   ....:         bb.attrs['Size'] = 512.
   ....: 

# initialize the catalog
In [32]: f = BigFileCatalog('bigfile-example', header='Header')

In [33]: print(f)
BigFileCatalog(size=512, file='bigfile-example')

In [34]: print("columns = ", f.columns) # default Value, Weight, Selection also present
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']

In [35]: print("total size = ", f.csize)
total size =  512

FITS Data

The FITS data format is supported via the FITSCatalog object. nbodykit relies on the fitsio package to perform the read operation.

Caveats

  • The FITS file must contain a readable binary table of data.
  • Specific extensions to read can be passed via the ext keyword. By default, data is read from the first HDU that has readable data.

For example, below we load Position and Velocity data from a FITS file:

In [36]: import fitsio

In [37]: from nbodykit.source.catalog import FITSCatalog

# generate some fake data
In [38]: dset = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])

In [39]: dset['Position'] = numpy.random.random(size=(1024, 3))

In [40]: dset['Mass'] = numpy.random.random(size=1024)

# write to a FITS file using fitsio
In [41]: fitsio.write('fits-example.fits', dset, extname='Data')

# initialize the catalog
In [42]: f = FITSCatalog('fits-example.fits', ext='Data')

In [43]: print(f)
FITSCatalog(size=1024, file='fits-example.fits')

In [44]: print("columns = ", f.columns) # default Value, Weight, Selection also present
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']

In [45]: print("total size = ", f.csize)
total size =  1024

Reading Multiple Data Files at Once

CatalogSource objects support reading multiple files at once, providing a continuous view of the individual catalogs stacked together. Every file read must contain the same data type; otherwise, the data cannot be combined into a single catalog.

This becomes particularly useful when data is split across multiple files in a single directory, as is often the case when processing large amounts of data. For example, binary snapshots output by N-body simulations, often totaling 10 GB to 100 GB in size, can be read into a single BinaryCatalog with nbodykit.

When specifying multiple files to load, the user can use either an explicit list of file names or use an asterisk glob pattern to match files. As an example, below, we read data from two plaintext files into a single CSVCatalog:

# generate data
In [46]: data = numpy.random.random(size=(100,5))

# save first 40 rows of data to file
In [47]: numpy.savetxt('csv-example-1.txt', data[:40], fmt='%.7e')

# save the remaining 60 rows to another file
In [48]: numpy.savetxt('csv-example-2.txt', data[40:], fmt='%.7e')

Using a glob pattern

# the names of the columns in both files
In [49]: names = ['a', 'b', 'c', 'd', 'e']

# read with a glob pattern
In [50]: f = CSVCatalog('csv-example-*', names)

In [51]: print(f)
CSVCatalog(size=100, file='csv-example-*')

# combined catalog size is 40+60=100
In [52]: print("total size = ", f.csize)
total size =  100

Using a list of file names

# the names of the columns in both files
In [53]: names = ['a', 'b', 'c', 'd', 'e']

# read with a list of the file names
In [54]: f = CSVCatalog(['csv-example-1.txt', 'csv-example-2.txt'], names)

In [55]: print(f)
CSVCatalog(size=100, nfiles=2)

# combined catalog size is 40+60=100
In [56]: print("total size = ", f.csize)
total size =  100

Reading a Custom Data Format

Users can implement their own subclasses of CatalogSource for reading custom data formats with a few easy steps. The core functionality of the CatalogSource classes described in this section use the nbodykit.io module for reading data from disk. This module implements the nbodykit.io.base.FileType base class, which is an abstract class that behaves like a file-like object. For the built-in file formats discussed in this section, we have implemented the following subclasses of FileType in the nbodykit.io module: CSVFile, BinaryFile, BigFile, HDFFile, and FITSFile.

To make a valid subclass of FileType, users must:

  1. Implement the read() function that reads a range of the data from disk.
  2. Set the size in the __init__() function, specifying the total size of the data on disk.
  3. Set the dtype in the __init__() function, specifying the type of data stored on disk.

Once we have the custom subclass implemented, the nbodykit.source.catalog.file.FileCatalogFactory() function can be used to automatically create a custom CatalogSource object from the subclass.

As a toy example, we will illustrate how this is done for data saved using the numpy .npy format. First, we will implement our subclass of the FileType class:

In [57]: from nbodykit.io.base import FileType

In [58]: class NPYFile(FileType):
   ....:     """
   ....:     A file-like object to read numpy ``.npy`` files
   ....:     """
   ....:     def __init__(self, path):
   ....:         self.path = path
   ....:         self.attrs = {}
   ....:         self._data = numpy.load(self.path)
   ....:         self.size = len(self._data) # total size
   ....:         self.dtype = self._data.dtype # data dtype
   ....:     def read(self, columns, start, stop, step=1):
   ....:         """
   ....:         Read the specified column(s) over the given range
   ....:         """
   ....:         # for simplicity, the whole structured-array slice
   ....:         # (all columns) is returned here
   ....:         return self._data[start:stop:step]
   ....: 

And now generate the subclass of CatalogSource:

In [59]: from nbodykit.source.catalog.file import FileCatalogFactory

In [60]: NPYCatalog = FileCatalogFactory('NPYCatalog', NPYFile)

And finally, we generate some fake data, save it to a .npy file, and then load it with our new NPYCatalog class:

# generate the fake data
In [61]: data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])

In [62]: data['Position'] = numpy.random.random(size=(1024, 3))

In [63]: data['Mass'] = numpy.random.random(size=1024)

# save to a npy file
In [64]: numpy.save("npy-example.npy", data)

# and now load the data
In [65]: f = NPYCatalog("npy-example.npy")

In [66]: print(f)
NPYCatalog(size=1024, file='npy-example.npy')

In [67]: print("columns = ", f.columns) # default Value, Weight, Selection also present
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']

In [68]: print("total size = ", f.csize)
total size =  1024

This toy example illustrates how custom data formats can be incorporated into nbodykit, but users should take care to optimize their storage solutions for more complex applications. In particular, favor storage formats that hold data in column-major order and allow slices to be read from arbitrary locations, since this enables large speed-ups when reading data in parallel. By contrast, our simple toy class NPYFile loads the entire file before returning a slice in its read() function. In general, this should be avoided whenever possible.
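A simple way to avoid the full read in a toy class like NPYFile, assuming the data is stored in the .npy format, is to memory-map the file with numpy.load(mmap_mode='r'), so that slicing only pulls the requested rows from disk. The sketch below (file name hypothetical) shows the idea; in the class, this would amount to passing mmap_mode='r' to numpy.load() in __init__():

```python
import numpy

# save some fake structured data to a .npy file
data = numpy.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
data['Position'] = numpy.random.random(size=(1024, 3))
data['Mass'] = numpy.random.random(size=1024)
numpy.save('npy-mmap-example.npy', data)

# memory-map instead of loading the whole array into memory
mm = numpy.load('npy-mmap-example.npy', mmap_mode='r')

# materialize only the requested slice; the rest of the file is not read
chunk = numpy.asarray(mm[100:200])
print(chunk.shape)  # → (100,)
```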