nbodykit.io.binary

Functions

getsize(filename, header_size, rowsize)

The default method to determine the size of the binary file

Classes

BinaryFile(path, dtype[, offsets, ...])

A file object to handle the reading of columns of data from a binary file.

class nbodykit.io.binary.BinaryFile(path, dtype, offsets=None, header_size=0, size=None)[source]

A file object to handle the reading of columns of data from a binary file.

Warning

This assumes the data is stored in a column-major format, i.e., all of the values of the first column are stored contiguously, followed by all of the values of the second column, and so on

Parameters
  • path (str) – the name of the binary file to load

  • dtype (numpy.dtype or list of tuples) – the dtypes of the columns to load; this should either be a numpy.dtype or an object that can be converted to one via a numpy.dtype() call

  • offsets (dict, optional) – a dictionary specifying the byte offsets of each column in the binary file; if not supplied, the offsets are inferred from the dtype size of each column, assuming a fixed header size and contiguous storage

  • header_size (int, optional) – the size of the header in bytes

  • size (int, optional) – the number of objects in the binary file; if not provided, the value is inferred from the dtype and the total size of the file in bytes
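
As a hedged sketch of typical usage (the file name, column names, and sizes here are illustrative, not part of the API): write a small column-major file with numpy, where the values of each column are stored contiguously, and then open it by passing a matching dtype.

>>> import numpy
>>> from nbodykit.io.binary import BinaryFile
>>> N = 1000
>>> with open('data.bin', 'wb') as ff:
...     numpy.random.uniform(0, 360, size=N).astype('f4').tofile(ff)  # all 'ra' values first
...     numpy.random.uniform(-90, 90, size=N).astype('f4').tofile(ff) # then all 'dec' values
...     numpy.random.uniform(0, 1, size=N).astype('f4').tofile(ff)    # then all 'z' values
>>> f = BinaryFile('data.bin', dtype=[('ra', 'f4'), ('dec', 'f4'), ('z', 'f4')])
>>> f.size   # inferred from the total file size and the dtype
1000
>>> f.columns
['ra', 'dec', 'z']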

Attributes
columns

A list of the names of the columns in the file.

dtype

A numpy.dtype object holding the data types of each column in the file.

ncol

The number of data columns in the file.

ndim

The number of dimensions of the file, equal to the length of shape.

shape

The shape of the file, which defaults to (size,)

size

The size of the file, i.e., number of rows

Methods

asarray()

Return a view of the file, where the fields of the structured array are stacked in columns of a single numpy array

get_dask(column[, blocksize])

Return the specified column as a dask array, which delays the explicit reading of the data until dask.compute() is called

keys()

Aliased function to return columns

read(columns, start, stop[, step])

Read the specified column(s) over the given range

__getitem__(s)

This function provides numpy-like array indexing of the file object.

It supports:

  1. integer, slice-indexing similar to arrays

  2. string indexing using column names in keys()

  3. array-like indexing using integer lists or boolean arrays

Note

If a single column is being returned, a numpy array holding the data is returned, rather than a structured array with only a single field.
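
A brief sketch of the three indexing styles, reusing the illustrative file object f with columns ra, dec, and z from the construction example above:

>>> rows = f[10:20]          # slice indexing: a structured array of rows 10 through 19
>>> ra = f['ra']             # string indexing: a view of the file with only the 'ra' column
>>> first3 = ra[:3]          # slicing a single column yields a plain numpy array
>>> some = f[[0, 5, 9]]      # integer-list indexing: the selected rows
>>> mask = f['z'][:] > 0.5   # build a boolean mask over all rows
>>> high_z = f[mask]         # boolean-array indexing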

asarray()

Return a view of the file, where the fields of the structured array are stacked in columns of a single numpy array

Examples

Start with a file object ff with three named columns, ra, dec, and z

>>> ff.dtype
dtype([('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')])
>>> ff.shape
(1000,)
>>> ff.columns
['ra', 'dec', 'z']
>>> ff[:3]
array([(235.63442993164062, 59.39099884033203, 0.6225500106811523),
       (140.36181640625, -1.162310004234314, 0.5026500225067139),
       (129.96627807617188, 45.970130920410156, 0.4990200102329254)],
      dtype=(numpy.record, [('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')]))

Select a subset of columns, switch their ordering, and convert the output to a single numpy array

>>> x = ff[['dec', 'ra']].asarray()
>>> x.dtype
dtype('float32')
>>> x.shape
(1000, 2)
>>> x.columns
['dec', 'ra']
>>> x[:3]
array([[  59.39099884,  235.63442993],
       [  -1.16231   ,  140.36181641],
       [  45.97013092,  129.96627808]], dtype=float32)

Now, select only the first column (dec)

>>> dec = x[:,0]
>>> dec[:3]
array([ 59.39099884,  -1.16231   ,  45.97013092], dtype=float32)

Returns

a file object that will return a numpy array with the columns representing the fields

Return type

FileType

property columns

A list of the names of the columns in the file.

This defaults to the named fields in the file’s dtype attribute, but may differ from this if a view of the file has been returned with asarray()

property dtype

A numpy.dtype object holding the data types of each column in the file.

get_dask(column, blocksize=None)

Return the specified column as a dask array, which delays the explicit reading of the data until dask.compute() is called

The dask array is chunked into blocks of size blocksize

Parameters
  • column (str) – the name of the column to return

  • blocksize (int, optional) – the size of the chunks in the dask array

Returns

the dask array holding the column, which encodes the functions needed to read the data but delays evaluation until the user calls dask.compute()

Return type

dask.array.Array
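
A hedged usage sketch, again with the illustrative file object f from above:

>>> ra = f.get_dask('ra', blocksize=100)
>>> mean_ra = ra.mean()         # builds a task graph; no data has been read yet
>>> result = mean_ra.compute()  # the file is read, in chunks of 100 rows, only now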

keys()

Aliased function to return columns

property ncol

The number of data columns in the file.

read(columns, start, stop, step=1)[source]

Read the specified column(s) over the given range

‘start’ and ‘stop’ should be between 0 and size, which is the total size of the binary file (in particles)

Parameters
  • columns (str, list of str) – the name of the column(s) to return

  • start (int) – the row index at which to start reading

  • stop (int) – the row index at which to stop reading

  • step (int, optional) – the step size to use when reading; default is 1

Returns

structured array holding the requested columns over the specified range of rows

Return type

numpy.array
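
For example, with the illustrative file object f from above, reading the first 100 rows of two columns looks like:

>>> data = f.read(['ra', 'dec'], 0, 100)
>>> data.dtype.names
('ra', 'dec')
>>> data.shape
(100,)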

property shape

The shape of the file, which defaults to (size,)

Multiple dimensions can be introduced into the shape if a view of the file has been returned with asarray()

property size

The size of the file, i.e., number of rows

nbodykit.io.binary.getsize(filename, header_size, rowsize)[source]

The default method to determine the size of the binary file

The “size” is defined as the number of rows, where each row has a size of rowsize bytes.

Notes

  • This assumes the input file is not compressed

  • This function does not depend on the layout of the binary file, i.e., on whether or not the data is actually stored in rows

Raises

ValueError – if the function determines a fractional number of rows

Parameters
  • filename (str) – the name of the binary file

  • header_size (int) – the size of the header in bytes, which will be skipped when determining the number of rows

  • rowsize (int) – the size of the data in each row in bytes
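
As a minimal sketch of the computation this function performs (the helper name getsize_sketch is hypothetical, and the file and row size are illustrative): subtract the header from the total file size and require the remainder to divide evenly into rows.

>>> import os
>>> def getsize_sketch(filename, header_size, rowsize):
...     nbytes = os.path.getsize(filename) - header_size  # data bytes after skipping the header
...     if nbytes % rowsize != 0:
...         raise ValueError("fractional number of rows; header_size or rowsize is likely wrong")
...     return nbytes // rowsize
>>> getsize_sketch('data.bin', header_size=0, rowsize=12)  # 3 float32 columns -> 12 bytes per row
1000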