nbodykit.io.csv

Functions

make_partitions(filename, blocksize, config) Partition a CSV file into blocks, using the preferred blocksize in bytes
verify_data(path, names[, nrows]) Verify the data by reading the first few lines of the specified CSV file to determine the data type

Classes

CSVFile(path, names[, blocksize, dtype, …]) A file object to handle the reading of columns of data from a CSV file.
CSVPartition(filename, offset, blocksize, …) A simple class to convert byte strings of data from a CSV file to a pandas DataFrame on demand
class nbodykit.io.csv.CSVFile(path, names, blocksize=33554432, dtype={}, usecols=None, delim_whitespace=True, **config)[source]

A file object to handle the reading of columns of data from a CSV file.

Internally, this class partitions the CSV file into chunks, and data is only read from the relevant chunks of the file, using pandas.read_csv().

This setup provides a significant speed-up when reading from the end of the file, since the entirety of the data does not need to be read first.

The class supports any of the configuration keywords that can be passed to pandas.read_csv()

Warning

This class assumes that lines are separated by the newline character and that all columns in the file are data columns (i.e., there is no pandas “index” column)

Parameters:
  • path (str) – the name of the file to load
  • names (list of str) – the names of the columns of the csv file; this should give names of all the columns in the file – pass usecols to select a subset of columns
  • blocksize (int, optional) – the file will be partitioned into blocks of bytes roughly equal to this size
  • dtype (dict, str, optional) – if specified as a string, assume all columns have this dtype; otherwise, each column can have a dtype entry in the dict; if not specified, the data types will be inferred from the file
  • usecols (list, optional) – a pandas.read_csv() keyword; a subset of names to store, ignoring all other columns
  • delim_whitespace (bool, optional) – a pandas.read_csv keyword; if the CSV file is space-separated, set this to True
  • **config – additional keyword arguments that will be passed to pandas.read_csv(); see the documentation of that function for a full list of possible options
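
For example, a minimal sketch of opening a file, assuming a hypothetical space-separated file halos.csv with three columns named x, y, z (the file name, column names, and dtype are illustrative, not part of the library):

    from nbodykit.io.csv import CSVFile

    # 'halos.csv' is a hypothetical space-separated file; dtype='f8' tells
    # CSVFile to treat every column as float64 instead of inferring the types
    f = CSVFile('halos.csv', names=['x', 'y', 'z'], dtype='f8')

    print(f.columns)   # ['x', 'y', 'z']
    print(f.size)      # total number of rows, computed from the partitioned blocks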

Attributes

columns A list of the names of the columns in the file.
dtype A numpy.dtype object holding the data types of each column in the file.
ncol The number of data columns in the file.
shape The shape of the file, which defaults to (size, )
size The size of the file, i.e., number of rows

Methods

asarray() Return a view of the file, where the fields of the structured array are stored in columns of a single numpy array
get_dask(column[, blocksize]) Return the specified column as a dask array, which delays the explicit reading of the data until dask.compute() is called
keys() Aliased function to return columns
read(columns, start, stop[, step]) Read the specified column(s) over the given range
read(columns, start, stop, step=1)[source]

Read the specified column(s) over the given range

‘start’ and ‘stop’ should be between 0 and size, which is the total size of the file (in particles)

Parameters:
  • columns (str, list of str) – the name of the column(s) to return
  • start (int) – the row index at which to start reading
  • stop (int) – the row index at which to stop reading
  • step (int, optional) – the step size to use when reading; default is 1
Returns:

structured array holding the requested columns over the specified range of rows

Return type:

numpy.ndarray
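
As an illustration, continuing the hypothetical halos.csv file above (and assuming it has at least 100 rows), reading the last 100 rows of two columns might look like:

    from nbodykit.io.csv import CSVFile

    f = CSVFile('halos.csv', names=['x', 'y', 'z'], dtype='f8')  # hypothetical file

    # read the final 100 rows of the 'x' and 'y' columns; only the blocks
    # covering this range are parsed with pandas.read_csv()
    data = f.read(['x', 'y'], start=f.size - 100, stop=f.size)

    print(data.dtype.names)   # ('x', 'y')
    print(len(data))          # 100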

class nbodykit.io.csv.CSVPartition(filename, offset, blocksize, delimiter, **config)[source]

A simple class to convert byte strings of data from a CSV file to a pandas DataFrame on demand

The DataFrame is cached as the value attribute, so only a single call to pandas.read_csv() is made

Attributes

value Return the parsed byte string as a DataFrame
__init__(filename, offset, blocksize, delimiter, **config)[source]
Parameters:
  • filename (str) – the file to read data from
  • offset (int) – the offset in bytes to start reading at
  • blocksize (int) – the size of the bytes block to read
  • delimiter (byte str) – the byte string separating lines in the file
  • **config – the configuration keywords passed to pandas.read_csv()
value

Return the parsed byte string as a pandas DataFrame
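
CSVPartition objects are normally created internally by make_partitions(), but a rough sketch of direct use (the file name and pandas keywords below are assumptions) looks like:

    from nbodykit.io.csv import CSVPartition

    # parse roughly the first 1 MB block of a hypothetical space-separated file;
    # keyword arguments after 'delimiter' are forwarded to pandas.read_csv()
    part = CSVPartition('halos.csv', offset=0, blocksize=1024*1024,
                        delimiter=b'\n',
                        names=['x', 'y', 'z'], delim_whitespace=True, header=None)

    df = part.value    # the byte block is parsed into a DataFrame on first access
    df2 = part.value   # cached: no second call to pandas.read_csv()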

nbodykit.io.csv.make_partitions(filename, blocksize, config, delimiter='\n')[source]

Partition a CSV file into blocks, using the preferred blocksize in bytes, returning the partitions and the number of rows in each partition

This divides the input file into partitions with size roughly equal to blocksize, reads the bytes, and counts the number of delimiters to compute the size of each block

Parameters:
  • filename (str) – the name of the CSV file to load
  • blocksize (int) – the desired number of bytes per block
  • config (dict) – any keyword options to pass to pandas.read_csv()
  • delimiter (str, optional) – the character separating lines; default is the newline character
Returns:

  • partitions (list of CSVPartition) – list of objects storing the byte-string content of each file partition
  • sizes (list of int) – the number of rows in each partition
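
A minimal sketch of partitioning a file directly, again assuming the hypothetical halos.csv file and illustrative pandas keywords:

    from nbodykit.io.csv import make_partitions

    # keywords forwarded to pandas.read_csv() when each block is parsed
    config = {'names': ['x', 'y', 'z'], 'delim_whitespace': True, 'header': None}

    partitions, sizes = make_partitions('halos.csv', blocksize=1024*1024, config=config)

    print(len(partitions))   # number of ~1 MB blocks
    print(sum(sizes))        # total number of rows across all partitions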

nbodykit.io.csv.verify_data(path, names, nrows=10, **config)[source]

Verify the data by reading the first few lines of the specified CSV file to determine the data type

Parameters:
  • path (str) – the name of the CSV file to load
  • names (list of str) – the list of the names of the columns in the CSV file
  • nrows (int, optional) – the number of rows to read from the file in order to infer the data type; default is 10
  • **config (key, value pairs) – additional keywords to pass to pandas.read_csv()
Returns:

dtype – dictionary holding the dtype for each name in names

Return type:

dict
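
For example, the inferred dtypes of the hypothetical halos.csv file used above could be checked with:

    from nbodykit.io.csv import verify_data

    # infer a dtype for each column from the first 10 rows; extra keywords
    # (here delim_whitespace) are passed to pandas.read_csv()
    dtype = verify_data('halos.csv', names=['x', 'y', 'z'], nrows=10,
                        delim_whitespace=True)

    print(dtype)   # e.g. {'x': dtype('float64'), 'y': dtype('float64'), 'z': dtype('float64')}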