nbodykit.io.csv¶
Functions

make_partitions(filename, blocksize, config[, delimiter])
Partition a CSV file into blocks, using the preferred blocksize in bytes, returning the partitions and number of rows in each partition

verify_data(path, names[, nrows])
Verify the data by reading the first few lines of the specified CSV file to determine the data type

Classes

CSVFile(path, names[, blocksize, dtype, ...])
A file object to handle the reading of columns of data from a CSV file.

CSVPartition(filename, offset, blocksize, ...)
A simple class to convert byte strings of data from a CSV file to a pandas DataFrame on demand
- class nbodykit.io.csv.CSVFile(path, names, blocksize=33554432, dtype={}, usecols=None, delim_whitespace=True, **config)[source]¶
A file object to handle the reading of columns of data from a CSV file.
Internally, this class partitions the CSV file into chunks, and data is only read from the relevant chunks of the file, using pandas.read_csv(). This setup provides a significant speed-up when reading from the end of the file, since the entirety of the data does not need to be read first.

The class supports any of the configuration keywords that can be passed to pandas.read_csv().

Warning

This assumes the delimiter for separate lines is the newline character and that all columns in the file represent data columns (no “index” column when using pandas).

- Parameters
path (str) – the name of the file to load

names (list of str) – the names of the columns of the csv file; this should give the names of all the columns in the file – pass usecols to select a subset of columns

blocksize (int, optional) – the file will be partitioned into blocks of bytes roughly of this size

dtype (dict, str, optional) – if specified as a string, assume all columns have this dtype; otherwise, each column can have a dtype entry in the dict; if not specified, the data types will be inferred from the file

usecols (list, optional) – a pandas.read_csv keyword; a subset of names to store, ignoring all other columns

delim_whitespace (bool, optional) – a pandas.read_csv keyword; if the CSV file is space-separated, set this to True

**config – additional keyword arguments that will be passed to pandas.read_csv(); see the documentation of that function for a full list of possible options
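For example, a minimal sketch of opening a whitespace-separated catalog; the file name data.csv, the column names, and the printed values here are purely illustrative:

>>> from nbodykit.io.csv import CSVFile
>>> ff = CSVFile('data.csv', names=['ra', 'dec', 'z'], delim_whitespace=True)
>>> ff.size          # total number of rows in the file
1000
>>> ff.columns
['ra', 'dec', 'z']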
- Attributes

columns
A list of the names of the columns in the file

dtype
A numpy.dtype object holding the data types of each column in the file

ncol
The number of data columns in the file

shape
The shape of the file

size
The size of the file, i.e., number of rows

Methods

asarray()
Return a view of the file, where the fields of the structured array are stacked in columns of a single numpy array

get_dask(column[, blocksize])
Return the specified column as a dask array, which delays the explicit reading of the data until dask.compute() is called

keys()
Aliased function to return columns

read(columns, start, stop[, step])
Read the specified column(s) over the given range
- __getitem__(s)¶
This function provides numpy-like array indexing of the file object.
It supports:
integer, slice-indexing similar to arrays
string indexing using column names in keys()
array-like indexing using integer lists or boolean arrays
Note
If a single column is being returned, a numpy array holding the data is returned, rather than a structured array with only a single field.
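As a short illustration of these indexing forms (a sketch, reusing the hypothetical ff object with columns ra, dec, and z from the constructor example above):

>>> rows = ff[:3]               # slice indexing: a structured array of the first three rows
>>> ra = ff['ra'][:3]           # string indexing: a single column as a plain numpy array
>>> sub = ff[['dec', 'ra']]     # a view of the file holding a subset of columns, reordered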
- asarray()¶
Return a view of the file, where the fields of the structured array are stacked in columns of a single numpy array
Examples
Start with a file object with three named columns, ra, dec, and z:

>>> ff.dtype
dtype([('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')])
>>> ff.shape
(1000,)
>>> ff.columns
['ra', 'dec', 'z']
>>> ff[:3]
array([(235.63442993164062, 59.39099884033203, 0.6225500106811523),
       (140.36181640625, -1.162310004234314, 0.5026500225067139),
       (129.96627807617188, 45.970130920410156, 0.4990200102329254)],
      dtype=(numpy.record, [('ra', '<f4'), ('dec', '<f4'), ('z', '<f4')]))

Select a subset of columns, switch the ordering, and convert the output to a single numpy array:

>>> x = ff[['dec', 'ra']].asarray()
>>> x.dtype
dtype('float32')
>>> x.shape
(1000, 2)
>>> x.columns
['dec', 'ra']
>>> x[:3]
array([[  59.39099884,  235.63442993],
       [  -1.16231   ,  140.36181641],
       [  45.97013092,  129.96627808]], dtype=float32)

Now, select only the first column (dec):

>>> dec = x[:,0]
>>> dec[:3]
array([ 59.39099884,  -1.16231   ,  45.97013092], dtype=float32)
- Returns
a file object that will return a numpy array with the columns representing the fields
- Return type
- property columns¶
A list of the names of the columns in the file.
This defaults to the named fields in the file’s dtype attribute, but can differ from this if a view of the file has been returned with asarray().
- property dtype¶
A numpy.dtype object holding the data types of each column in the file.
- get_dask(column, blocksize=None)¶
Return the specified column as a dask array, which delays the explicit reading of the data until dask.compute() is called.

The dask array is chunked into blocks of size blocksize.
- Parameters
- Returns
the dask array holding the column, which computes the necessary functions to read the data, but delays evaluating until the user specifies
- Return type
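A rough sketch of the deferred evaluation (again using the hypothetical ff object from the examples above; dask must be installed):

>>> ra = ff.get_dask('ra')          # no data is read yet; ra is a dask array
>>> mean_ra = ra.mean()             # operations build up a lazy task graph
>>> result = mean_ra.compute()      # the file is only read when compute() is called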
- property ncol¶
The number of data columns in the file.
- read(columns, start, stop, step=1)[source]¶
Read the specified column(s) over the given range
‘start’ and ‘stop’ should be between 0 and size, which is the total size of the file (in particles).

- Parameters
- Returns
structured array holding the requested columns over the specified range of rows
- Return type
numpy.array
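For example (a sketch, using the hypothetical ff object from the examples above):

>>> data = ff.read(['ra', 'dec'], 0, 100)    # rows 0 through 99 of the 'ra' and 'dec' columns
>>> data.dtype.names
('ra', 'dec')
>>> len(data)
100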
- property shape¶
The shape of the file, which defaults to (size, ).

Multiple dimensions can be introduced into the shape if a view of the file has been returned with asarray().
- property size¶
The size of the file, i.e., number of rows
- class nbodykit.io.csv.CSVPartition(filename, offset, blocksize, delimiter, **config)[source]¶
A simple class to convert byte strings of data from a CSV file to a pandas DataFrame on demand
The DataFrame is cached as value, so only a single call to pandas.read_csv() is used.

- Attributes

value
Return the parsed byte string as a DataFrame
- __init__(filename, offset, blocksize, delimiter, **config)[source]¶
- Parameters
filename (str) – the file to read data from
offset (int) – the offset in bytes to start reading at
blocksize (int) – the size of the bytes block to read
delimiter (byte str) – how to distinguish separate lines
**config – the configuration keywords passed to pandas.read_csv()
- property value¶
Return the parsed byte string as a DataFrame
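A rough sketch of the on-demand parsing; the file name, offset, blocksize, and column names below are purely illustrative:

>>> from nbodykit.io.csv import CSVPartition
>>> part = CSVPartition('data.csv', offset=0, blocksize=1024, delimiter=b'\n',
...                     names=['ra', 'dec', 'z'], delim_whitespace=True)
>>> df = part.value      # the byte block is read and parsed on first access
>>> df2 = part.value     # later accesses return the cached DataFrame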
- nbodykit.io.csv.make_partitions(filename, blocksize, config, delimiter='\n')[source]¶
Partition a CSV file into blocks, using the preferred blocksize in bytes, returning the partitions and number of rows in each partition
This divides the input file into partitions with size roughly equal to blocksize, reads the bytes, and counts the number of delimiters to compute the size of each block
- Parameters
filename (str) – the name of the CSV file to load
blocksize (int) – the desired number of bytes per block
delimiter (str, optional) – the character separating lines; default is the newline character
config (dict) – any keyword options to pass to pandas.read_csv()
- Returns
partitions (list of CSVPartition) – list of objects storing the data content of each file partition, stored as a bytestring
sizes (list of int) – the list of the number of rows in each partition
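A brief sketch of partitioning a file and adding up the row counts (the file name, column names, and blocksize are illustrative):

>>> from nbodykit.io.csv import make_partitions
>>> config = {'names': ['ra', 'dec', 'z'], 'delim_whitespace': True}
>>> partitions, sizes = make_partitions('data.csv', 32*1024*1024, config)
>>> total_rows = sum(sizes)      # total number of rows across all partitions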
- nbodykit.io.csv.verify_data(path, names, nrows=10, **config)[source]¶
Verify the data by reading the first few lines of the specified CSV file to determine the data type
- Parameters
path (str) – the name of the CSV file to load
names (list of str) – the list of the names of the columns in the CSV file
nrows (int, optional) – the number of rows to read from the file in order to infer the data type; default is 10
**config (key, value pairs) – additional keywords to pass to pandas.read_csv()
- Returns
dtype – dictionary holding the dtype for each name in names
- Return type
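For instance, a sketch of inferring the column dtypes from the first few rows (the file and column names are illustrative):

>>> from nbodykit.io.csv import verify_data
>>> dtype = verify_data('data.csv', ['ra', 'dec', 'z'], delim_whitespace=True)
>>> sorted(dtype.keys())
['dec', 'ra', 'z']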