Data I/O¶
Between client and server¶
Arkouda is designed to integrate with NumPy and Pandas, with arkouda handling large, distributed data in parallel while receiving and sending smaller input and output data to/from Python as NumPy ndarray objects. A common arkouda workflow looks like:
Load in a large dataset with arkouda
Enter or create a small NumPy array with user data to compare against the large dataset
Convert the NumPy array to an arkouda array (transferring the data to the server)
Run computations that filter or summarize the large dataset
Pass the smaller result set back to Python as a NumPy array for plotting or inspection
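A minimal sketch of that round trip, assuming a running arkouda server reachable with the default connection settings (the dataset and filter values are illustrative):

>>> import arkouda as ak
>>> import numpy as np
>>> ak.connect()  # connect to a running arkouda server
>>> big = ak.randint(0, 2**32, 10**8)   # large dataset lives on the server
>>> keys = ak.array(np.array([7, 42]))  # small NumPy array becomes a server-side pdarray
>>> hits = big[ak.in1d(big, keys)]      # filter runs on the server
>>> small = hits.to_ndarray()           # small result returns to Python as a NumPy ndarray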
Below are the functions that enable both sides of this transfer.
- arkouda.array(a, dtype=None, max_bits=-1)[source]¶
Convert a Python or NumPy iterable to a pdarray or Strings object, sending the corresponding data to the arkouda server.
- Parameters:
a (pdarray, np.ndarray, or Python Iterable) – Rank-1 array of a supported dtype
dtype (np.dtype, type, or str, optional) – The target dtype to cast values to
max_bits (int) – The maximum number of bits; only used for bigint pdarrays
- Returns:
A pdarray instance stored on the arkouda server or a Strings instance, which is composed of two pdarrays stored on the arkouda server
- Return type:
pdarray or Strings
- Raises:
TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a list, array, tuple, or deque
RuntimeError – Raised if a is not one-dimensional, nbytes > maxTransferBytes, a.dtype is not supported (not in DTypes), or if the product of a.size and a.itemsize > maxTransferBytes
ValueError – Raised if the returned message is malformed or does not contain the fields required to generate the array.
See also
pdarray.to_ndarray, Strings.to_ndarray
Notes
The number of bytes in the input array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overwhelming the connection between the Python client and the arkouda server, under the assumption that it is a low-bandwidth connection. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
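For instance, a hedged sketch of raising the cap before a large transfer (the 1 GiB value is purely illustrative):

>>> import arkouda as ak
>>> ak.client.maxTransferBytes = 2**30  # allow transfers of up to 1 GiB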
If the pdarray or ndarray is of dtype U (Unicode string), this method is called twice recursively to create the Strings object and the two corresponding pdarrays for string bytes and offsets, respectively.
Examples
>>> ak.array(np.arange(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> ak.array(range(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> strings = ak.array([f'string {i}' for i in range(0,5)])
>>> type(strings)
<class 'arkouda.strings.Strings'>
- arkouda.pdarray.to_ndarray(self)¶
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the number of bytes received does not match the expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but should proceed with caution.
See also
array, to_list
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
- arkouda.Strings.to_ndarray(self)¶
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds ak.client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
See also
array, to_list
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
Large Datasets¶
Supported File Formats¶
- HDF5 (default file format)
- Parquet (requires pyarrow)
Data Preprocessing¶
Arkouda is designed to work primarily with columnar data spread across multiple files of non-uniform size. All disk-based I/O uses HDF5 or Parquet file format and associates each column of data with an HDF5/Parquet dataset present at the root level of all files.
Files are processed in parallel with one file per locale. While HDF5 has an MPI layer for concurrent reading and writing of a single file from multiple nodes, arkouda does not yet support this functionality.
Because most data does not come in HDF5/Parquet format, the arkouda developers use arkouda in conjunction with several data preprocessing pipelines. While each dataset requires a unique conversion strategy, all preprocessing should:
Transpose row-based formats (e.g. CSV) to columns and output each column as an HDF5 dataset
NOT aggregate input files too aggressively, but keep them separate to enable parallel I/O (hundreds or thousands of files is appropriate, in our experience)
Convert text to numeric types where possible
Much of this preprocessing can be accomplished with the Pandas read_* functions for ingest and the h5py or pyarrow modules for output.
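As a hedged illustration of such a pipeline (the file and column names are hypothetical, and text columns are assumed to have been converted to numeric types already), one chunk of a CSV-to-HDF5 conversion might look like:

import pandas as pd
import h5py

# Hypothetical row-based input with numeric columns.
df = pd.read_csv('chunk_0042.csv')

# Write each column as a root-level HDF5 dataset, one output file per
# input chunk, so arkouda can later read the chunks in parallel.
with h5py.File('chunk_0042.h5', 'w') as f:
    for col in df.columns:
        f.create_dataset(col, data=df[col].to_numpy())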
Reading data from disk¶
- arkouda.read(filenames, datasets=None, iterative=False, strictTypes=True, allow_errors=False, calc_string_offsets=False, column_delim=',', read_nested=True, has_non_float_nulls=False, fixed_len=-1)[source]¶
Read datasets from files. File Type is determined automatically.
- Parameters:
filenames (list or str) – Either a list of filenames or a shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True. When True, SegArray objects will be read from the file. When False, SegArray (and other nested Parquet columns) will be ignored. Ignored if datasets is not None. Parquet files only.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the length of each string is known at runtime. This can allow for skipping byte calculation, which can have an impact on performance.
- Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
- Return type:
Dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda SegArrays
- Raises:
RuntimeError – If invalid filetype is detected
See also
get_datasets
Notes
If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.
If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets in HDF5/Parquet files.
CSV files without the Arkouda header are not supported.
Examples
>>> x = ak.read('path/name_prefix.h5')  # load HDF5; processing determines the file type, not the extension
>>> x = ak.read('path/name_prefix.parquet')  # load Parquet
>>> x = ak.read('path/name_prefix*')  # reads HDF5 files matching the glob expression
For convenience, multiple datasets can be read in to create a dictionary of pdarrays.
HDF5/Parquet files can be queried via the server for dataset names and sizes.
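For example, a short sketch (the dataset names 'src' and 'dst' are hypothetical):

>>> ak.get_datasets('path/name_prefix*')
['src', 'dst']
>>> data = ak.read('path/name_prefix*', datasets=['src', 'dst'])
>>> src, dst = data['src'], data['dst']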
- arkouda.get_datasets(filenames, allow_errors=False, column_delim=',', read_nested=True)[source]¶
Get the names of the datasets in the provided files
- Parameters:
filenames (str or List[str]) – Name of the file(s) from which to return datasets
allow_errors (bool) – Default: False Whether or not to allow errors while accessing datasets
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.
- Return type:
List[str] of names of the datasets
- Raises:
RuntimeError – Raised if no datasets are returned
Notes
This function currently supports HDF5 and Parquet formats. Future updates to Parquet will deprecate this functionality on that format, but similar support will be added for Parquet at that time.
If a list of files is provided, only the datasets in the first file will be returned.
See also
read
Import/Export¶
Import allows users to import data written by pandas into arkouda. Export allows users to write arkouda data into a format pandas can read. The file formats supported are:
HDF5
Parquet
These save formats are customizable, allowing schemas to be created to fit specific needs. As a result, a file written by Arkouda cannot always be read by other applications. The import/export features of Arkouda allow files to be reformatted for reading by Pandas and vice versa.
Import: takes a file that was saved using Pandas and reads it into Arkouda. The user can specify whether to save the result to a file that can be read by Arkouda and/or return the resulting Arkouda object.
Export: takes a file that was saved using Arkouda and reads it into Pandas. The user can specify whether to save the result to a file that can be read by Pandas and/or return the resulting Pandas object.
Note: If the file being read in is Parquet, the resulting file that can be read by Arkouda will also be Parquet. This is also true for HDF5.
This functionality is currently performed on the client and assumes that dataset sizes can be handled in the client's memory, since the data was written by Pandas. Arkouda natively verifies the size of data before writing it to the client, so exports are limited.
- arkouda.import_data(read_path, write_file=None, return_obj=True, index=False)[source]¶
Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.
- Parameters:
read_path (str) – path to the file where pandas data is stored. This can be a glob expression for Parquet formats.
write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided
return_obj (bool, optional) – Default True. When True return the Arkouda DataFrame object, otherwise return None
index (bool, optional) – Default False. When True, maintain the indexes loaded from the pandas file
- Raises:
RuntimeWarning – Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.
RuntimeError – Unsupported file type
- Returns:
When return_obj=True
- Return type:
pd.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.read_parquet, pandas.read_hdf, ak.export
Notes
Import can only be performed from hdf5 or parquet files written by pandas.
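A hedged usage sketch (file paths are hypothetical):

>>> import pandas as pd
>>> pd.DataFrame({'x': [1, 2, 3]}).to_parquet('pd_data.parquet')
>>> ak_df = ak.import_data('pd_data.parquet', write_file='ak_data.parquet')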
- arkouda.export(read_path, dataset_name='ak_data', write_file=None, return_obj=True, index=False)[source]¶
Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas
- Parameters:
read_path (str) – path to file where arkouda data is stored.
dataset_name (str) – name to store dataset under
index (bool) – Default False. When True, maintain the indexes loaded from the arkouda file
write_file (str, optional) – path to file to write pandas formatted data to. Only write the file if this is set
return_obj (bool, optional) – Default True. When True return the Pandas DataFrame object, otherwise return None
- Raises:
RuntimeError – Unsupported file type
- Returns:
When return_obj=True
- Return type:
pd.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.read_parquet, pandas.read_hdf, ak.import_data
Notes
If an Arkouda file is exported for pandas, the format will not change. This means Parquet files will remain Parquet and HDF5 will remain HDF5.
Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.
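A hedged usage sketch (file paths are hypothetical; 'ak_data.parquet' is assumed to have been written by Arkouda):

>>> pd_df = ak.export('ak_data.parquet', dataset_name='ak_data', write_file='pd_ready.parquet')
>>> import pandas as pd
>>> pd.read_parquet('pd_ready.parquet')  # the reformatted file is readable by pandas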