Data I/O¶

Between client and server¶

Arkouda is designed to integrate with NumPy and Pandas, with arkouda handling large, distributed data in parallel while receiving and sending smaller input and output data to/from Python as NumPy ndarray objects. A common arkouda workflow looks like

Load in a large dataset with arkouda
Enter or create a small NumPy array with user data to compare against the large dataset
Convert the NumPy array to an arkouda array (transferring the data to the server)
Run computations that filter or summarize the large dataset
Pass the smaller result set back to Python as a NumPy array for plotting or inspection

Below are the functions that enable both sides of this transfer.

arkouda.array(a, dtype=None, copy=False, max_bits=-1)[source]¶

Convert a Python, NumPy, or Arkouda array-like into a pdarray or Strings object, transferring data to the Arkouda server.

Parameters:

a (Union[pdarray, np.ndarray, Iterable, Strings]) – The array-like input to convert. Supported types include Arkouda Strings, pdarray, NumPy ndarray, or Python iterables such as list, tuple, range, or deque.
dtype (Union[np.dtype, type, str], optional) – The target dtype to cast values to. This may be a NumPy dtype object, a NumPy scalar type (e.g. np.int64), or a string (e.g. ‘int64’, ‘str’).
copy (bool, default=False) – If True, a deep copy of the array is made. If False, no copy is made if the input is already a pdarray. Note: Arkouda does not currently support views or shallow copies. This differs from NumPy. Also, the default (False) is chosen to reduce performance overhead.
max_bits (int, optional) – The maximum number of bits for bigint arrays. Ignored for other dtypes.

Returns:

A pdarray stored on the Arkouda server, or a Strings object.

Return type:

Union[pdarray, Strings]

Raises:

TypeError –
- If a is not a pdarray, np.ndarray, or Python iterable. - If a is of string type and dtype is not ak.str_.
RuntimeError –
- If input size exceeds ak.client.maxTransferBytes. - If a.dtype is unsupported or incompatible with Arkouda. - If a.size * a.itemsize > maxTransferBytes.
ValueError –
- If a’s rank is not supported (see get_array_ranks()). - If the server response is malformed or missing required fields.

See also

pdarray.to_ndarray: Convert back from Arkouda to NumPy.

Notes

Arkouda does not currently support shallow copies or views; all copies are deep.
The number of bytes transferred to the server is limited by ak.client.maxTransferBytes. This prevents saturating the network during large transfers. To increase this limit, set ak.client.maxTransferBytes to a larger value manually.
If the input is a Unicode string array (dtype.kind == ‘U’ or dtype=’str’), this function recursively creates a Strings object from two internal `pdarray`s (one for offsets and one for concatenated string bytes).

Examples

>>> import arkouda as ak
>>> ak.array(np.arange(1, 10))
array([1 2 3 4 5 6 7 8 9])

>>> ak.array(range(1, 10))
array([1 2 3 4 5 6 7 8 9])

>>> strings = ak.array([f'string {i}' for i in range(5)])
>>> type(strings)
<class 'arkouda.numpy.strings.Strings'>

arkouda.pdarray.to_ndarray(self)¶

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:: A numpy ndarray with the same attributes and data as the pdarray
Return type:: np.ndarray
Raises:: RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, tolist

Examples

>>> import arkouda as ak
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
<class 'numpy.ndarray'>

Large Datasets¶

Supported File Formats¶

HDF5
- Default File Format
Parquet
- Requires pyarrow

Data Preprocessing¶

Arkouda is designed to work primarily with columnar data spread across multiple files of non-uniform size. All disk-based I/O uses HDF5 or Parquet file format and associates each column of data with an HDF5/Parquet dataset present at the root level of all files.

Files are processed in parallel with one file per locale. While HDF5 has an MPI layer for concurrent reading and writing of a single file from multiple nodes, arkouda does not yet support this functionality.

Because most data does not come in HDF5/Parquet format, the arkouda developers use arkouda in conjunction with several data preprocessing pipelines. While each dataset requires a unique conversion strategy, all preprocessing should:

Transpose row-based formats (e.g. CSV) to columns and output each column as an HDF5 dataset
NOT aggregate input files too aggressively, but keep them separate to enable parallel I/O (hundreds or thousands of files is appropriate, in our experience)
Convert text to numeric types where possible

Much of this preprocessing can be accomplished with the Pandas read* functions for ingest and the h5py or pyarrow module for output. See this example for ideas.

Reading data from disk¶

arkouda.read(filenames, datasets=None, iterative=False, strictTypes=True, allow_errors=False, calc_string_offsets=False, column_delim=',', read_nested=True, has_non_float_nulls=False, fixed_len=-1)[source]¶

Read datasets from files.

File Type is determined automatically.

Parameters:

filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None Parquet Files only.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the length of each string is known at runtime. This can allow for skipping byte calculation, which can have an impact on performance.

Returns:

Dictionary of {datasetName: pdarray, String, or SegArray}

Return type:

Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.

Raises:

RuntimeError – If invalid filetype is detected

Notes

If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.

If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.

If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets to HDF5/Parquet files.

CSV files without the Arkouda Header are not supported.

Examples

>>> import arkouda as ak

Read with file Extension load HDF5 - processing determines file type not extension >>> x = ak.read(‘path/name_prefix.h5’) # doctest: +SKIP

Read without file Extension load Parquet >>> x = ak.read(‘path/name_prefix.parquet’) # doctest: +SKIP

Read Glob Expression Reads HDF5 >>> x = ak.read(‘path/name_prefix*’) # doctest: +SKIP

For convenience, multiple datasets can be read in to create a dictionary of pdarrays.

HDF5/Parquet files can be queried via the server for dataset names and sizes.

arkouda.get_datasets(filenames, allow_errors=False, column_delim=',', read_nested=True)[source]¶

Get the names of the datasets in the provide files.

Parameters:

filenames (str or List[str]) – Name of the file/s from which to return datasets
allow_errors (bool) – Default: False Whether or not to allow errors while accessing datasets
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.

Return type:

List[str] of names of the datasets

Raises:

RuntimeError –

If no datasets are returned

Notes

This function currently supports HDF5 and Parquet formats.
Future updates to Parquet will deprecate this functionality on that format,

but similar support will be added for Parquet at that time. - If a list of files is provided, only the datasets in the first file will be returned

See also

ls

Import/Export¶

Import allows users to import data written by pandas into arkouda. Export allows users to write arkouda data into a format pandas can read. The file formats supported are:

HDF5
Parquet

These save formats are customizable and allow for schemas to be created to fit specific needs. As a result, a file written by Arkouda is not always able to be read by other applications. The import/export features of Arkouda allow for files to be reformatted for reading by Pandas and vice versa.

Import Importing data takes a file that was saved using Pandas and reads it into Arkouda. The user is able to specify if they would like to save the result to a file that can be read by Arkouda and/or return the resulting Arkouda object.

Export Export takes a file taht was saved using Arkouda and reads it into Pandas. The user is able to specify if they would like to save the result to a file that can be read by Pandas and/or return the resulting Pandas object.

Note: If the file being read in is Parquet, the resulting file that can be read by Arkouda will also be Parquet. This is also true for HDF5.

This functionality is currently performed on the client and is assuming that dataset sizes are able to be handled in the client due to being written by Pandas. Arkouda natively verifies the size of data before writing it to the client, so exports are limited.

arkouda.import_data(read_path, write_file=None, return_obj=True, index=False)[source]¶

Import data from a file saved by Pandas (HDF5/Parquet).

Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.

Parameters:

read_path (str) – path to file where pandas data is stored. This can be glob expression for parquet formats.
write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided
return_obj (bool, optional) – Default True. When True return the Arkouda DataFrame object, otherwise return None
index (bool, optional) – Default False. When True, maintain the indexes loaded from the pandas file

Raises:

RuntimeWarning –
- Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.
RuntimeError –
- Unsupported file type

Returns:

When return_obj=True

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.export

Notes

Import can only be performed from hdf5 or parquet files written by pandas.

arkouda.export(read_path, dataset_name='ak_data', write_file=None, return_obj=True, index=False)[source]¶

Export data from arkouda to pandas.

Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas.

Parameters:

read_path (str) – path to file where arkouda data is stored.
dataset_name (str) – name to store dataset under
index (bool) – Default False. When True, maintain the indexes loaded from the pandas file
write_file (str, optional) – path to file to write pandas formatted data to. Only write the file if this is set
return_obj (bool, optional) – Default True. When True return the Pandas DataFrame object, otherwise return None

Raises:

RuntimeError –

Unsupported file type

Returns:

When return_obj=True

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.import_data

Notes

If Arkouda file is exported for pandas, the format will not change. This mean parquet files will remain parquet and hdf5 will remain hdf5.
Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.