Data I/O

Between client and server

Arkouda is designed to integrate with NumPy and Pandas: arkouda handles large, distributed data in parallel on the server, while smaller input and output data move to and from Python as NumPy ndarray objects. A common arkouda workflow (sketched in code after this list) looks like

  1. Load in a large dataset with arkouda

  2. Enter or create a small NumPy array with user data to compare against the large dataset

  3. Convert the NumPy array to an arkouda array (transferring the data to the server)

  4. Run computations that filter or summarize the large dataset

  5. Pass the smaller result set back to Python as a NumPy array for plotting or inspection
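
A minimal sketch of that workflow, assuming a running arkouda server; the file pattern and column names ('transactions*.h5', 'user_id', 'amount') and the specific filter are illustrative, not prescriptive:

import numpy as np
import arkouda as ak

ak.connect()  # connect to a running arkouda_server

# 1. Load a large dataset with arkouda
data = ak.read('transactions*.h5')          # dict of column name -> arkouda array
user_id, amount = data['user_id'], data['amount']

# 2. Small NumPy array of user data to compare against the large dataset
targets_np = np.array([17, 42, 1234])

# 3. Convert the NumPy array to an arkouda array (data moves to the server)
targets = ak.array(targets_np)

# 4. Filter and summarize the large dataset on the server
mask = ak.in1d(user_id, targets)            # membership test runs in parallel on the server
matched = amount[mask]

# 5. Pass the small result back to Python as a NumPy array
result = matched.to_ndarray()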

Below are the functions that enable both sides of this transfer.

arkouda.array(a, dtype=None, max_bits=-1)[source]

Convert a Python or Numpy Iterable to a pdarray or Strings object, sending the corresponding data to the arkouda server.

Parameters:
  • a (Union[pdarray, np.ndarray]) – Rank-1 array of a supported dtype

  • dtype (np.dtype, type, or str) – The target dtype to cast values to

  • max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays

Returns:

A pdarray instance stored on arkouda server or Strings instance, which is composed of two pdarrays stored on arkouda server

Return type:

pdarray or Strings

Raises:
  • TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a list, array, tuple, or deque

  • RuntimeError – Raised if a is not one-dimensional, nbytes > maxTransferBytes, a.dtype is not supported (not in DTypes), or if the product of a.size and a.itemsize > maxTransferBytes

  • ValueError – Raised if the returned message is malformed or does not contain the fields required to generate the array.

Notes

The number of bytes in the input array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overwhelming the connection between the Python client and the arkouda server, under the assumption that it is a low-bandwidth connection. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
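
For example, a user who knows a larger transfer is intentional can raise the limit before calling ak.array; a minimal sketch (the 2 GiB value and the array contents are illustrative):

import numpy as np
import arkouda as ak

ak.client.maxTransferBytes = 2 * 1024**3   # raise the client transfer cap to ~2 GiB
big = ak.array(np.arange(2 * 10**8))       # ~1.6 GB of int64, permitted under the new cap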

If the pdarray or ndarray is of type U, this method is called twice recursively to create the Strings object and the two corresponding pdarrays for string bytes and offsets, respectively.

Examples

>>> ak.array(np.arange(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> ak.array(range(1,10))
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> strings = ak.array([f'string {i}' for i in range(0,5)])
>>> type(strings)
<class 'arkouda.strings.Strings'>
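
The dtype and max_bits parameters can also be supplied at creation time; a hedged sketch (bigint support depends on the server build, and the values shown are illustrative):

>>> a = ak.array([1, 2, 3], dtype=ak.float64)  # cast values to float64 on the server
>>> b = ak.array([2**200, 2**200 + 1], dtype=ak.bigint, max_bits=256)  # bigint pdarray capped at 256 bits
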
arkouda.pdarray.to_ndarray(self)

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the number of bytes received does not match the expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
arkouda.Strings.to_ndarray(self)

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray with the same strings as this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray

Large Datasets

Supported File Formats

  • HDF5
    • Default File Format

  • Parquet
    • Requires pyarrow

Data Preprocessing

Arkouda is designed to work primarily with columnar data spread across multiple files of non-uniform size. All disk-based I/O uses HDF5 or Parquet file format and associates each column of data with an HDF5/Parquet dataset present at the root level of all files.

Files are processed in parallel with one file per locale. While HDF5 has an MPI layer for concurrent reading and writing of a single file from multiple nodes, arkouda does not yet support this functionality.

Because most data does not come in HDF5/Parquet format, the arkouda developers use arkouda in conjunction with several data preprocessing pipelines. While each dataset requires a unique conversion strategy, all preprocessing should:

  • Transpose row-based formats (e.g. CSV) to columns and output each column as an HDF5 dataset

  • NOT aggregate input files too aggressively, but keep them separate to enable parallel I/O (hundreds or thousands of files is appropriate, in our experience)

  • Convert text to numeric types where possible

Much of this preprocessing can be accomplished with the Pandas read* functions for ingest and the h5py or pyarrow modules for output, as sketched below.
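
A minimal preprocessing sketch along those lines, assuming row-based CSV inputs with hypothetical names ('raw/part_*.csv') and numeric columns; each CSV becomes one HDF5 file with one root-level dataset per column, and the files stay separate to preserve parallel I/O:

import glob
import pandas as pd
import h5py

for csv_path in glob.glob('raw/part_*.csv'):
    df = pd.read_csv(csv_path)
    with h5py.File(csv_path.replace('.csv', '.h5'), 'w') as f:
        # one root-level HDF5 dataset per numeric column
        for col in df.select_dtypes(include='number').columns:
            f.create_dataset(col, data=df[col].to_numpy())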

Reading data from disk

arkouda.read(filenames, datasets=None, iterative=False, strictTypes=True, allow_errors=False, calc_string_offsets=False, column_delim=',', read_nested=True, has_non_float_nulls=False, fixed_len=-1)[source]

Read datasets from files. File Type is determined automatically.

Parameters:
  • filenames (list or str) – Either a list of filenames or shell expression

  • datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)

  • iterative (bool) – Iterative (True) or Single (False) function call(s) to server

  • strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None. Parquet files only.

  • has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.

  • fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the length of each string is known at runtime. This can allow for skipping byte calculation, which can have an impact on performance.

Returns:

Dictionary of {datasetName: pdarray, Strings, or SegArray}

Return type:

Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda SegArrays.

Raises:

RuntimeError – If invalid filetype is detected

Notes

If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.

If iterative == True, each dataset name and file name are passed to the server as independent sequential strings; if iterative == False, all dataset names and file names are passed to the server in a single string.

If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets in HDF5/Parquet files.

CSV files without the Arkouda Header are not supported.

Examples

Read with file extension

>>> x = ak.read('path/name_prefix.h5')  # load HDF5 - processing determines file type, not extension

Read without file extension

>>> x = ak.read('path/name_prefix.parquet')  # load Parquet

Read with glob expression

>>> x = ak.read('path/name_prefix*')  # reads HDF5

For convenience, multiple datasets can be read in to create a dictionary of pdarrays.
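
For example (the glob pattern and dataset names below are hypothetical):

>>> data = ak.read('path/files_*', datasets=['col_A', 'col_B'])
>>> col_a = data['col_A']  # each value is a pdarray, Strings, or SegArray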

HDF5/Parquet files can be queried via the server for dataset names and sizes.

arkouda.get_datasets(filenames, allow_errors=False, column_delim=',', read_nested=True)[source]

Get the names of the datasets in the provided files

Parameters:
  • filenames (str or List[str]) – Name of the file/s from which to return datasets

  • allow_errors (bool) – Default False. Whether or not to allow errors while accessing datasets

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.

Return type:

List[str] of names of the datasets

Raises:

RuntimeError

  • If no datasets are returned

Notes

  • This function currently supports HDF5 and Parquet formats.

  • Future updates to Parquet will deprecate this functionality on that format, but similar support will be added for Parquet at that time.

  • If a list of files is provided, only the datasets in the first file will be returned.

See also

ls
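
A hedged usage sketch (the file name and dataset names are illustrative; actual output depends on the file contents):

>>> ak.get_datasets('path/name_prefix_0000.h5')
['col_A', 'col_B']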

Import/Export

Import allows users to import data written by pandas into arkouda. Export allows users to write arkouda data into a format pandas can read. The file formats supported are:

  • HDF5

  • Parquet

These save formats are customizable and allow for schemas to be created to fit specific needs. As a result, a file written by Arkouda is not always able to be read by other applications. The import/export features of Arkouda allow for files to be reformatted for reading by Pandas and vice versa.

Import

Importing data takes a file that was saved using Pandas and reads it into Arkouda. The user is able to specify if they would like to save the result to a file that can be read by Arkouda and/or return the resulting Arkouda object.

Export

Export takes a file that was saved using Arkouda and reads it into Pandas. The user is able to specify if they would like to save the result to a file that can be read by Pandas and/or return the resulting Pandas object.

Note: If the file being read in is Parquet, the resulting file that can be read by Arkouda will also be Parquet. This is also true for HDF5.

This functionality is currently performed on the client and assumes that the dataset sizes can be handled in client memory, since the files were written by Pandas. Arkouda natively verifies the size of data before sending it to the client, so exports are limited accordingly.
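
A minimal round-trip sketch, assuming 'pandas_file.h5' was written with pandas.DataFrame.to_hdf and the other paths are writable (all names are illustrative):

import arkouda as ak

# pandas-written file -> arkouda DataFrame, plus an arkouda-readable copy on disk
ak_df = ak.import_data('pandas_file.h5', write_file='ak_file.h5')

# arkouda-written file -> pandas DataFrame, plus a pandas-readable copy on disk
pd_df = ak.export('ak_file.h5', dataset_name='ak_data', write_file='pandas_copy.h5')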

arkouda.import_data(read_path, write_file=None, return_obj=True, index=False)[source]

Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.

Parameters:
  • read_path (str) – path to file where pandas data is stored. This can be a glob expression for parquet formats.

  • write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided

  • return_obj (bool, optional) – Default True. When True return the Arkouda DataFrame object, otherwise return None

  • index (bool, optional) – Default False. When True, maintain the indexes loaded from the pandas file

Raises:
  • RuntimeWarning

    • Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.

  • RuntimeError

    • Unsupported file type

Returns:

When return_obj=True

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.export

Notes

  • Import can only be performed from hdf5 or parquet files written by pandas.

arkouda.export(read_path, dataset_name='ak_data', write_file=None, return_obj=True, index=False)[source]

Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas

Parameters:
  • read_path (str) – path to file where arkouda data is stored.

  • dataset_name (str) – name to store dataset under

  • index (bool) – Default False. When True, maintain the indexes loaded from the pandas file

  • write_file (str, optional) – path to file to write pandas formatted data to. Only write the file if this is set

  • return_obj (bool, optional) – Default True. When True return the Pandas DataFrame object, otherwise return None

Raises:

RuntimeError

  • Unsupported file type

Returns:

When return_obj=True

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.import_data

Notes

  • If an Arkouda file is exported for pandas, the format will not change. This means parquet files will remain parquet and hdf5 will remain hdf5.

  • Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.