arkouda.pandas.io

Input/output utilities for Arkouda.

The arkouda.pandas.io module provides an interface for reading from and writing to several file formats including HDF5, Parquet, CSV, and Zarr. It supports importing and exporting data between Arkouda and pandas, checkpointing data, and snapshot/restore functionality for Arkouda server state.

Core functionality includes

  • File format detection and dataset inspection

  • Reading and writing structured datasets using HDF5 and Parquet

  • CSV read/write support with header parsing

  • Zarr format support for chunked array storage

  • pandas interoperability via import_data and export

  • Checkpointing via save_checkpoint and load_checkpoint

  • Serialization and deserialization of Arkouda objects via snapshot and restore

  • Dataset tagging for provenance tracking during read operations

  • Transferring arrays and DataFrames between Arkouda server instances (receive and receive_dataframe)

Supported data types include pdarray, Strings, SegArray, Categorical, DataFrame, Index, and MultiIndex. Many operations also support compatibility with standard pandas file formats for interoperability.

Functions

File inspection

get_filetype, ls, ls_csv, get_datasets, get_columns

Data import/export

read_hdf, read_parquet, read_csv, read_zarr, read, to_hdf, to_parquet, to_csv, to_zarr, import_data, export

Snapshotting

snapshot, restore, save_checkpoint, load_checkpoint

Advanced features

update_hdf, load, load_all, read_tagged_data, receive, receive_dataframe

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.io import to_parquet, read_parquet
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), "output")
>>> Path(my_path).mkdir(parents=True, exist_ok=True)

Create and save a DataFrame:

>>> data = [ak.arange(10), ak.linspace(0, 1, 10)]
>>> Path(my_path + "/parquet_data").mkdir(parents=True, exist_ok=True)
>>> to_parquet(data, my_path + "/parquet_data/data.parquet")

Load the DataFrame back:

>>> data2 = read_parquet(my_path + "/parquet_data/data*")

Save to HDF5:

>>> ak.to_hdf(data, my_path + "/data.hdf5")

Read the HDF5 data back:

>>> data3 = ak.read_hdf(my_path + "/data*")

Export to pandas-compatible Parquet:

>>> df = ak.DataFrame({"a": ak.arange(10), "b": ak.linspace(0, 1, 10)})
>>> df2 = ak.export(my_path + "/parquet_data/data.parquet")

See also

arkouda.DataFrame, arkouda.pdarray, arkouda.strings.Strings, arkouda.segarray.SegArray, arkouda.categorical.Categorical, arkouda.index.Index, arkouda.index.MultiIndex

Functions

export(read_path[, dataset_name, write_file, ...])

Export data from arkouda to pandas.

get_columns(→ List[str])

Get a list of column names from CSV file(s).

get_datasets(→ List[str])

Get the names of the datasets in the provided files.

get_filetype(→ str)

Get the type of a file accessible to the server.

get_null_indices(...)

Get null indices of a string column in a Parquet file.

import_data(read_path[, write_file, return_obj, index])

Import data from a file saved by Pandas (HDF5/Parquet).

load(→ Union[Mapping[str, ...)

Load objects previously saved with pdarray.save().

load_all(→ Mapping[str, ...)

Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().

load_checkpoint(name[, path])

Load server's state.

ls(→ List[str])

List the contents of an HDF5 or Parquet file on the Arkouda server.

ls_csv(→ List[str])

List the datasets within a file when a CSV does not have a header.

read(→ Union[Mapping[str, ...)

Read datasets from files.

read_csv(→ Union[Mapping[str, ...)

Read CSV file(s) into Arkouda objects.

read_hdf(→ Union[Mapping[str, ...)

Read Arkouda objects from HDF5 files.

read_parquet(→ Union[Mapping[str, ...)

Read Arkouda objects from Parquet files.

read_tagged_data(filenames[, datasets, strictTypes, ...])

Read datasets from files and tag each record with the file it was read from.

read_zarr(store_path, ndim, dtype)

Read a Zarr store from disk into a pdarray.

receive(hostname, port)

Receive a pdarray sent by pdarray.transfer().

receive_dataframe(hostname, port)

Receive a DataFrame sent by dataframe.transfer().

restore(filename)

Return data saved using ak.snapshot.

save_checkpoint([name, path, mode])

Save the server's state.

snapshot(filename)

Create a snapshot of the current Arkouda namespace.

to_csv(columns, prefix_path[, names, col_delim, overwrite])

Write Arkouda object(s) to CSV file(s).

to_hdf(→ None)

Save multiple named pdarrays to HDF5 files.

to_parquet(→ None)

Save multiple named arrays to Parquet files.

to_zarr(store_path, arr, chunk_shape)

Write a pdarray to disk as a Zarr store.

update_hdf(columns, prefix_path[, names, repack])

Overwrite the datasets whose names appear in names, or in the keys of columns if columns is a dictionary.

Module Contents

arkouda.pandas.io.export(read_path: str, dataset_name: str = 'ak_data', write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]

Export data from arkouda to pandas.

Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas.

Parameters:
  • read_path (str) – path to file where arkouda data is stored.

  • dataset_name (str) – name to store dataset under

  • write_file (str) – path to file to write pandas formatted data to. Only write the file if this is set. Default is None.

  • return_obj (bool) – When True (default) return the Pandas DataFrame object, otherwise return None.

  • index (bool) – Default False. When True, maintain the indexes loaded from the pandas file

Raises:

RuntimeError

  • Unsupported file type

Returns:

When return_obj=True

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.read_parquet, pandas.read_hdf, ak.import_data

Notes

  • If an Arkouda file is exported for pandas, the format will not change. This means Parquet files will remain Parquet and HDF5 files will remain HDF5.

  • Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.

arkouda.pandas.io.get_columns(filenames: str | List[str], col_delim: str = ',', allow_errors: bool = False) List[str][source]

Get a list of column names from CSV file(s).

arkouda.pandas.io.get_datasets(filenames: str | List[str], allow_errors: bool = False, column_delim: str = ',', read_nested: bool = True) List[str][source]

Get the names of the datasets in the provided files.

Parameters:
  • filenames (str or List[str]) – Name of the file/s from which to return datasets

  • allow_errors (bool) – Default: False Whether or not to allow errors while accessing datasets

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.

Return type:

List[str] of names of the datasets

Raises:

RuntimeError

  • If no datasets are returned

Notes

  • This function currently supports HDF5 and Parquet formats.

  • Future updates to Parquet will deprecate this functionality on that format, but similar support will be added for Parquet at that time.

  • If a list of files is provided, only the datasets in the first file will be returned.

See also

ls

arkouda.pandas.io.get_filetype(filenames: str | List[str]) str[source]

Get the type of a file accessible to the server.

Supported file types and possible return strings are ‘HDF5’, ‘Parquet’, and ‘CSV’.

Parameters:

filenames (Union[str, List[str]]) – A file or list of files visible to the arkouda server

Returns:

Type of the file returned as a string, either ‘HDF5’, ‘Parquet’, or ‘CSV’

Return type:

str

Raises:

ValueError – Raised if filename is empty or contains only whitespace

Notes

  • When a list is provided, all files are assumed to be the same type

  • CSV Files without the Arkouda Header are not supported
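
For intuition about how a type string such as ‘HDF5’ or ‘Parquet’ can be derived from a file regardless of its name, the sketch below checks the well-known leading signatures of the two formats. It is an illustration only: sniff_bytes and sniff_file are hypothetical helpers that run on the client against local files, and nothing here describes how the Arkouda server actually performs detection. HDF5 files that begin with a user block place the signature at a later offset, which this sketch does not handle.

```python
# File signatures defined by the formats themselves, not by Arkouda.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of an HDF5 file with no user block
PARQUET_MAGIC = b"PAR1"            # first 4 bytes of a Parquet file


def sniff_bytes(head):
    """Hypothetical helper: classify a file from its leading bytes."""
    if head.startswith(HDF5_MAGIC):
        return "HDF5"
    if head.startswith(PARQUET_MAGIC):
        return "Parquet"
    return "unknown"


def sniff_file(path):
    """Read only the first 8 bytes of a local file and classify them."""
    with open(path, "rb") as fh:
        return sniff_bytes(fh.read(8))


print(sniff_bytes(HDF5_MAGIC + b"\x00" * 8))  # HDF5
print(sniff_bytes(b"PAR1\x15\x04\x15\x10"))   # Parquet
print(sniff_bytes(b"a,b,c\n1,2,3\n"))         # unknown
```

For files that live on the server, get_filetype remains the call to use; the sketch only shows why an extension is not needed to tell the formats apart.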

arkouda.pandas.io.get_null_indices(filenames: str | List[str], datasets: str | List[str] | None = None) arkouda.numpy.pdarrayclass.pdarray | Mapping[str, arkouda.numpy.pdarrayclass.pdarray][source]

Get null indices of a string column in a Parquet file.

Parameters:
  • filenames (list or str) – Either a list of filenames or shell expression

  • datasets (list or str or None) – (List of) name(s) of dataset(s) to read. Each dataset must be a string column. There is no default value for this function, the datasets to be read must be specified.

Returns:

Dictionary of {datasetName: pdarray}

Return type:

pdarray or Mapping[str, pdarray]

Raises:
  • RuntimeError – Raised if one or more of the specified files cannot be opened.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

get_datasets, ls

arkouda.pandas.io.import_data(read_path: str, write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]

Import data from a file saved by Pandas (HDF5/Parquet).

Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.

Parameters:
  • read_path (str) – path to file where pandas data is stored. This can be glob expression for parquet formats.

  • write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided

  • return_obj (bool) – If True (default), return the Arkouda DataFrame object. If False, return None.

  • index (bool) – If True, maintain the indexes loaded from the pandas file. Default is False.

Raises:
  • RuntimeWarning

    • Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.

  • RuntimeError

    • Unsupported file type

Returns:

When return_obj=True

Return type:

arkouda.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.read_parquet, pandas.read_hdf, ak.export

Notes

  • Import can only be performed from hdf5 or parquet files written by pandas.

arkouda.pandas.io.load(path_prefix: str, file_format: str = 'INFER', dataset: str = 'array', calc_string_offsets: bool = False, column_delim: str = ',') Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Load objects previously saved with pdarray.save().

Parameters:
  • path_prefix (str) – Filename prefix used when saving the original object.

  • file_format (str, default="INFER") – File format to load. One of "INFER", "HDF5", or "Parquet". If "INFER", the format will be detected automatically.

  • dataset (str, default="array") – Dataset name where the object was saved.

  • calc_string_offsets (bool, default=False) – If True, the server ignores the segmented Strings offsets array and derives offsets from null-byte terminators.

  • column_delim (str, default=",") – Column delimiter used if the dataset is CSV. Otherwise unused.

Returns:

DataFrame, IPv4, Datetime, Timedelta, Index]] Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical,

Raises:
  • TypeError – Raised if either path_prefix or dataset is not a str.

  • ValueError – Raised if an invalid file_format is given, if the dataset is not present in all HDF5 files, or if path_prefix does not correspond to files accessible to Arkouda.

  • RuntimeError – Raised if the HDF5 files are present but an error occurs while opening one or more of them.

Notes

If a previously saved Parquet file raises a FileNotFoundError, try loading it with .parquet appended to path_prefix. Older versions of Arkouda always stored Parquet files with a .parquet extension.

ak.load does not support loading a single file. To load a single HDF5 file without the _LOCALE#### suffix, use ak.read().

CSV files without the Arkouda header are not supported.

Examples

>>> import arkouda as ak

Loading from file without extension:

>>> obj = ak.load("path/prefix")

This loads the object from numLocales files with the name cwd/path/prefix_LOCALE####, where #### corresponds to each locale number. The file type is inferred automatically.

Loading with an extension (HDF5):

>>> obj = ak.load("path/prefix.test")

This loads the object from numLocales files with the name cwd/path/prefix_LOCALE####.test, where #### corresponds to each locale number. Because the file type is inferred, the extension does not need to correspond to a specific format.
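
The naming pattern described in these examples can be sketched in plain Python. The helper below is hypothetical and is not part of Arkouda; it assumes the locale number is zero-padded to four digits (as the #### placeholder suggests) and that the suffix is inserted before the extension, matching the two cases above.

```python
import os


def locale_filenames(path_prefix, num_locales):
    """Hypothetical helper: list the per-locale names implied by a prefix.

    Assumes a 4-digit, zero-padded locale number placed before the extension.
    """
    base, ext = os.path.splitext(path_prefix)
    return [f"{base}_LOCALE{i:04d}{ext}" for i in range(num_locales)]


print(locale_filenames("path/prefix", 2))
# ['path/prefix_LOCALE0000', 'path/prefix_LOCALE0001']
print(locale_filenames("path/prefix.test", 2))
# ['path/prefix_LOCALE0000.test', 'path/prefix_LOCALE0001.test']
```

Listing the expected names this way can help confirm that every per-locale file is present before calling ak.load.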

arkouda.pandas.io.load_all(path_prefix: str, file_format: str = 'INFER', column_delim: str = ',', read_nested: bool = True) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().

Parameters:
  • path_prefix (str) – Filename prefix used to save the original pdarray

  • file_format (str) – Format being loaded: ‘INFER’, ‘HDF5’, ‘Parquet’, or ‘CSV’. Defaults to ‘INFER’, in which case the format is detected automatically.

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Parquet files only

Returns:

Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]

Raises:
  • TypeError – Raised if path_prefix is not a str

  • ValueError – Raised if a file_format or extension other than HDF5 or Parquet is encountered, if not all datasets are present in all HDF5/Parquet files, or if path_prefix does not correspond to files accessible to Arkouda

  • RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them

See also

to_parquet, to_hdf, load, read

Notes

This function has been updated to determine the file extension based on the file format variable

This function will be deprecated when glob flags are added to read_* methods

CSV files without the Arkouda Header are not supported.

arkouda.pandas.io.load_checkpoint(name, path='.akdata')[source]

Load server’s state.

The server metadata must match the current configuration (e.g. same number of locales must be used).

Parameters:
  • name (str) – Name of the checkpoint. <path>/<name> must be a directory.

  • path (str) – The directory in which the checkpoint was saved.

Returns:

The checkpoint name, which will be the same as the name argument.

Return type:

str

Examples

>>> import arkouda as ak
>>> arr = ak.zeros(10, int)
>>> arr[2] = 2
>>> arr[2]
np.int64(2)
>>> cp_name = ak.save_checkpoint()
>>> arr[2] = 3
>>> arr[2]
np.int64(3)
>>> ak.load_checkpoint(cp_name)
>>> arr[2]
np.int64(3)

See also

save_checkpoint
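
Because <path>/<name> must be a directory, it can be useful to verify that before asking the server to load a checkpoint. The sketch below is a hypothetical client-side check, not an Arkouda function. It only makes sense when the client and the server share a filesystem, since checkpoints are written where the server can see them.

```python
from pathlib import Path


def checkpoint_dir(name, path=".akdata"):
    """Hypothetical helper: return <path>/<name> if it is a directory."""
    candidate = Path(path) / name
    if not candidate.is_dir():
        raise FileNotFoundError(f"no checkpoint directory at {candidate}")
    return candidate
```

A caller might run checkpoint_dir(cp_name) first and pass cp_name to ak.load_checkpoint only if no exception is raised.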

arkouda.pandas.io.ls(filename: str, col_delim: str = ',', read_nested: bool = True) List[str][source]

List the contents of an HDF5 or Parquet file on the Arkouda server.

This function invokes the HDF5 h5ls utility on a file visible to the Arkouda server, or simulates a similar listing for Parquet files. For CSV files without headers, see ls_csv.

Parameters:
  • filename (str) – Path to the file on the Arkouda server. Must be a non-empty string.

  • col_delim (str) – Delimiter to use when interpreting CSV files. Default is “,”.

  • read_nested (bool) – If True, include nested Parquet columns (e.g., SegArray). If False, nested columns are ignored. Only applies to Parquet files. Default is True.

Returns:

A list of lines describing each dataset or column in the file.

Return type:

List[str]

Raises:
  • TypeError – If filename is not a string.

  • ValueError – If filename is empty or contains only whitespace.

  • RuntimeError – If an error occurs when running h5ls or simulating the Parquet listing.

Notes

  • Parquet support is limited and may change in future releases.

  • Output lines mirror the format of the HDF5 h5ls output.

  • For CSV files lacking headers, use ls_csv.

See also

ls_csv

List the contents of CSV files without headers.

arkouda.pandas.io.ls_csv(filename: str, col_delim: str = ',') List[str][source]

List the datasets within a file when a CSV does not have a header.

Parameters:
  • filename (str) – The name of the file to pass to the server

  • col_delim (str) – The delimiter used to separate columns if the file is a csv

Returns:

The string output of the datasets from the server

Return type:

List[str]

See also

ls

arkouda.pandas.io.read(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, column_delim: str = ',', read_nested: bool = True, has_non_float_nulls: bool = False, fixed_len: int = -1) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read datasets from files.

The file type is determined automatically.

Parameters:
  • filenames (Union[str, List[str]]) – Either a list of filenames or a shell expression.

  • datasets (Optional[Union[str, List[str]]], default=None) – Name or list of names of datasets to read. If None, all available datasets are read.

  • iterative (bool, default=False) – If True, make iterative function calls to the server. If False, make a single function call to the server.

  • strictTypes (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool, default=False) – If True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files.

  • column_delim (str, default=",") – Column delimiter to use if the dataset is CSV. Otherwise unused.

  • read_nested (bool, default=True) – If True, SegArray objects are read from the file. If False, SegArray objects and other nested Parquet columns are ignored. Ignored if datasets is not None. Parquet only.

  • has_non_float_nulls (bool, default=False) – Must be set to True to read non-float Parquet columns that contain null values.

  • fixed_len (int, default=-1) – Fixed string length to use when reading Parquet string columns if the length of each string is known at runtime. This can avoid byte calculation and may improve performance.

Returns:

DataFrame, IPv4, Datetime, Timedelta, Index]] Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical,

Raises:

RuntimeError – Raised if an invalid file type is detected.

Notes

If filenames is a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded with glob to read all matching files.

If iterative=True, each dataset name and filename is passed to the server independently in sequence. If iterative=False, all dataset names and filenames are passed to the server in a single string.

If datasets is None, dataset names are inferred from the first file and all datasets are read. Use get_datasets to show the names of datasets in HDF5 or Parquet files.

CSV files without the Arkouda header are not supported.
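
The shell-expression behavior described in these notes can be pictured with Python's standard glob module. This is a local stand-in only: Arkouda expands the expression against files visible to the server, and the helper name expand and the file names below are made up for the illustration.

```python
import glob
import os
import tempfile


def expand(expression):
    """Hypothetical helper: expand a shell-style expression to sorted paths."""
    return sorted(glob.glob(expression))


with tempfile.TemporaryDirectory() as root:
    # Create two files that share a prefix and one that does not.
    for name in ("name_prefix_LOCALE0000", "name_prefix_LOCALE0001", "other"):
        open(os.path.join(root, name), "w").close()
    matches = expand(os.path.join(root, "name_prefix*"))
    print([os.path.basename(m) for m in matches])
    # ['name_prefix_LOCALE0000', 'name_prefix_LOCALE0001']
```

A single filename with no wildcard is also a valid expression and expands to itself when the file exists.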

Examples

>>> import arkouda as ak

Read a file with an extension:

>>> x = ak.read("path/name_prefix.h5")

The file type is determined from file contents, not the extension.

Read a Parquet file:

>>> x = ak.read("path/name_prefix.parquet")

Read files matching a glob expression:

>>> x = ak.read("path/name_prefix*")

arkouda.pandas.io.read_csv(filenames: str | List[str], datasets: str | List[str] | None = None, column_delim: str = ',', allow_errors: bool = False) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read CSV file(s) into Arkouda objects.

If more than one dataset is found, the objects will be returned in a dictionary mapping the dataset name to the Arkouda object containing the data. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings object.

Parameters:
  • filenames (str or List[str]) – The filenames to read data from

  • datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.

  • column_delim (str) – The delimiter for column names and data. Defaults to “,”.

  • allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

Returns:

Dictionary of {datasetName: pdarray, Strings, or SegArray}

Return type:

Dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda SegArrays.

Raises:
  • ValueError – Raised if not all datasets are present in all CSV files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

to_csv

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
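
Since column delimiters must not appear inside the data, a quick client-side scan before writing can catch problem rows early. The sketch below is a plain-Python illustration with a hypothetical helper name; it works on ordinary Python rows and does not call Arkouda.

```python
def rows_with_delimiter(rows, col_delim=","):
    """Hypothetical helper: indices of rows whose fields contain col_delim."""
    return [
        i
        for i, row in enumerate(rows)
        if any(col_delim in str(field) for field in row)
    ]


rows = [("alice", 1), ("bob, jr.", 2), ("carol", 3)]
print(rows_with_delimiter(rows))       # [1]
print(rows_with_delimiter(rows, "|"))  # []
```

When conflicts are found, one option is to choose a different delimiter and pass it consistently as col_delim when writing and column_delim when reading.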

arkouda.pandas.io.read_hdf(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, tag_data: bool = False) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read Arkouda objects from HDF5 files.

Parameters:
  • filenames (Union[str, List[str]]) – Filename or list of filenames to read objects from.

  • datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read from the provided files. If None, all datasets are read.

  • iterative (bool, default=False) – If True, make iterative function calls to the server. If False, make a single function call to the server.

  • strict_types (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool, default=False) – If True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files. In the future, this option may become the default.

  • tag_data (bool, default=False) – If True, tag the returned data with the code associated with the filename from which it was read.

Returns:

Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]

Raises:
  • ValueError – Raised if not all datasets are present in all HDF5 files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is True, this may also be raised if no values are returned from the server.

  • TypeError – Raised if an unknown Arkouda type is returned from the server.

Notes

If filenames is a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded with glob to read all matching files.

If iterative=True, each dataset name and filename is passed to the server independently in sequence. If iterative=False, all dataset names and filenames are passed to the server in a single string.

If datasets is None, dataset names are inferred from the first file and all datasets are read. Use get_datasets to show dataset names in HDF5 files.

See also

read_tagged_data

Examples

>>> import arkouda as ak

Read a file with an extension:

>>> x = ak.read_hdf("path/name_prefix.h5")

Read files matching a glob expression:

>>> x = ak.read_hdf("path/name_prefix*")

arkouda.pandas.io.read_parquet(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, tag_data: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False, null_handling: str | None = None, fixed_len: int = -1) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read Arkouda objects from Parquet files.

Parameters:
  • filenames (Union[str, List[str]]) – Filename or list of filenames to read objects from.

  • datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read from the provided files. If None, all datasets are read.

  • iterative (bool, default=False) – If True, make iterative function calls to the server. If False, make a single function call to the server.

  • strict_types (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • tag_data (bool, default=False) – If True, tag the data with the code associated with the filename from which the data was read.

  • read_nested (bool, default=True) – If True, SegArray objects are read from the file. If False, SegArray objects and other nested Parquet columns are ignored. If datasets is not None, this parameter is ignored.

  • has_non_float_nulls (bool, default=False) –

    Deprecated. Use null_handling instead.

    This flag must be set to True to read non-float Parquet columns that contain null values.

  • null_handling (Optional[str], default=None) –

    Null-handling mode. Supported values are "none", "only floats", and "all". If None, the default is "only floats".

    If "none", the data is assumed to contain no nulls. This gives the best performance, but behavior is undefined if nulls are present.

    If "only floats", only floating-point columns may contain nulls. This improves performance for other data types.

    If "all", any column may contain nulls. This is the most general mode, but it is slower overall.

  • fixed_len (int, default=-1) – Fixed string length to use when reading Parquet string columns if the length of each string is known at runtime. This can avoid byte calculation and may improve performance.

Returns:

DataFrame, IPv4, Datetime, Timedelta, Index]] Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical,

Raises:
  • ValueError – Raised if not all datasets are present in all Parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is True, this may also be raised if no values are returned from the server.

  • TypeError – Raised if an unknown Arkouda type is returned from the server.

Notes

If filenames is a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded with glob to read all matching files.

If iterative=True, each dataset name and filename is passed to the server independently in sequence. If iterative=False, all dataset names and filenames are passed to the server in a single string.

If datasets is None, dataset names are inferred from the first file and all datasets are read. Use get_datasets to show the names of datasets in Parquet files.

Parquet currently always recomputes offsets. This note should be updated when the Parquet workflow changes.
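
The relationship between null_handling and the deprecated has_non_float_nulls flag can be summarized with a small decision function. This is a hypothetical sketch, not Arkouda's code. The documented facts are the three mode names and the "only floats" default; treating has_non_float_nulls=True as a request for "all" when no mode is given is an assumption made for illustration.

```python
VALID_MODES = ("none", "only floats", "all")


def resolve_null_handling(null_handling=None, has_non_float_nulls=False):
    """Hypothetical helper: pick an effective null-handling mode.

    Assumption: the deprecated flag maps to "all" when no explicit
    mode is supplied; an explicit mode always wins.
    """
    if null_handling is None:
        return "all" if has_non_float_nulls else "only floats"
    if null_handling not in VALID_MODES:
        raise ValueError(f"null_handling must be one of {VALID_MODES}")
    return null_handling


print(resolve_null_handling())                          # only floats
print(resolve_null_handling(has_non_float_nulls=True))  # all
print(resolve_null_handling("none"))                    # none
```

In practice, "none" is only safe when the data is known to contain no nulls, since behavior is otherwise undefined.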

See also

read_tagged_data

Examples

>>> import arkouda as ak

Read a Parquet file:

>>> x = ak.read_parquet("path/name_prefix.parquet")

Read files matching a glob expression:

>>> x = ak.read_parquet("path/name_prefix*")

arkouda.pandas.io.read_tagged_data(filenames: str | List[str], datasets: str | List[str] | None = None, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False)[source]

Read datasets from files and tag each record with the file it was read from.

The file type is determined automatically.

Parameters:
  • filenames (Union[str, List[str]]) – Either a list of filenames or a shell expression.

  • datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read. If None, all available datasets are read.

  • strictTypes (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool, default=False) – If True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files. In the future, this option may become the default.

  • read_nested (bool, default=True) – If True, SegArray objects are read from the file. If False, SegArray objects and other nested Parquet columns are ignored. Ignored if datasets is not None. Parquet only.

  • has_non_float_nulls (bool, default=False) – Must be set to True to read non-float Parquet columns that contain null values.

Notes

This function is not currently supported for Categorical or GroupBy datasets.

Examples

>>> import arkouda as ak

Read files and return the data along with tagging information:

>>> data, cat = ak.read_tagged_data("path/name")

The codes in cat map each record in data to the file it came from. The returned data includes a "Filename_Codes" array.

>>> data
{"Filename_Codes": array([0 3 6 9 12]), "col_name": array([0 0 0 1])}
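
The relationship between per-record codes and source files can be sketched in plain Python. The values below are made up for illustration and are not Arkouda output; tag_records is a hypothetical helper that assumes one integer code per record, indexing into a sorted list of file names.

```python
def tag_records(records_per_file):
    """Hypothetical helper: concatenate per-file records and tag each with a file code."""
    filenames = sorted(records_per_file)
    values, codes = [], []
    for code, fname in enumerate(filenames):
        records = records_per_file[fname]
        values.extend(records)
        # Every record from this file receives the same code.
        codes.extend([code] * len(records))
    return filenames, values, codes


filenames, values, codes = tag_records({"file_a": [10, 11], "file_b": [20, 21, 22]})

# Recover the originating file of each record from its code.
origins = [filenames[c] for c in codes]
print(list(zip(values, origins)))
```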
arkouda.pandas.io.read_zarr(store_path: str, ndim: int, dtype)[source]

Read a Zarr store from disk into a pdarray.

Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.

Parameters:
  • store_path (str) – The path to the Zarr store. The path must be to a directory that contains a .zarray file containing the Zarr store metadata.

  • ndim (int) – The number of dimensions in the array.

  • dtype (str) – The data type of the array.

Returns:

The pdarray read from the Zarr store.

Return type:

pdarray
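
Because store_path must point to a directory containing a .zarray file, it can help to inspect that metadata before calling read_zarr. The following is a minimal sketch that assumes a Zarr v2 store, where .zarray is a JSON document with keys such as shape, chunks, and dtype. read_zarray_metadata is a hypothetical helper, and the metadata written here is made up for the demonstration.

```python
import json
import os
import tempfile


def read_zarray_metadata(store_path):
    """Hypothetical helper: load the .zarray metadata of a Zarr v2 store."""
    meta_path = os.path.join(store_path, ".zarray")
    if not os.path.isfile(meta_path):
        raise FileNotFoundError(f"{store_path} does not contain a .zarray file")
    with open(meta_path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as store:
    # Write a small, made-up metadata file for demonstration purposes.
    with open(os.path.join(store, ".zarray"), "w") as f:
        json.dump({"zarr_format": 2, "shape": [10, 7], "chunks": [4, 4], "dtype": "<i8"}, f)

    meta = read_zarray_metadata(store)
    ndim = len(meta["shape"])   # candidate value for the ndim argument
    zarr_dtype = meta["dtype"]  # Zarr-style dtype string

print(ndim, zarr_dtype)
```

The Zarr dtype string still has to be translated into whatever dtype value read_zarr expects; that mapping is not covered here.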

arkouda.pandas.io.receive(hostname: str, port)[source]

Receive a pdarray sent by pdarray.transfer().

Parameters:
  • hostname (str) – The hostname of the Arkouda server that sent the array.

  • port (int_scalars) – The port to send the array over. This must be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, using the range {port..(port+numLocales-1)}. For example, if the Arkouda server runs on 4 nodes and 1234 is passed as port, Arkouda uses ports 1234, 1235, 1236, and 1237 to send the array data. This port must match the port passed to the call to pdarray.transfer().

Returns:

The pdarray sent from the sending server to the current receiving server.

Return type:

pdarray

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
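
The port range described for the port parameter can be sketched as simple arithmetic. transfer_ports is a hypothetical helper written for illustration; it only mirrors the documented example of one consecutive port per locale and does not open any sockets.

```python
def transfer_ports(port, num_locales):
    """Hypothetical helper: list the consecutive ports used, one per locale."""
    if num_locales < 1:
        raise ValueError("num_locales must be at least 1")
    return list(range(port, port + num_locales))


# A 4-locale server with port=1234, as in the parameter description.
ports = transfer_ports(1234, 4)
print(ports)
```

A list like this can be used to confirm that every port in the range is free, and that none is the Arkouda server's own port, before starting a transfer.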

arkouda.pandas.io.receive_dataframe(hostname: str, port)[source]

Receive a DataFrame sent by DataFrame.transfer().

Parameters:
  • hostname (str) – The hostname of the Arkouda server that sent the DataFrame.

  • port (int_scalars) – The port to send the DataFrame over. This must be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, using the range {port..(port+numLocales-1)}. For example, if the Arkouda server runs on 4 nodes and 1234 is passed as port, Arkouda uses ports 1234, 1235, 1236, and 1237 to send the data. This port must match the port passed to the call to DataFrame.transfer().

Returns:

The dataframe sent from the sending server to the current receiving server.

Return type:

DataFrame

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

arkouda.pandas.io.restore(filename)[source]

Return data saved using ak.snapshot.

Parameters:

filename (str) – Name used to create snapshot to be read

Return type:

Dict

Notes

Unlike other save/load methods, snapshot and restore save DataFrames alongside other objects in HDF5. DataFrames are therefore returned within the dictionary as DataFrame objects.

arkouda.pandas.io.save_checkpoint(name='', path='.akdata', mode: Literal['overwrite', 'preserve_previous', 'error'] = 'overwrite')[source]

Save the server’s state.

Records some metadata about the server and saves all pdarrays into Parquet files.

Parameters:
  • name (str) – Name of the checkpoint. The default will be the server session ID, which is typically in format id_<hash>_. A directory will be created in path with this name.

  • path (str) – The directory to save the checkpoint. If the directory doesn’t exist, it will be created. If it exists, a new directory for the checkpoint instance will be created inside this directory.

  • mode ({'overwrite', 'preserve_previous', 'error'}) –

    How to handle an existing checkpoint with the same name.

    • 'overwrite' (default): overwrite the checkpoint files.

    • 'preserve_previous': rename the existing checkpoint to <name>.prev, overwriting that if it exists.

    • 'error': raise an error if the checkpoint exists.

Notes

Only pdarrays are saved. Other data structures will not be recorded. We expect to expand the coverage in the future.

Returns:

The checkpoint name, which will be the same as the name argument if it was passed.

Return type:

str

Examples

>>> import arkouda as ak
>>> arr = ak.zeros(10, int)
>>> arr[2] = 2
>>> arr[2]
np.int64(2)
>>> cp_name = ak.save_checkpoint()
>>> arr[2] = 3
>>> arr[2]
np.int64(3)
>>> ak.load_checkpoint(cp_name)
>>> arr[2]
np.int64(2)
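
The three mode values can be illustrated with a local emulation on an ordinary directory. This is a sketch of the documented semantics only, not Arkouda's implementation: prepare_checkpoint_dir is a hypothetical helper, and it emulates 'overwrite' by removing the existing directory before recreating it.

```python
import os
import shutil
import tempfile


def prepare_checkpoint_dir(path, name, mode="overwrite"):
    """Hypothetical local emulation of the three documented modes."""
    if mode not in ("overwrite", "preserve_previous", "error"):
        raise ValueError(f"unknown mode: {mode!r}")
    target = os.path.join(path, name)
    if os.path.isdir(target):
        if mode == "error":
            raise FileExistsError(f"checkpoint {name!r} already exists")
        if mode == "preserve_previous":
            previous = target + ".prev"
            # An older <name>.prev is replaced by the current checkpoint.
            shutil.rmtree(previous, ignore_errors=True)
            os.rename(target, previous)
        else:
            # 'overwrite': discard the existing checkpoint contents.
            shutil.rmtree(target)
    os.makedirs(target, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as root:
    first = prepare_checkpoint_dir(root, "cp")
    open(os.path.join(first, "old.txt"), "w").close()

    prepare_checkpoint_dir(root, "cp", mode="preserve_previous")
    entries = sorted(os.listdir(root))
    kept_old = os.path.isfile(os.path.join(root, "cp.prev", "old.txt"))

    try:
        prepare_checkpoint_dir(root, "cp", mode="error")
        raised = False
    except FileExistsError:
        raised = True

print(entries, kept_old, raised)
```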

See also

load_checkpoint

arkouda.pandas.io.snapshot(filename)[source]

Create a snapshot of the current Arkouda namespace.

All currently accessible variables containing Arkouda objects will be written to an HDF5 file.

Unlike other save/load functions, this maintains the integrity of DataFrames.

The current variable names are used as the dataset names when saving.

Parameters:

filename (str) – Name to use when storing file

See also

ak.restore

arkouda.pandas.io.to_csv(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings], prefix_path: str, names: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]

Write Arkouda object(s) to CSV file(s).

All CSV files written by Arkouda include a header denoting the data types of the columns.

Parameters:
  • columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None the keys of the mapping will be used as the dataset names.

  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • names (List[str], optional) – Names of the datasets to be written. The order should correspond to the order of the data provided in columns.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Raises:
  • ValueError – Raised if not all datasets are present in all CSV files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is True, this may also be raised if no values are returned from the server.

  • TypeError – Raised if an unknown Arkouda type is returned from the server.

See also

read_csv

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

  • Unlike other file formats, CSV files store Strings in UTF-8 format instead of storing bytes as uint(8).
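
Since the column delimiter must not appear in the data and rows are newline-delimited, a quick client-side check on sample values can catch problems before writing. The sketch below works on plain Python lists rather than Arkouda objects; find_delimiter_conflicts is a hypothetical helper and the sample data is made up.

```python
def find_delimiter_conflicts(columns, col_delim=","):
    """Hypothetical helper: report string values containing the delimiter or a newline."""
    conflicts = {}
    for name, values in columns.items():
        bad = [
            v for v in values
            if isinstance(v, str) and (col_delim in v or "\n" in v)
        ]
        if bad:
            conflicts[name] = bad
    return conflicts


columns = {"city": ["Oslo", "Paris, France"], "count": [1, 2]}

comma_conflicts = find_delimiter_conflicts(columns, col_delim=",")
pipe_conflicts = find_delimiter_conflicts(columns, col_delim="|")
print(comma_conflicts, pipe_conflicts)
```

If the default delimiter conflicts with the data, as in the first call, passing a different col_delim is one option.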

arkouda.pandas.io.to_hdf(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute') → None[source]

Save multiple named pdarrays to HDF5 files.

Parameters:
  • columns (dict or list of pdarrays) – Collection of arrays to save

  • prefix_path (str) – Directory and filename prefix for output files

  • names (list of str) – Dataset names for the pdarrays

  • mode ({"truncate", "append"}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files.

  • file_type ({"single", "distribute"}) – Default: "distribute". "single" writes the dataset to a single file; "distribute" writes the dataset to one file per locale.

Raises:
  • ValueError – Raised if (1) the lengths of columns and names differ or (2) the mode is not ‘truncate’ or ‘append’

  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at prefix_path will be overwritten, unless the user specifies the ‘append’ mode, in which case Arkouda will attempt to add columns as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
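
The naming rules above can be sketched in plain Python. resolve_dataset_names is a hypothetical helper, not part of Arkouda. Rendering the 0-up integers as strings is an assumption, and plain lists stand in for pdarrays.

```python
def resolve_dataset_names(columns, names=None):
    """Hypothetical helper: pick dataset names the way the notes describe."""
    if isinstance(columns, dict):
        data = list(columns.values())
        if names is None:
            # Dictionary keys become the dataset names.
            names = list(columns.keys())
    else:
        data = list(columns)
        if names is None:
            # Fall back to 0-up integers (rendered as strings here).
            names = [str(i) for i in range(len(data))]
    if len(names) != len(data):
        raise ValueError("the lengths of columns and names differ")
    return list(names), data


a, b = [0, 1, 2], [3, 4, 5]  # stand-ins for pdarrays

from_mapping, _ = resolve_dataset_names({"a": a, "b": b})
from_names, _ = resolve_dataset_names([a, b], names=["x", "y"])
from_default, _ = resolve_dataset_names([a, b])
print(from_mapping, from_names, from_default)
```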

Examples

>>> import arkouda as ak
>>> a = ak.arange(25)
>>> b = ak.arange(25)

Save with mapping defining dataset names

>>> ak.to_hdf({'a': a, 'b': b}, 'path/name_prefix')

Save using names instead of mapping

>>> ak.to_hdf([a, b], 'path/name_prefix', names=['a', 'b'])
arkouda.pandas.io.to_parquet(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: Literal['truncate', 'append'] = 'truncate', compression: str | None = None, convert_categoricals: bool = False) → None[source]

Save multiple named arrays to Parquet files.

Parameters:
  • columns (Union[Mapping[str, Union[pdarray, Strings, SegArray]], List[Union[pdarray, Strings, SegArray]]]) – Collection of arrays to save.

  • prefix_path (str) – Directory and filename prefix for the output files.

  • names (Optional[List[str]], default=None) – Dataset names for the arrays when columns is provided as a list.

  • mode (Literal["truncate", "append"], default="truncate") –

    If "truncate", overwrite any existing output files. If "append", attempt to create a new dataset in existing files.

    "append" is deprecated. Use the multi-column write instead.

  • compression (Optional[str], default=None) – Compression type to use when writing the file. Supported values include "snappy", "gzip", "brotli", "zstd", and "lz4".

  • convert_categoricals (bool, default=False) – Parquet requires all columns to have the same size, and Categorical objects do not satisfy that requirement. If set to True, write the equivalent Strings in place of any Categorical columns.

Raises:
  • ValueError – Raised if the lengths of columns and names differ, or if mode is not "truncate" or "append".

  • RuntimeError – Raised if a server-side error occurs while saving the arrays.

See also

to_hdf, load, load_all, read

Notes

Creates one file per locale containing that locale’s chunk of each array.

If columns is a dictionary, its keys are used as the Parquet column names. Otherwise, if no names are supplied, integer names starting at 0 are used.

By default, any existing files at prefix_path are deleted regardless of whether they would be overwritten. If mode="append", Arkouda attempts to add columns as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.

Examples

>>> import arkouda as ak
>>> a = ak.arange(25)
>>> b = ak.arange(25)

Save with a mapping defining dataset names:

>>> ak.to_parquet({"a": a, "b": b}, "path/name_prefix")

Save using names instead of a mapping:

>>> ak.to_parquet([a, b], "path/name_prefix", names=["a", "b"])
arkouda.pandas.io.to_zarr(store_path: str, arr: arkouda.numpy.pdarrayclass.pdarray, chunk_shape)[source]

Write a pdarray to disk as a Zarr store.

Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.

Parameters:
  • store_path (str) – The path at which the Zarr store should be written

  • arr (pdarray) – The pdarray to be written to disk

  • chunk_shape (tuple) – The shape of the chunks to be used in the Zarr store

Raises:

ValueError – Raised if the number of dimensions in the chunk shape does not match the number of dimensions in the array, or if the array is not a 32- or 64-bit numeric type
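
The dimension check on chunk_shape, and the way a chunk shape divides an array, can be sketched with a few lines of arithmetic. chunk_grid is a hypothetical helper for illustration; it does not write any Zarr data, and counting a partial edge chunk as a full chunk is an assumption based on how chunked storage commonly works.

```python
import math


def chunk_grid(array_shape, chunk_shape):
    """Hypothetical helper: number of chunks along each dimension."""
    if len(chunk_shape) != len(array_shape):
        raise ValueError(
            "chunk shape must have the same number of dimensions as the array"
        )
    # A partial chunk at the edge still counts as one chunk.
    return tuple(math.ceil(n / c) for n, c in zip(array_shape, chunk_shape))


grid = chunk_grid((10, 7), (4, 4))
total_chunks = math.prod(grid)
print(grid, total_chunks)
```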

arkouda.pandas.io.update_hdf(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, repack: bool = True)[source]

Overwrite the datasets whose names appear in names, or in the keys of columns if columns is a dictionary.

Parameters:
  • columns (dict or list of pdarrays) – Collection of arrays to save

  • prefix_path (str) – Directory and filename prefix for output files

  • names (list of str) – Dataset names for the pdarrays

  • repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains but is inaccessible. Setting this to False yields better performance but causes file sizes to grow.

Raises:

RuntimeError – Raised if a server-side error is thrown saving the datasets

Notes

  • If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.

  • If the datasets provided do not exist, they will be added.

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data.

  • This workflow is slightly different from to_hdf to prevent reading and creating a copy of the file for each dataset.
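
The file name fallback mentioned in the first note can be sketched with a regular expression. This is an illustration only: looks_distributed is a hypothetical helper, and reading _LOCALE#### as the literal text _LOCALE followed by four digits is an assumption.

```python
import re

# Assumed pattern: "_LOCALE" followed by exactly four digits.
_LOCALE_SUFFIX = re.compile(r"_LOCALE\d{4}(?!\d)")


def looks_distributed(filename):
    """Hypothetical helper: guess from the name whether a file is one of a per-locale set."""
    return _LOCALE_SUFFIX.search(filename) is not None


distributed = looks_distributed("path/name_prefix_LOCALE0003")
single = looks_distributed("path/name_prefix")
print(distributed, single)
```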