arkouda

Arkouda: Exploratory data science at scale.

Arkouda is a Python API for exploratory data analysis on massive datasets. It leverages a Chapel-based backend to enable high-performance computing on distributed systems, while exposing a familiar NumPy- and Pandas-like interface to Python users.

Key Features

  • pdarray and Strings types for working with large numeric and string arrays.

  • Categorical, Series, DataFrame, and Index for labeled data analysis.

  • High-performance GroupBy, reductions, and broadcasting operations.

  • Interoperability with NumPy and Pandas for ease of use.

  • A scalable architecture suitable for HPC and cloud environments.

Example:

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> b = a + 5
>>> print(b)
array([6 7 8])

For full documentation, visit: https://bears-r-us.github.io/arkouda/

Submodules

Attributes

Classes

ArkoudaArray

Arkouda-backed numeric/bool pandas ExtensionArray.

ArkoudaBigintDtype

Arkouda-backed arbitrary-precision integer dtype.

ArkoudaBoolDtype

Arkouda-backed boolean dtype.

ArkoudaCategorical

Arkouda-backed categorical pandas ExtensionArray.

ArkoudaCategoricalDtype

Arkouda-backed categorical dtype.

ArkoudaFloat64Dtype

Arkouda-backed 64-bit floating-point dtype.

ArkoudaInt64Dtype

Extension dtype for Arkouda-backed 64-bit integers.

ArkoudaStringArray

Arkouda-backed string pandas ExtensionArray.

ArkoudaStringDtype

Arkouda-backed string dtype.

ArkoudaUint64Dtype

Arkouda-backed unsigned 64-bit integer dtype.

ArkoudaUint8Dtype

Arkouda-backed unsigned 8-bit integer dtype.

CachedAccessor

Descriptor for caching namespace-based accessors.

Categorical

Represents an array of values belonging to named categories.

DataFrame

A DataFrame structure based on Arkouda arrays.

DataFrameGroupBy

A DataFrame that has been grouped by a subset of columns.

DatetimeAccessor

Accessor for datetime-like operations on Arkouda Series.

DiffAggregate

A column in a GroupBy that has been differenced.

Index

Sequence used for indexing and alignment.

LogLevel

Enum for defining valid log levels used by ArkoudaLogger.

MultiIndex

A multi-level, or hierarchical, index object for Arkouda DataFrames and Series.

Power_divergenceResult

The results of a power divergence statistical test.

Properties

Base class for accessor implementations in Arkouda.

Row

Dictionary-like representation of a single row in an Arkouda DataFrame.

Series

One-dimensional Arkouda array with axis labels.

StringAccessor

Accessor for string operations on Arkouda Series.

Functions

apply(→ arkouda.numpy.pdarrayclass.pdarray)

Apply a Python function to a pdarray.

assert_almost_equal(→ None)

Check that the left and right objects are approximately equal.

assert_almost_equivalent(→ None)

Check that two objects are approximately equal.

assert_arkouda_array_equal(→ None)

Check that two Arkouda arrays are equivalent. Supports pdarray, Strings, Categorical, and SegArray.

assert_arkouda_array_equivalent(→ None)

Check that two Arkouda-compatible arrays are equal.

assert_arkouda_pdarray_equal(→ None)

Check that two Arkouda pdarray objects are equivalent.

assert_arkouda_segarray_equal(→ None)

Check that two Arkouda SegArray objects are equivalent.

assert_arkouda_strings_equal(→ None)

Check that two ak.Strings arrays are equivalent.

assert_attr_equal(→ None)

Check that attributes are equal. Both objects must have the given attribute.

assert_categorical_equal(→ None)

Test that Categoricals are equivalent.

assert_class_equal(→ None)

Check classes are equal.

assert_contains_all(→ None)

Assert that a dictionary contains all the elements of an iterable.

assert_copy(→ None)

Check that the elements are equal, but not the same object.

assert_dict_equal(→ None)

Assert that two dictionaries are equal.

assert_equal(→ None)

Wrap tm.assert_*_equal to dispatch to the appropriate test function.

assert_equivalent(→ None)

Dispatch to the appropriate assertion function depending on object types.

assert_frame_equal(→ None)

Check that left and right DataFrame are equal.

assert_frame_equivalent(→ None)

Check that two DataFrames are equal.

assert_index_equal(→ None)

Check that left and right Index are equal.

assert_index_equivalent(→ None)

Check that two Index objects are equal.

assert_is_sorted(→ None)

Assert that the sequence is sorted.

assert_series_equal(→ None)

Check that left and right Series are equal.

assert_series_equivalent(→ None)

Check that two Series are equal.

chisquare(f_obs[, f_exp, ddof])

Compute the chi-square statistic and p-value.

compute_join_size(→ Tuple[int, int])

Compute the internal size of a hypothetical join between a and b.

date_operators(cls)

Add common datetime operation methods to a DatetimeAccessor class.

disable_verbose(→ None)

Disables verbose logging.

enable_verbose(→ None)

Enable verbose logging (DEBUG log level) for all ArkoudaLoggers.

export(read_path[, dataset_name, write_file, ...])

Export data from arkouda to pandas.

from_series(...)

Convert a pandas Series to an Arkouda pdarray or Strings.

gen_ranges(starts, ends[, stride, return_lengths])

Generate a segmented array of variable-length, contiguous ranges between pairs of start and end points.

get_columns(→ List[str])

Get a list of column names from CSV file(s).

get_datasets(→ List[str])

Get the names of the datasets in the provided files.

get_filetype(→ str)

Get the type of a file accessible to the server.

get_null_indices(...)

Get null indices of a string column in a Parquet file.

import_data(read_path[, write_file, return_obj, index])

Import data from a file saved by Pandas (HDF5/Parquet).

information(→ str)

Return a JSON formatted string containing information about the objects in names.

intersect(a, b[, positions, unique])

Find the intersection of two arkouda arrays.

intx(a, b)

Find all the rows that are in both dataframes.

invert_permutation(perm)

Find the inverse of a permutation array.

join_on_eq_with_dt(...)

Inner-join on equality between two integer arrays where the time-window predicate is also true.

list_registry([detailed])

Return a list containing the names of all registered objects.

list_symbol_table(→ List[str])

Return a list containing the names of all objects in the symbol table.

load(→ Union[Mapping[str, ...)

Load objects previously saved with pdarray.save().

load_all(→ Mapping[str, ...)

Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().

load_checkpoint(name[, path])

Load the server's state.

ls(→ List[str])

List the contents of an HDF5 or Parquet file on the Arkouda server.

ls_csv(→ List[str])

List the datasets within a file when a CSV does not have a header.

merge(→ DataFrame)

Merge Arkouda DataFrames with a database-style join.

power_divergence(f_obs[, f_exp, ddof, lambda_])

Compute the power divergence statistic and p-value.

pretty_print_information(→ None)

Print verbose information for each object in names in a human-readable format.

read(→ Union[Mapping[str, ...)

Read datasets from files.

read_csv(→ Union[Mapping[str, ...)

Read CSV file(s) into Arkouda objects.

read_hdf(→ Union[Mapping[str, ...)

Read Arkouda objects from HDF5 files.

read_parquet(→ Union[Mapping[str, ...)

Read Arkouda objects from Parquet files.

read_tagged_data(filenames[, datasets, strictTypes, ...])

Read datasets from files and tag each record with the file it was read from.

read_zarr(store_path, ndim, dtype)

Read a Zarr store from disk into a pdarray.

receive(hostname, port)

Receive a pdarray sent by pdarray.transfer().

receive_dataframe(hostname, port)

Receive a DataFrame sent by dataframe.transfer().

restore(filename)

Return data saved using ak.snapshot.

save_checkpoint([name, path, mode])

Save the server's state.

snapshot(filename)

Create a snapshot of the current Arkouda namespace.

string_operators(cls)

Add common string operation methods to a StringAccessor class.

to_csv(columns, prefix_path[, names, col_delim, overwrite])

Write Arkouda object(s) to CSV file(s).

to_hdf(→ None)

Save multiple named pdarrays to HDF5 files.

to_parquet(→ None)

Save multiple named arrays to Parquet files.

to_zarr(store_path, arr, chunk_shape)

Write a pdarray to disk as a Zarr store.

update_hdf(columns, prefix_path[, names, repack])

Overwrite the datasets whose names appear in names, or in the keys of columns when columns is a dictionary.

write_log(log_msg[, tag, log_lvl])

Allow the user to write custom logs.

xlogy(x, y)

Compute x * log(y).
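The last entry above, xlogy, evaluates x * log(y) elementwise. A client-side NumPy sketch of the same computation (the Arkouda version runs server-side; NumPy here is only for illustration):

```python
import numpy as np

# Client-side sketch of xlogy(x, y) = x * log(y), evaluated elementwise.
x = np.array([1.0, 2.0, 3.0])
y = np.array([np.e, np.e**2, 1.0])
print(x * np.log(y))  # [1. 4. 0.]
```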

Package Contents

arkouda.AllSymbols = '__AllSymbols__'
class arkouda.ArkoudaArray(data: arkouda.numpy.pdarrayclass.pdarray | numpy.ndarray | Sequence[Any] | ArkoudaArray, dtype: Any = None, copy: bool = False)[source]

Bases: arkouda.pandas.extension._arkouda_extension_array.ArkoudaExtensionArray, pandas.api.extensions.ExtensionArray

Arkouda-backed numeric/bool pandas ExtensionArray.

Wraps or converts supported inputs into an Arkouda pdarray to serve as the backing store. Ensures the underlying array is 1-D and lives on the Arkouda server.

Parameters:
  • data (pdarray | ndarray | Sequence[Any] | ArkoudaArray) –

    Input to wrap or convert.

    • If an Arkouda pdarray, it is used directly unless dtype is given or copy=True, in which case a new array is created via ak.array.

    • If a NumPy array, it is transferred to Arkouda via ak.array.

    • If a Python sequence, it is converted to NumPy then to Arkouda.

    • If another ArkoudaArray, its underlying pdarray is reused.

  • dtype (Any, optional) – Desired dtype to cast to (NumPy dtype or Arkouda dtype string). If omitted, dtype is inferred from data.

  • copy (bool) – If True, attempt to copy the underlying data when converting/wrapping. Default is False.

Raises:
  • TypeError – If data cannot be interpreted as an Arkouda array-like object.

  • ValueError – If the resulting array is not one-dimensional.

default_fill_value

Sentinel used when filling missing values (default: -1).

Type:

int

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaArray
>>> ArkoudaArray(ak.arange(5))
ArkoudaArray([0 1 2 3 4])
>>> ArkoudaArray([10, 20, 30])
ArkoudaArray([10 20 30])
all(axis=0, skipna=True, **kwargs)[source]

Return whether all elements are True.

This is mainly to support pandas’ BaseExtensionArray.equals, which calls .all() on the result of a boolean expression.

any(axis=0, skipna=True, **kwargs)[source]

Return whether any element is True.

Added for symmetry with .all() and to support potential pandas boolean-reduction calls.
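Since pandas' BaseExtensionArray.equals reduces an elementwise comparison with .all(), the contract these reductions support can be illustrated with a plain pandas array (no Arkouda server required; pd.array stands in for an ArkoudaArray here):

```python
import pandas as pd

# BaseExtensionArray.equals reduces an elementwise comparison with .all();
# .any() is the symmetric reduction. A plain pandas array stands in for
# an ArkoudaArray here, since the reduction contract is the same.
arr1 = pd.array([1, 2, 3])
arr2 = pd.array([1, 2, 3])
print((arr1 == arr2).all())                 # True: every element compares equal
print((arr1 == pd.array([1, 2, 4])).any())  # True: at least one element matches
```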

astype(dtype: numpy.dtype[Any], copy: bool = True) numpy.typing.NDArray[Any][source]
astype(dtype: pandas.core.dtypes.dtypes.ExtensionDtype, copy: bool = True) pandas.api.extensions.ExtensionArray
astype(dtype: Any, copy: bool = True) pandas.api.extensions.ExtensionArray | numpy.typing.NDArray[Any]

Cast the array to a specified dtype.

Casting rules:

  • If dtype requests object, returns a NumPy NDArray[Any] of dtype object containing the array values.

  • Otherwise, the target dtype is normalized using Arkouda’s dtype resolution rules.

  • If the normalized dtype matches the current dtype and copy=False, returns self.

  • In all other cases, casts the underlying Arkouda array to the target dtype and returns an Arkouda-backed ArkoudaExtensionArray.

Parameters:
  • dtype (Any) – Target dtype. May be a NumPy dtype, pandas dtype, Arkouda dtype, or any dtype-like object accepted by Arkouda.

  • copy (bool) – Whether to force a copy when the target dtype matches the current dtype. Default is True.

Returns:

The cast result. Returns a NumPy array only when casting to object; otherwise returns an Arkouda-backed ExtensionArray.

Return type:

Union[ExtensionArray, NDArray[Any]]

Examples

Basic numeric casting returns an Arkouda-backed array:

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaArray
>>> a = ArkoudaArray(ak.array([1, 2, 3], dtype="int64"))
>>> a.astype("float64").to_ndarray()
array([1., 2., 3.])

Casting to the same dtype with copy=False returns the original object:

>>> b = a.astype("int64", copy=False)
>>> b is a
True

Forcing a copy when the dtype is unchanged returns a new array:

>>> c = a.astype("int64", copy=True)
>>> c is a
False
>>> c.to_ndarray()
array([1, 2, 3])

Casting to object materializes the data to a NumPy array:

>>> a.astype(object)
array([1, 2, 3], dtype=object)

NumPy and pandas dtype objects are also accepted:

>>> import numpy as np
>>> a.astype(np.dtype("bool")).to_ndarray()
array([ True,  True,  True])
default_fill_value: int = -1
property dtype

An instance of ExtensionDtype.

See also

api.extensions.ExtensionDtype

Base class for extension dtypes.

api.extensions.ExtensionArray

Base class for extension array types.

api.extensions.ExtensionArray.dtype

The dtype of an ExtensionArray.

Series.dtype

The dtype of a Series.

DataFrame.dtype

The dtype of a DataFrame.

Examples

>>> pd.array([1, 2, 3]).dtype
Int64Dtype()
equals(other)[source]

Return if another array is equivalent to this array.

Equivalent means that both arrays have the same shape and dtype, and all values compare equal. Missing values in the same location are considered equal (in contrast with normal equality).

Parameters:

other (ExtensionArray) – Array to compare to this Array.

Returns:

Whether the arrays are equivalent.

Return type:

boolean

See also

numpy.array_equal

Equivalent method for numpy array.

Series.equals

Equivalent method for Series.

DataFrame.equals

Equivalent method for DataFrame.

Examples

>>> arr1 = pd.array([1, 2, np.nan])
>>> arr2 = pd.array([1, 2, np.nan])
>>> arr1.equals(arr2)
True
>>> arr1 = pd.array([1, 3, np.nan])
>>> arr2 = pd.array([1, 2, np.nan])
>>> arr1.equals(arr2)
False
isna() numpy.ndarray[source]

Return a boolean mask indicating missing values.

This method implements the pandas ExtensionArray.isna contract and always returns a NumPy ndarray of dtype bool with the same length as the array.

Returns:

A boolean mask where True marks elements considered missing.

Return type:

np.ndarray

Raises:

TypeError – If the underlying data buffer does not support missing-value detection or cannot produce a boolean mask.

isnull()[source]

Alias for isna().
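The isna contract can be sketched with a plain pandas masked array, which obeys the same rule of returning a NumPy bool mask of matching length (no Arkouda server required):

```python
import numpy as np
import pandas as pd

# ExtensionArray.isna must return a NumPy bool mask with the same length
# as the array; ArkoudaArray follows the same contract. Illustrated with
# a plain pandas floating array (NaN is treated as missing).
arr = pd.array([1.0, float("nan"), 3.0])
mask = np.asarray(arr.isna())
print(mask)        # [False  True False]
print(mask.dtype)  # bool
```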

property nbytes

The number of bytes needed to store this object in memory.

See also

ExtensionArray.shape

Return a tuple of the array dimensions.

ExtensionArray.size

The number of elements in the array.

Examples

>>> pd.array([1, 2, 3]).nbytes
27
value_counts(dropna: bool = True) pandas.Series[source]

Return counts of unique values as a pandas Series.

This method computes the frequency of each distinct value in the underlying Arkouda array and returns the result as a pandas Series, with the unique values as the index and their counts as the data.

Parameters:

dropna (bool) – Whether to exclude missing values. Currently, missing-value handling is supported only for floating-point data, where NaN values are treated as missing. Default is True.

Returns:

A Series containing the counts of unique values. The index is an ArkoudaArray of unique values, and the values are an ArkoudaArray of counts.

Return type:

pd.Series

Notes

  • Only dropna=True is supported.

  • The following pandas options are not yet implemented: normalize, sort, and bins.

  • Counting is performed server-side in Arkouda; only the small result (unique values and counts) is materialized on the client.

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaArray
>>>
>>> a = ArkoudaArray(ak.array([1, 2, 1, 3, 2, 1]))
>>> a.value_counts()
1    3
2    2
3    1
dtype: int64

Floating-point data with NaN values:

>>> b = ArkoudaArray(ak.array([1.0, 2.0, float("nan"), 1.0]))
>>> b.value_counts()
1.0    2
2.0    1
dtype: int64
arkouda.ArkoudaArrayLike
class arkouda.ArkoudaBigintDtype[source]

Bases: _ArkoudaBaseDtype

Arkouda-backed arbitrary-precision integer dtype.

This dtype integrates Arkouda’s server-backed pdarray<bigint> with the pandas ExtensionArray interface via ArkoudaArray. It enables pandas objects (Series, DataFrame) to hold and operate on very large integers that exceed 64-bit precision, while keeping the data distributed on the Arkouda server.

classmethod construct_array_type()[source]

Return the ExtensionArray subclass that handles storage for this dtype.

Returns:

The ArkoudaArray class associated with this dtype.

Return type:

type

kind = 'O'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = -1

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'bigint'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
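The practical difference from int64 can be seen with plain Python ints, which (like pdarray<bigint> on the server) carry arbitrary precision:

```python
# int64 arithmetic wraps past 2**63 - 1; Python ints, like Arkouda's
# server-side bigint arrays, carry arbitrary precision instead.
INT64_MAX = 2**63 - 1
print(INT64_MAX + 1)          # 9223372036854775808 (no overflow)
print((2**100 + 1) - 2**100)  # 1: exact arithmetic at any magnitude
```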

class arkouda.ArkoudaBoolDtype[source]

Bases: _ArkoudaBaseDtype

Arkouda-backed boolean dtype.

This dtype integrates Arkouda’s server-backed pdarray<bool> with the pandas ExtensionArray interface via ArkoudaArray. It allows pandas objects (Series, DataFrame) to store and manipulate distributed boolean arrays without materializing them on the client.

classmethod construct_array_type()[source]

Return the ExtensionArray subclass that handles storage for this dtype.

Returns:

The ArkoudaArray class associated with this dtype.

Return type:

type

kind = 'b'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = False

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'bool_'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

class arkouda.ArkoudaCategorical(data: arkouda.pandas.categorical.Categorical | ArkoudaCategorical | numpy.ndarray | Sequence[Any])[source]

Bases: arkouda.pandas.extension._arkouda_extension_array.ArkoudaExtensionArray, pandas.api.extensions.ExtensionArray

Arkouda-backed categorical pandas ExtensionArray.

Ensures the underlying data is an Arkouda Categorical. Accepts an existing Categorical or converts from Python/NumPy sequences of labels.

Parameters:

data (Categorical | ArkoudaCategorical | ndarray | Sequence[Any]) – Input to wrap or convert.

  • If a Categorical, it is used directly.

  • If another ArkoudaCategorical, its backing object is reused.

  • If a list/tuple/ndarray, it is converted via ak.Categorical(ak.array(data)).

Raises:

TypeError – If data cannot be converted to Arkouda Categorical.

default_fill_value

Sentinel used when filling missing values (default: “”).

Type:

str
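For intuition, the factorization that ak.Categorical performs server-side (labels into integer codes plus a set of unique categories) has a direct client-side pandas analogue:

```python
import pandas as pd

# Client-side analogue of the server-side ak.Categorical factorization:
# labels become integer codes plus a (typically small) category set.
labels = ["a", "b", "a", "c", "b", "a"]
cat = pd.Categorical(labels)
print(list(cat.categories))  # ['a', 'b', 'c']
print(cat.codes.tolist())    # [0, 1, 0, 2, 1, 0]
```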

add_categories(*args, **kwargs)[source]
as_ordered(*args, **kwargs)[source]
as_unordered(*args, **kwargs)[source]
astype(dtype: numpy.dtype[Any], copy: bool = True) numpy.typing.NDArray[Any][source]
astype(dtype: pandas.core.dtypes.dtypes.ExtensionDtype, copy: bool = True) pandas.api.extensions.ExtensionArray
astype(dtype: Any, copy: bool = True) pandas.api.extensions.ExtensionArray | numpy.typing.NDArray[Any]

Cast to a specified dtype.

  • If dtype is categorical (pandas category / CategoricalDtype / ArkoudaCategoricalDtype), returns an Arkouda-backed ArkoudaCategorical (optionally copied).

  • If dtype requests object, returns a NumPy ndarray of dtype object containing the category labels (materialized to the client).

  • If dtype requests a string dtype, returns an Arkouda-backed ArkoudaStringArray containing the labels as strings.

  • Otherwise, casts the labels (as strings) to the requested dtype and returns an Arkouda-backed ExtensionArray.

Parameters:
  • dtype (Any) – Target dtype.

  • copy (bool) – Whether to force a copy when possible. If categorical-to-categorical and copy=True, attempts to copy the underlying Arkouda Categorical (if supported). Default is True.

Returns:

The cast result. Returns a NumPy array only when casting to object; otherwise returns an Arkouda-backed ExtensionArray.

Return type:

Union[ExtensionArray, NDArray[Any]]

Examples

Casting to category returns an Arkouda-backed categorical array:

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaCategorical
>>> c = ArkoudaCategorical(ak.Categorical(ak.array(["x", "y", "x"])))
>>> out = c.astype("category")
>>> out is c
False

Forcing a copy when casting to the same categorical dtype returns a new array:

>>> out2 = c.astype("category", copy=True)
>>> out2 is c
False
>>> out2.to_ndarray()
array(['x', 'y', 'x'], dtype='<U...')

Casting to object materializes the category labels to a NumPy object array:

>>> c.astype(object)
array(['x', 'y', 'x'], dtype=object)

Casting to a string dtype returns an Arkouda-backed string array of labels:

>>> s = c.astype("string")
>>> s.to_ndarray()
array(['x', 'y', 'x'], dtype='<U1')

Casting to another dtype casts the labels-as-strings and returns an Arkouda-backed array:

>>> c_num = ArkoudaCategorical(ak.Categorical(ak.array(["1", "2", "3"])))
>>> a = c_num.astype("int64")
>>> a.to_ndarray()
array([1, 2, 3])
check_for_ordered(*args, **kwargs)[source]
default_fill_value: str = ''
describe(*args, **kwargs)[source]
property dtype

An instance of ExtensionDtype.

See also

api.extensions.ExtensionDtype

Base class for extension dtypes.

api.extensions.ExtensionArray

Base class for extension array types.

api.extensions.ExtensionArray.dtype

The dtype of an ExtensionArray.

Series.dtype

The dtype of a Series.

DataFrame.dtype

The dtype of a DataFrame.

Examples

>>> pd.array([1, 2, 3]).dtype
Int64Dtype()
classmethod from_codes(*args, **kwargs)[source]
isna() numpy.ndarray[source]

Return a boolean mask indicating missing values.

This method implements the pandas ExtensionArray.isna contract and returns a NumPy ndarray of dtype bool with the same length as this categorical array.

Returns:

A boolean mask where True indicates a missing value.

Return type:

np.ndarray

Raises:

TypeError – If the underlying categorical cannot expose its codes or if missing-value detection is unsupported.

isnull()[source]

Alias for isna().

max(*args, **kwargs)[source]
memory_usage(*args, **kwargs)[source]
min(*args, **kwargs)[source]
notna(*args, **kwargs)[source]
notnull(*args, **kwargs)[source]
remove_categories(*args, **kwargs)[source]
remove_unused_categories(*args, **kwargs)[source]
rename_categories(*args, **kwargs)[source]
reorder_categories(*args, **kwargs)[source]
set_categories(*args, **kwargs)[source]
set_ordered(*args, **kwargs)[source]
sort_values(*args, **kwargs)[source]
to_list(*args, **kwargs)[source]
value_counts(dropna: bool = True) pandas.Series[source]

Return counts of categories as a pandas Series.

This method computes category frequencies from the underlying Arkouda Categorical and returns them as a pandas Series, where the index contains the category labels and the values contain the corresponding counts.

Parameters:

dropna (bool) – Whether to drop missing values from the result. When True, the result is filtered using the categorical’s na_value. When False, all categories returned by the underlying computation are included. Default is True.

Returns:

A Series containing category counts. The index is an ArkoudaStringArray of category labels and the values are an ArkoudaArray of counts.

Return type:

pd.Series

Notes

  • The result is computed server-side in Arkouda; only the (typically small) output of categories and counts is materialized for the pandas Series.

  • This method does not yet support pandas options such as normalize, sort, or bins.

  • The handling of missing values depends on the Arkouda Categorical definition of na_value.

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaCategorical
>>>
>>> a = ArkoudaCategorical(["a", "b", "a", "c", "b", "a"])
>>> a.value_counts()
a    3
b    2
c    1
dtype: int64
class arkouda.ArkoudaCategoricalDtype[source]

Bases: _ArkoudaBaseDtype

Arkouda-backed categorical dtype.

This dtype integrates Arkouda’s distributed Categorical type with the pandas ExtensionArray interface via ArkoudaCategorical. It enables pandas objects (Series, DataFrame) to hold categorical data stored and processed on the Arkouda server, while exposing familiar pandas APIs.

classmethod construct_array_type()[source]

Return the ExtensionArray subclass that handles storage for this dtype.

Returns:

The ArkoudaCategorical class associated with this dtype.

Return type:

type

kind = 'O'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = -1

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'category'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

class arkouda.ArkoudaFloat64Dtype[source]

Bases: _ArkoudaBaseDtype

Arkouda-backed 64-bit floating-point dtype.

This dtype integrates Arkouda’s server-backed pdarray<float64> with the pandas ExtensionArray interface via ArkoudaArray. It allows pandas objects (Series, DataFrame) to store and manipulate large distributed float64 arrays without materializing them on the client.

classmethod construct_array_type()[source]

Return the ExtensionArray subclass that handles storage for this dtype.

Returns:

The ArkoudaArray class associated with this dtype.

Return type:

type

kind = 'f'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'float64'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

class arkouda.ArkoudaInt64Dtype[source]

Bases: _ArkoudaBaseDtype

Extension dtype for Arkouda-backed 64-bit integers.

This dtype allows seamless use of Arkouda’s distributed int64 arrays inside pandas objects (Series, Index, DataFrame). It is backed by arkouda.pdarray with dtype='int64' and integrates with pandas via the ArkoudaArray extension array.

classmethod construct_array_type()[source]

Return the associated pandas ExtensionArray type.

This is part of the pandas ExtensionDtype interface and is used internally by pandas when constructing arrays of this dtype. It ensures that operations like Series(..., dtype=ArkoudaInt64Dtype()) produce the correct Arkouda-backed extension array.

Returns:

The ArkoudaArray class that implements the storage and behavior for this dtype.

Return type:

type

Notes

  • This hook tells pandas which ExtensionArray to instantiate whenever this dtype is requested.

  • All Arkouda dtypes defined in this module will return ArkoudaArray (or a subclass thereof).

Examples

>>> from arkouda.pandas.extension import ArkoudaInt64Dtype
>>> ArkoudaInt64Dtype.construct_array_type()
<class 'arkouda.pandas.extension._arkouda_array.ArkoudaArray'>
kind = 'i'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = -1

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'int64'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

class arkouda.ArkoudaStringArray(data: arkouda.numpy.strings.Strings | numpy.ndarray | Sequence[Any] | ArkoudaStringArray)[source]

Bases: arkouda.pandas.extension._arkouda_extension_array.ArkoudaExtensionArray, pandas.api.extensions.ExtensionArray

Arkouda-backed string pandas ExtensionArray.

Ensures the underlying data is an Arkouda Strings object. Accepts existing Strings or converts from NumPy arrays and Python sequences of strings.

Parameters:

data (Strings | ndarray | Sequence[Any] | ArkoudaStringArray) – Input to wrap or convert.

  • If a Strings object, it is used directly.

  • If a NumPy array or Python sequence, it is converted via ak.array.

  • If another ArkoudaStringArray, its backing Strings is reused.

Raises:

TypeError – If data cannot be converted to Arkouda Strings.

default_fill_value

Sentinel used when filling missing values (default: “”).

Type:

str
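The wrap/convert behavior mirrors pandas' own string extension array; a client-side sketch with pd.array (standing in for the server-side conversion to ak.Strings):

```python
import numpy as np
import pandas as pd

# Client-side sketch of the wrap/convert path: a NumPy array or Python
# sequence of strings becomes a pandas string ExtensionArray, much as
# ArkoudaStringArray converts inputs to server-side ak.Strings.
arr = pd.array(np.array(["a", "b", "c"]), dtype="string")
print(arr.dtype)  # string
print(list(arr))  # ['a', 'b', 'c']
```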

all(*args, **kwargs)[source]
any(*args, **kwargs)[source]
argpartition(*args, **kwargs)[source]
astype(dtype: numpy.dtype[Any], copy: bool = True) numpy.typing.NDArray[Any][source]
astype(dtype: pandas.core.dtypes.dtypes.ExtensionDtype, copy: bool = True) pandas.api.extensions.ExtensionArray
astype(dtype: Any, copy: bool = True) pandas.api.extensions.ExtensionArray | numpy.typing.NDArray[Any]

Cast to a specified dtype.

Casting rules:

  • If dtype requests object, returns a NumPy NDArray[Any] of dtype object containing the string values.

  • If dtype is a string dtype (e.g. pandas StringDtype, NumPy unicode, or Arkouda string dtype), returns an ArkoudaStringArray. If copy=True, attempts to copy the underlying Arkouda Strings data.

  • For all other dtypes, casts the underlying Arkouda Strings using Strings.astype and returns an Arkouda-backed ArkoudaExtensionArray constructed from the result.

Parameters:
  • dtype (Any) – Target dtype. May be a NumPy dtype, pandas dtype, or Arkouda dtype.

  • copy (bool) – Whether to force a copy when the result is an ArkoudaStringArray. Default is True.

Returns:

The cast result. Returns a NumPy array only when casting to object; otherwise returns an Arkouda-backed ExtensionArray.

Return type:

Union[ExtensionArray, NDArray[Any]]

Examples

Casting to a string dtype returns an Arkouda-backed string array:

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaStringArray
>>> s = ArkoudaStringArray(ak.array(["a", "b", "c"]))
>>> out = s.astype("string")
>>> out is s
False

Forcing a copy when casting to a string dtype returns a new array:

>>> out2 = s.astype("string", copy=True)
>>> out2 is s
False
>>> out2.to_ndarray()
array(['a', 'b', 'c'], dtype='<U1')

Casting to object materializes the data to a NumPy array:

>>> s.astype(object)
array(['a', 'b', 'c'], dtype=object)

Casting to a non-string dtype uses Arkouda to cast the underlying strings and returns an Arkouda-backed ExtensionArray:

>>> s_num = ArkoudaStringArray(ak.array(["1", "2", "3"]))
>>> a = s_num.astype("int64")
>>> a.to_ndarray()
array([1, 2, 3])

NumPy and pandas dtype objects are also accepted:

>>> import numpy as np
>>> a = s_num.astype(np.dtype("float64"))
>>> a.to_ndarray()
array([1., 2., 3.])
byteswap(*args, **kwargs)[source]
choose(*args, **kwargs)[source]
clip(*args, **kwargs)[source]
compress(*args, **kwargs)[source]
conj(*args, **kwargs)[source]
conjugate(*args, **kwargs)[source]
cumprod(*args, **kwargs)[source]
cumsum(*args, **kwargs)[source]
default_fill_value: str = ''
diagonal(*args, **kwargs)[source]
dot(*args, **kwargs)[source]
property dtype

An instance of ExtensionDtype.

See also

api.extensions.ExtensionDtype

Base class for extension dtypes.

api.extensions.ExtensionArray

Base class for extension array types.

api.extensions.ExtensionArray.dtype

The dtype of an ExtensionArray.

Series.dtype

The dtype of a Series.

DataFrame.dtype

The dtype of a DataFrame.

Examples

>>> pd.array([1, 2, 3]).dtype
Int64Dtype()
dump(*args, **kwargs)[source]
dumps(*args, **kwargs)[source]
fill(*args, **kwargs)[source]
flatten(*args, **kwargs)[source]
getfield(*args, **kwargs)[source]
isna()[source]

A 1-D array indicating if each value is missing.

Returns:

In most cases, this should return a NumPy ndarray. For exceptional cases like SparseArray, where returning an ndarray would be expensive, an ExtensionArray may be returned.

Return type:

numpy.ndarray or pandas.api.extensions.ExtensionArray

See also

ExtensionArray.dropna

Return ExtensionArray without NA values.

ExtensionArray.fillna

Fill NA/NaN values using the specified method.

Notes

If returning an ExtensionArray, then

  • na_values._is_boolean should be True

  • na_values should implement ExtensionArray._reduce()

  • na_values should implement ExtensionArray._accumulate()

  • na_values.any and na_values.all should be implemented

Examples

>>> arr = pd.array([1, 2, np.nan, np.nan])
>>> arr.isna()
array([False, False,  True,  True])
item(*args, **kwargs)[source]

Return the array element at the specified position as a Python scalar.

Parameters:

index (int, optional) – Position of the element. If not provided, the array must contain exactly one element.

Returns:

The element at the specified position.

Return type:

scalar

Raises:
  • ValueError – If no index is provided and the array does not have exactly one element.

  • IndexError – If the specified position is out of bounds.

See also

numpy.ndarray.item

Return the item of an array as a scalar.

Examples

>>> arr = pd.array([1], dtype="Int64")
>>> arr.item()
np.int64(1)
>>> arr = pd.array([1, 2, 3], dtype="Int64")
>>> arr.item(0)
np.int64(1)
>>> arr.item(2)
np.int64(3)
max(*args, **kwargs)[source]
mean(*args, **kwargs)[source]
min(*args, **kwargs)[source]
nonzero(*args, **kwargs)[source]
partition(*args, **kwargs)[source]
prod(*args, **kwargs)[source]
put(*args, **kwargs)[source]
resize(*args, **kwargs)[source]
round(*args, **kwargs)[source]
setfield(*args, **kwargs)[source]
setflags(*args, **kwargs)[source]
sort(*args, **kwargs)[source]
std(*args, **kwargs)[source]
sum(*args, **kwargs)[source]
swapaxes(*args, **kwargs)[source]
to_device(*args, **kwargs)[source]
tobytes(*args, **kwargs)[source]
tofile(*args, **kwargs)[source]
trace(*args, **kwargs)[source]
value_counts(dropna: bool = True) pandas.Series[source]

Return counts of unique strings as a pandas Series.

This method computes the frequency of each distinct string value in the underlying Arkouda Strings object and returns the result as a pandas Series, with the unique string values as the index and their counts as the data.

Parameters:

dropna (bool) – Whether to exclude missing values. Missing-value handling for Arkouda string arrays is not yet implemented, so this parameter is accepted for pandas compatibility but currently has no effect. Default is True.

Returns:

A Series containing the counts of unique string values. The index is an ArkoudaStringArray of unique values, and the values are an ArkoudaArray of counts.

Return type:

pd.Series

Notes

  • The following pandas options are not yet implemented: normalize, sort, and bins.

  • Counting is performed server-side in Arkouda; only the small result (unique values and counts) is materialized on the client.

Examples

Basic usage:

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaStringArray
>>>
>>> s = ArkoudaStringArray(["red", "blue", "red", "green", "blue", "red"])
>>> s.value_counts()
red      3
blue     2
green    1
dtype: int64

Empty input:

>>> empty = ArkoudaStringArray([])
>>> empty.value_counts()
Series([], dtype: int64)
var(*args, **kwargs)[source]
class arkouda.ArkoudaStringDtype[source]

Bases: _ArkoudaBaseDtype

Arkouda-backed string dtype.

This dtype integrates Arkouda’s distributed Strings type with the pandas ExtensionArray interface via ArkoudaStringArray. It enables pandas objects (Series, DataFrame) to hold large, server-backed string columns without converting to NumPy or Python objects.

construct_array_type()[source]

Returns the ArkoudaStringArray used as the storage class.

classmethod construct_array_type()[source]

Return the ExtensionArray subclass that handles storage for this dtype.

Returns:

The ArkoudaStringArray class associated with this dtype.

Return type:

type

kind = 'O'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = ''

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'string'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

class arkouda.ArkoudaUint64Dtype[source]

Bases: _ArkoudaBaseDtype

Arkouda-backed unsigned 64-bit integer dtype.

This dtype integrates Arkouda’s uint64 arrays with pandas, allowing users to create pandas.Series or pandas.DataFrame objects that store their data on the Arkouda server while still conforming to the pandas ExtensionArray API.

construct_array_type()[source]

Return the ArkoudaArray class used as the storage container for this dtype.

Examples

>>> import arkouda as ak
>>> import pandas as pd
>>> from arkouda.pandas.extension import ArkoudaUint64Dtype, ArkoudaArray
>>> arr = ArkoudaArray(ak.array([1, 2, 3], dtype="uint64"))
>>> s = pd.Series(arr, dtype=ArkoudaUint64Dtype())
>>> s
0    1
1    2
2    3
dtype: uint64
classmethod construct_array_type()[source]

Return the ExtensionArray class associated with this dtype.

This is required by the pandas ExtensionDtype API. It tells pandas which ExtensionArray subclass should be used to hold data of this dtype inside a pandas.Series or pandas.DataFrame.

Returns:

The ArkoudaArray class, which implements the storage and operations for Arkouda-backed arrays.

Return type:

type

kind = 'u'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = -1

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'uint64'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

class arkouda.ArkoudaUint8Dtype[source]

Bases: _ArkoudaBaseDtype

Arkouda-backed unsigned 8-bit integer dtype.

This dtype integrates Arkouda’s uint8 arrays with the pandas ExtensionArray API, allowing pandas Series and DataFrame objects to store and operate on Arkouda-backed unsigned 8-bit integers. The underlying storage is an Arkouda pdarray<uint8>, exposed through the ArkoudaArray extension array.

construct_array_type()[source]

Returns the ArkoudaArray type that provides the storage and behavior for this dtype.

classmethod construct_array_type()[source]

Return the ExtensionArray subclass that handles storage for this dtype.

This method is required by the pandas ExtensionDtype interface. It tells pandas which ExtensionArray class to use when creating arrays of this dtype (for example, when calling Series(..., dtype="arkouda.uint8")).

Returns:

The ArkoudaArray class associated with this dtype.

Return type:

type

kind = 'u'

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

na_value = -1

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

name = 'uint8'

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

class arkouda.CachedAccessor(name: str, accessor)[source]

Descriptor for caching namespace-based accessors.

This custom property-like object enables lazy initialization of accessors (e.g., .str, .dt) on Series-like objects, similar to pandas-style extension accessors.

Parameters:
  • name (str) – The name of the namespace to be accessed (e.g., df.foo).

  • accessor (type) – A class implementing the accessor logic.

Notes

The accessor class’s __init__ method must accept a single positional argument, which should be one of Series, DataFrame, or Index.
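The descriptor pattern behind this can be sketched in plain Python (hypothetical names; not Arkouda's exact implementation). On first attribute access the accessor is constructed and cached on the instance, shadowing the descriptor, so later accesses skip __get__ entirely:

```python
class CachedAccessorSketch:
    """Minimal sketch of a cached namespace accessor descriptor."""

    def __init__(self, name, accessor):
        self._name = name
        self._accessor = accessor

    def __get__(self, obj, cls):
        if obj is None:
            # Class-level access returns the accessor type itself.
            return self._accessor
        # Lazily construct the accessor, then cache it in the instance
        # __dict__ so subsequent lookups bypass this descriptor.
        accessor_obj = self._accessor(obj)
        object.__setattr__(obj, self._name, accessor_obj)
        return accessor_obj


class StrAccessor:
    """Toy accessor: accepts the owning object as its single argument."""

    def __init__(self, series):
        self._series = series

    def upper(self):
        return [s.upper() for s in self._series.data]


class ToySeries:
    str = CachedAccessorSketch("str", StrAccessor)

    def __init__(self, data):
        self.data = data


s = ToySeries(["a", "b"])
print(s.str.upper())   # ['A', 'B']
print(s.str is s.str)  # True: cached after the first access
```

Because the descriptor defines only __get__ (a non-data descriptor), the cached instance attribute takes precedence on every later lookup.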

class arkouda.Categorical(values, **kwargs)[source]

Represents an array of values belonging to named categories.

Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.

Parameters:
  • values (Strings, Categorical, pd.Categorical) – Values to convert to categories

  • na_value (str scalar) – The value to use to represent missing/null data

categories

The set of category labels (determined automatically)

Type:

Strings

codes

The category indices of the values or -1 for N/A

Type:

pdarray, int64

permutation

The permutation that groups the values in the same order as categories

Type:

pdarray, int64

segments

When values are grouped, the starting offset of each group

Type:

Union[pdarray, None]

size

The number of items in the array

Type:

int_scalars

nlevels

The number of distinct categories

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The sizes of each dimension of the array

Type:

tuple
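The codes/categories decomposition described by the attributes above can be sketched with NumPy (an illustration of the layout only, not Arkouda's server-side implementation):

```python
import numpy as np

# Hypothetical stand-ins for a Categorical's components:
# `categories` holds the unique labels; `codes` holds, for each element,
# the index of its label in `categories` (-1 would indicate N/A).
categories = np.array(["bird", "cat", "dog"])
codes = np.array([2, 1, 2, 0])

# The original values are recovered by indexing categories with codes.
values = categories[codes]
print(values)  # ['dog' 'cat' 'dog' 'bird']
```

Storing small integer codes instead of repeated strings is what saves memory and speeds up operations when values repeat often.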

BinOps
RegisterablePieces
RequiredPieces
argsort(algorithm: arkouda.numpy.sorting.SortingAlgorithm = SortingAlgorithm.RadixSortLSD, ascending: bool = True) arkouda.numpy.pdarrayclass.pdarray[source]

Return the permutation of indices that would sort the Categorical.

Sorting is based on the order of the Categorical’s categories, not on the underlying codes.

Parameters:
  • algorithm (SortingAlgorithm, default SortingAlgorithm.RadixSortLSD) – The sorting algorithm to use.

  • ascending (bool, default True) – Whether to return indices that would sort the Categorical in ascending category order. If False, returns indices for descending order.

Returns:

An array of indices such that self[index] is sorted by category order.

Return type:

pdarray

Examples

>>> import arkouda as ak
>>> cat = ak.Categorical(ak.array(['dog', 'cat', 'dog', 'bird']))
>>> cat.argsort()
array([3 1 0 2])
>>> cat.argsort(ascending=False)
array([2 0 1 3])

The result can be used to reorder the Categorical:

>>> sorted_cat = cat[cat.argsort()]
>>> sorted_cat
array(['bird', 'cat', 'dog', 'dog'])

categories: arkouda.numpy.strings.Strings
codes: arkouda.numpy.pdarrayclass.pdarray
concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical[source]

Merge this Categorical with other Categorical objects in the array.

Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.

Parameters:
  • others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one

  • ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

Returns:

The merged Categorical object

Return type:

Categorical

Raises:

TypeError – Raised if any others array objects are not Categorical objects

Notes

This operation can be expensive – slower than concatenating Strings.
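Synchronizing categories across arrays can be sketched with NumPy: take the union of the label sets, then renumber each array's codes against the shared labels (an illustration of the idea, not Arkouda's algorithm):

```python
import numpy as np

# Two hypothetical categoricals as (labels, codes) pairs.
a_cats, a_codes = np.array(["cat", "dog"]), np.array([1, 0])
b_cats, b_codes = np.array(["bird", "dog"]), np.array([0, 1, 1])

# Shared, sorted category set for the merged result.
merged_cats = np.union1d(a_cats, b_cats)        # ['bird' 'cat' 'dog']

# Translate each side's old codes to positions in the shared labels,
# then concatenate the renumbered codes.
remap_a = np.searchsorted(merged_cats, a_cats)
remap_b = np.searchsorted(merged_cats, b_cats)
merged_codes = np.concatenate([remap_a[a_codes], remap_b[b_codes]])
print(merged_cats[merged_codes])  # ['dog' 'cat' 'bird' 'dog' 'dog']
```

The renumbering step over all codes is part of why this is slower than concatenating plain Strings.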

contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Raised if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
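The optimization described in the note can be sketched with NumPy: the predicate is evaluated once per unique label, then broadcast to the full array by indexing with the codes (an illustration only, not Arkouda's implementation):

```python
import numpy as np

# Hypothetical components of a Categorical.
categories = np.array(["apple", "banana", "cherry"])
codes = np.array([0, 2, 1, 0, 2])  # typically a very large array

# Evaluate contains("an") on the (small) unique labels only...
label_hits = np.char.find(categories, "an") >= 0  # [False  True False]

# ...then expand to per-element results via the codes.
result = label_hits[codes]
print(result)  # [False False  True False False]
```

The search cost scales with the number of unique labels rather than the array length.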

copy() Categorical[source]

Return a copy of the given Categorical.

Returns:

A deep copy of the Categorical.

Return type:

Categorical

dtype
endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Raised if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Categoricals are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Categoricals are the same, otherwise False.

Return type:

bool_scalars

Examples

>>> import arkouda as ak
>>> c = Categorical(ak.array(["a", "b", "c"]))
>>> c_cpy = Categorical(ak.array(["a", "b", "c"]))
>>> c.equals(c_cpy)
np.True_
>>> c2 = Categorical(ak.array(["a", "x", "c"]))
>>> c.equals(c2)
np.False_
classmethod from_codes(codes: arkouda.numpy.pdarrayclass.pdarray, categories: arkouda.numpy.strings.Strings, permutation=None, segments=None, **kwargs) Categorical[source]

Make a Categorical from codes and categories arrays.

If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.

Parameters:
  • codes (pdarray, int64) – Category indices of each value

  • categories (Strings) – Unique category labels

  • permutation (pdarray, int64) – The permutation that groups the values in the same order as categories

  • segments (pdarray, int64) – When values are grouped, the starting offset of each group

Returns:

The Categorical object created from the input parameters

Return type:

Categorical

Raises:

TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object

classmethod from_return_msg(rep_msg) Categorical[source]

Create categorical from return message from server.

Notes

This is currently only used when reading a Categorical from HDF5 files.

group() arkouda.numpy.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent categories together.

All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.
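The grouping permutation can be illustrated with NumPy: a stable sort of the dense integer codes places equal categories in contiguous blocks, as the note above describes:

```python
import numpy as np

codes = np.array([2, 0, 2, 1, 0])  # hypothetical category codes

# Sorting small dense integers is cheap compared to sorting
# 128-bit hash values of the full strings.
perm = np.argsort(codes, kind="stable")
grouped = codes[perm]
print(grouped)  # [0 0 1 2 2] -- equal categories are now contiguous
```

Each block's starting offset corresponds to the segments attribute of a grouped Categorical.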

hash() Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each element of the Categorical.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

in1d(test: arkouda.numpy.strings.Strings | Categorical) arkouda.numpy.pdarrayclass.pdarray[source]

Whether each element is also present in the test Strings or Categorical object.

Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.

Parameters:

test (Union[Strings, Categorical]) – The values against which to test each value of self.

Returns:

The values self[in1d] are in the test Strings or Categorical object.

Return type:

pdarray

Raises:

TypeError – Raised if test is not a Strings or Categorical object

See also

unique, intersect1d, union1d

Notes

in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences. in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.

Examples

>>> import arkouda as ak
>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True True True True True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False False False False False])
property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Return a JSON formatted string containing information about all components of self.

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool_

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

See also

register, attach, unregister, unregister_categorical_by_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isna()[source]

Find where values are missing or null (as defined by self.na_value).

logger
property nbytes

The size of the Categorical in bytes.

Returns:

The size of the Categorical in bytes.

Return type:

int

ndim: arkouda.numpy.dtypes.int_scalars
nlevels: arkouda.numpy.dtypes.int_scalars
objType = 'Categorical'
permutation: arkouda.numpy.pdarrayclass.pdarray | None
pretty_print_info() None[source]

Print information about all components of self in a human-readable format.

register(user_defined_name: str) Categorical[source]

Register this Categorical object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components

Returns:

The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.

Return type:

Categorical

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Categorical with the user_defined_name

See also

unregister, attach, unregister_categorical_by_name, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
reset_categories() Categorical[source]

Recompute the category labels, discarding any unused labels.

This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.

Returns:

A Categorical object generated from the current instance

Return type:

Categorical
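The effect of discarding unused labels can be sketched with NumPy: only the labels actually referenced by the codes are kept, and the codes are renumbered (a hypothetical illustration, assuming no missing values):

```python
import numpy as np

categories = np.array(["bird", "cat", "dog", "emu"])
codes = np.array([2, 2, 1])  # e.g. after slicing; "bird" and "emu" unused

# Keep only referenced labels and renumber the codes to match.
used, new_codes = np.unique(codes, return_inverse=True)
new_categories = categories[used]
print(new_categories)  # ['cat' 'dog']
print(new_codes)       # [1 1 0]
```

The values are unchanged (new_categories[new_codes] equals categories[codes]), but later operations iterate over a smaller category set.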

segments: arkouda.numpy.pdarrayclass.pdarray | None
set_categories(new_categories, na_value=None)[source]

Set categories to user-defined values.

Parameters:
  • new_categories (Strings) – The array of new categories to use. Must be unique.

  • na_value (str scalar) – The value to use to represent missing/null data

Returns:

A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.

Return type:

Categorical
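The remapping described above can be sketched with NumPy: old values found in the new categories keep their labels, and values not present map to the NA value (a hypothetical illustration; Arkouda performs this server-side):

```python
import numpy as np

values = np.array(["dog", "cat", "dog", "emu"])
new_categories = np.array(["N/A", "cat", "dog"])  # user-defined, unique
na_value = "N/A"

# Map each old value to its position in new_categories,
# falling back to the NA slot for values that are absent.
lookup = {label: i for i, label in enumerate(new_categories)}
na_code = lookup[na_value]
new_codes = np.array([lookup.get(v, na_code) for v in values])
print(new_categories[new_codes])  # ['dog' 'cat' 'dog' 'N/A']
```

Here "emu" is not among the new categories, so it is assigned the NA value.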

shape: tuple
size: arkouda.numpy.dtypes.int_scalars
sort_values()[source]

Return a sorted Categorical by category labels.

Returns:

A new Categorical with values sorted by category.

Return type:

Categorical

classmethod standardize_categories(arrays, na_value='N/A')[source]

Standardize an array of Categoricals so that they share the same categories.

Parameters:
  • arrays (sequence of Categoricals) – The Categoricals to standardize

  • na_value (str scalar) – The value to use to represent missing/null data

Returns:

A list of the original Categoricals remapped to the shared categories.

Return type:

List of Categoricals

startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Raised if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

to_hdf(prefix_path: str, dataset: str = 'categorical_array', mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute')[source]

Save the Categorical to HDF5.

The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode ({'truncate', 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type ({"single", "distribute"}) – Default: "distribute". When set to ‘single’, the dataset is written to a single file. When ‘distribute’, the dataset is written to one file per locale.

See also

load

to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray.

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray of strings corresponding to the values in this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.core.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.core.client.maxTransferBytes to a larger value, but proceed with caution.

to_pandas() pandas.Categorical[source]

Return the equivalent Pandas Categorical.

to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str[source]

[Not Yet Implemented] Save the Categorical to a Parquet dataset.

Warning: This method is currently not supported and will raise a RuntimeError. Parquet support for Categorical is under development.

When implemented, this method will write the Categorical to a set of Parquet files, one file per locale on the Arkouda server. Each file will be named using the prefix_path with locale-specific suffixes.

Parameters:
  • prefix_path (str) – The directory and filename prefix shared by all output files.

  • dataset (str, default="categorical_array") – The dataset name to use to create the Parquet files.

  • mode ({'truncate', 'append'}, default='truncate') – Specifies write behavior. Use ‘truncate’ to overwrite existing files or ‘append’ to add to them. (Appending is not yet efficient.)

  • compression (str, optional) – Compression algorithm to use when writing the file. Supported values include: ‘snappy’, ‘gzip’, ‘brotli’, ‘zstd’, ‘lz4’. Default is None (no compression).

Returns:

A message indicating the result of the operation.

Return type:

str

Raises:

RuntimeError – Always raised. Parquet export for Categorical is not yet supported.

Notes

  • The specified prefix_path must be writable and accessible to the Arkouda server.

  • The user must have write permission.

  • Output files will be named as <prefix_path>_LOCALE<i> for each locale i.

  • Appending mode requires that the existing files already match the server’s locale layout.

  • Appending mode is supported, but is not efficient.

  • File extensions are not used to determine file type.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

See also

to_hdf

Save the Categorical to HDF5 format (currently supported).

to_strings() arkouda.numpy.strings.Strings[source]

Convert the Categorical to Strings.

Returns:

A Strings object corresponding to the values in this Categorical.

Return type:

Strings

Examples

>>> import arkouda as ak
>>> a = ak.array(["a","b","c"])
>>> a
array(['a', 'b', 'c'])
>>> c = ak.Categorical(a)
>>> c.to_strings()
array(['a', 'b', 'c'])
>>> isinstance(c.to_strings(), ak.Strings)
True
tolist() List[str][source]

Convert the Categorical to a list.

Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list of strings corresponding to the values in this Categorical

Return type:

List[str]

Notes

The number of bytes in the Categorical cannot exceed ak.core.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.core.client.maxTransferBytes to a larger value, but proceed with caution.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Send a Categorical object to a different Arkouda server.

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open numLocales ports, used in succession, covering the range {port..(port+numLocales)} (e.g., for an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unique() Categorical[source]

Return the unique category values in the Categorical.

Returns:

A new Categorical containing only the unique category labels in use.

Return type:

Categorical

unregister() None[source]

Unregister this Categorical object.

Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

See also

register, attach, unregister_categorical_by_name, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

update_hdf(prefix_path, dataset='categorical_array', repack=True)[source]

Overwrite the dataset with the provided name with the data from this Categorical object.

If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Categorical

Notes

  • If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.

class arkouda.DataFrame(initialdata=None, index=None, columns=None)[source]

Bases: collections.UserDict

A DataFrame structure based on Arkouda arrays.

Parameters:
  • initialdata (list or dict of lists, tuples, or pdarrays) – Each list or dictionary entry corresponds to one column of data and should be a homogeneous type. Different columns may have different types. If using a dictionary, keys should be strings.

  • index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.

  • columns (list, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to a stringified integer range.

Examples

Create an empty DataFrame and add a column of data:

>>> import arkouda as ak
>>> df = ak.DataFrame()
>>> df["a"] = ak.array([1, 2, 3])
>>> df
   a
0  1
1  2
2  3 (3 rows x 1 columns)

Create a new DataFrame using a dictionary of data:

>>> userName = ak.array(["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame(
...     {
...         "userName": userName,
...         "userID": userID,
...         "item": item,
...         "day": day,
...         "amount": amount,
...     }
... )
>>> df
  userName  userID  item  day  amount
0    Alice     111     0    5     0.5
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3
5    Alice     111     0    6     0.6 (6 rows x 5 columns)

Indexing works slightly differently than with pandas:

>>> df[0]
{'userName': np.str_('Alice'), 'userID': np.int64(111), 'item': np.int64(0),
'day': np.int64(5), 'amount': np.float64(0.5)}
>>> df["userID"]
array([111 222 111 333 222 111])
>>> df["userName"]
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])

Select rows by integer array:

>>> df[ak.array([1, 3, 5])]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
3    Carol     333     1    5     1.2
5    Alice     111     0    6     0.6 (3 rows x 5 columns)

Slice rows:

>>> df[1:5:1]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3 (4 rows x 5 columns)
>>> df[ak.array([1, 2, 3])]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2 (3 rows x 5 columns)

Select columns by name:

>>> df[["userID", "day"]]
   userID  day
0     111    5
1     222    5
2     111    6
3     333    5
4     222    6
5     111    6 (6 rows x 2 columns)
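For readers coming from pandas, the bare [] row selections above correspond to pandas .iloc. A pandas-only sketch of the analogous operations (no Arkouda server needed; the small data here is a stand-in for the example above):

```python
import pandas as pd

df = pd.DataFrame({"userID": [111, 222, 111], "day": [5, 5, 6]})

# Arkouda's df[0] returns a whole row as a dict; pandas uses .iloc for that.
row = df.iloc[0].to_dict()

# Arkouda's df[ak.array([1, 2])] (row selection by integer array) is also .iloc.
subset = df.iloc[[1, 2]]

# Column selection by name works the same way in both libraries.
ids = df["userID"]
```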
all(axis=0) arkouda.pandas.series.Series | bool[source]

Return whether all elements are True, potentially over an axis.

Returns True unless there is at least one element along a DataFrame axis that is False.

Currently, any columns that are not of type bool are ignored. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.pandas.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[True,True,True,True]})
>>> df
       A      B      C     D
0   True   True   True  True
1   True   True  False  True
2   True   True   True  True
3  False  False  False  True (4 rows x 4 columns)
>>> df.all(axis=0)
A    False
B    False
C    False
D     True
dtype: bool
>>> df.all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool
>>> df.all(axis=None)
False
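The axis semantics mirror NumPy reductions; a NumPy-only sketch (no Arkouda server needed) of the same three reductions on a small boolean array:

```python
import numpy as np

data = np.array([[True, True],
                 [True, False]])

per_column = data.all(axis=0)  # reduce down the index -> one value per column
per_row = data.all(axis=1)     # reduce across the columns -> one value per row
overall = data.all(axis=None)  # reduce everything -> a single scalar
```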
any(axis=0) arkouda.pandas.series.Series | bool[source]

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element along a DataFrame axis that is True.

Currently, any columns that are not of type bool are ignored. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.pandas.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[False,False,False,False]})
>>> df
       A      B      C      D
0   True   True   True  False
1   True   True  False  False
2   True   True   True  False
3  False  False  False  False (4 rows x 4 columns)
>>> df.any(axis=0)
A     True
B     True
C     True
D    False
dtype: bool
>>> df.any(axis=1)
0     True
1     True
2     True
3    False
dtype: bool
>>> df.any(axis=None)
True
append(other, ordered=True)[source]

Concatenate data from ‘other’ onto the end of this DataFrame, in place.

Internally, this uses the arkouda concatenate function to append the data from each column in other to the end of self. The operation is done in place, in the sense that the underlying pdarrays are updated with the result of the concatenation, rather than a new DataFrame object being returned.

Parameters:
  • other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.

  • ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order.

Returns:

Appending occurs in place, but the result is returned for compatibility.

Return type:

self

Examples

>>> import arkouda as ak
>>> df1 = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df1
    col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)
>>> df2 = ak.DataFrame({'col1': [3], 'col2': [5]})
>>> df2
    col1  col2
0     3     5 (1 rows x 2 columns)
>>> df1.append(df2)
    col1  col2
0     1     3
1     2     4
2     3     5 (3 rows x 2 columns)
>>> df1
    col1  col2
0     1     3
1     2     4
2     3     5 (3 rows x 2 columns)
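Unlike Arkouda's in-place append, the pandas idiom (pd.concat) returns a new frame and leaves its inputs untouched; a pandas-only sketch of the contrast (no server needed):

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df2 = pd.DataFrame({"col1": [3], "col2": [5]})

# pandas equivalent: pd.concat returns a NEW frame and df1 itself is unchanged,
# whereas Arkouda's DataFrame.append mutates the calling frame in place.
combined = pd.concat([df1, df2])
```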
apply_permutation(perm)[source]

Apply a permutation to an entire DataFrame.

The operation is done in place and the original DataFrame will be modified.

This may be useful if you want to unsort a DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.

Parameters:

perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.

See also

sort

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6 (3 rows x 2 columns)
>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> df
   col1  col2
0     1     4
2     3     6
1     2     5 (3 rows x 2 columns)
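The note about the inverse of a sorting permutation can be sketched with plain NumPy (no Arkouda server needed): applying argsort sorts the data, and applying the inverse of that permutation restores the original order.

```python
import numpy as np

data = np.array([30, 10, 20])
perm = np.argsort(data)          # permutation that sorts the data
sorted_data = data[perm]

# Inverse permutation: position perm[i] of the inverse maps back to i.
inv = np.empty_like(perm)
inv[perm] = np.arange(len(perm))
restored = sorted_data[inv]      # back to the original order
```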
argsort(key, ascending=True)[source]

Return the permutation that sorts the dataframe by key.

Parameters:
  • key (str) – The key to sort on.

  • ascending (bool, default = True) – If True, sort the key in ascending order. Otherwise, sort the key in descending order.

Returns:

The permutation array that sorts the data on key.

Return type:

arkouda.numpy.pdarrayclass.pdarray

See also

coargsort

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> df
   col1  col2
0   1.1     6
1   3.1     5
2   2.1     4 (3 rows x 2 columns)
>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> sorted_df1
   col1  col2
0   1.1     6
2   2.1     4
1   3.1     5 (3 rows x 2 columns)
>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> sorted_df2
   col1  col2
2   2.1     4
1   3.1     5
0   1.1     6 (3 rows x 2 columns)
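For data without ties, a descending sort permutation is simply the ascending one reversed; a NumPy-only sketch of what the ascending flag amounts to (using the same col1 values as above):

```python
import numpy as np

col1 = np.array([1.1, 3.1, 2.1])
asc = np.argsort(col1)   # ascending permutation
desc = asc[::-1]         # one valid descending permutation: reverse it
```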
assign(**kwargs) DataFrame[source]

Assign new columns to a DataFrame.

Return a new object with all original columns in addition to the newly assigned ones. Existing columns that are reassigned will be overwritten.

Parameters:

**kwargs (dict of {str: callable or Series}) –

The column names are the keyword arguments. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not modify the input DataFrame.

If the values are not callable (for example a Series, scalar, or array), they are directly assigned.

Returns:

A new DataFrame with the new columns added alongside the existing columns.

Return type:

DataFrame

Notes

Assigning multiple columns within the same assign call is supported. Later items in **kwargs may refer to newly created or modified columns in df. Items are computed and assigned to df in order.

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame(
...     {"temp_c": [17.0, 25.0]},
...     index=["Portland", "Berkeley"],
... )
>>> df
          temp_c
Portland    17.0
Berkeley    25.0 (2 rows x 1 columns)

When the value is a callable, it is evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0 (2 rows x 2 columns)

The same behavior can be achieved by referencing an existing column directly:

>>> df.assign(temp_f=df["temp_c"] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0 (2 rows x 2 columns)

Multiple columns can be created in a single call where one depends on another defined within the same assign:

>>> df.assign(
...     temp_f=lambda x: x["temp_c"] * 9 / 5 + 32,
...     temp_k=lambda x: (x["temp_f"] + 459.67) * 5 / 9,
... )
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15 (2 rows x 3 columns)
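The in-order evaluation of **kwargs matches pandas.DataFrame.assign; a pandas-only sketch (no server needed) where a later column refers to one created earlier in the same call:

```python
import pandas as pd

df = pd.DataFrame({"temp_c": [17.0, 25.0]})

out = df.assign(
    temp_f=lambda x: x["temp_c"] * 9 / 5 + 32,
    # temp_f already exists by the time temp_k is computed
    temp_k=lambda x: (x["temp_f"] + 459.67) * 5 / 9,
)
```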
coargsort(keys, ascending=True)[source]

Return the permutation that sorts the dataframe by keys.

Note: Sorting using Strings may not yield correct sort order.

Parameters:

  • keys (list of str) – The keys to sort on.

  • ascending (bool, default = True) – If True, sort the keys in ascending order. Otherwise, sort the keys in descending order.

Returns:

The permutation array that sorts the data on keys.

Return type:

arkouda.numpy.pdarrayclass.pdarray

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> df
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7 (3 rows x 3 columns)
>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
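coargsort's multi-key behavior corresponds to NumPy's lexsort, which treats the last key passed as primary, so the key order must be reversed; a NumPy-only sketch reproducing the permutation above:

```python
import numpy as np

col1 = np.array([2, 2, 1])
col2 = np.array([3, 4, 3])

# np.lexsort sorts by the LAST key first, so pass keys in reverse order
# to get "sort by col1, then col2".
perm = np.lexsort((col2, col1))
```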
property columns

An Index where the values are the column names of the dataframe.

Returns:

The values of the index are the column names of the dataframe.

Return type:

arkouda.pandas.index.Index

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)
>>> df.columns
Index(['col1', 'col2'], dtype='<U0')
classmethod concat(items, ordered=True)[source]

Essentially an append, but with different formatting.

copy(deep=True)[source]

Make a copy of this object’s data.

When deep = True (default), a new object will be created with a copy of the calling object’s data. Modifications to the data of the copy will not be reflected in the original object.

When deep = False a new object will be created without copying the calling object’s data. Any changes to the data of the original object will be reflected in the shallow copy, and vice versa.

Parameters:

deep (bool, default=True) – When True, return a deep copy. Otherwise, return a shallow copy.

Returns:

A deep or shallow copy according to caller specification.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)
>>> df_deep = df.copy(deep=True)
>>> df_deep['col1'] +=1
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)
>>> df_shallow = df.copy(deep=False)
>>> df_shallow['col1'] +=1
>>> df
   col1  col2
0     2     3
1     3     4 (2 rows x 2 columns)
corr() DataFrame[source]

Return new DataFrame with pairwise correlation of columns.

Returns:

Arkouda DataFrame containing correlation matrix of all columns.

Return type:

DataFrame

Raises:

RuntimeError – Raised if there’s a server-side error thrown.

See also

pdarray.corr

Notes

Generate the correlation matrix using Pearson R for all columns.

Attempts to convert to numeric values where possible for inclusion in the matrix.

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [-1, -2]})
>>> df
   col1  col2
0     1    -1
1     2    -2 (2 rows x 2 columns)
>>> corr = df.corr()
>>> corr
      col1  col2
col1   1.0  -1.0
col2  -1.0   1.0 (2 rows x 2 columns)
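corr uses Pearson's R, i.e. the covariance of two columns normalized by the product of their standard deviations. A NumPy-only sketch (the pearson_r helper is hypothetical, not part of Arkouda) reproducing the -1.0 off-diagonal entry above:

```python
import numpy as np

col1 = np.array([1.0, 2.0])
col2 = np.array([-1.0, -2.0])

def pearson_r(x, y):
    # Center both columns, then normalize the cross product by the
    # product of the columns' root sums of squares.
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

r = pearson_r(col1, col2)  # perfectly anti-correlated
```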
count(axis: int | str = 0, numeric_only=False) arkouda.pandas.series.Series[source]

Count non-NA cells for each column or row.

Values of np.nan are considered NA.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

  • numeric_only (bool = False) – Include only float, int or boolean data.

Returns:

For each column/row the number of non-NA/null entries.

Return type:

Series

Raises:

ValueError – Raised if axis is not 0, 1, ‘index’, or ‘columns’.

See also

GroupBy.count

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame({'col_A': ak.array([7, np.nan]), 'col_B':ak.array([1, 9])})
>>> df
   col_A  col_B
0    7.0      1
1    NaN      9 (2 rows x 2 columns)
>>> df.count()
col_A    1
col_B    2
dtype: int64
>>> df = ak.DataFrame({'col_A': ak.array(["a","b","c"]), 'col_B':ak.array([1, np.nan, np.nan])})
>>> df
  col_A  col_B
0     a    1.0
1     b    NaN
2     c    NaN (3 rows x 2 columns)
>>> df.count()
col_A    3
col_B    1
dtype: int64
>>> df.count(numeric_only=True)
col_B    1
dtype: int64
>>> df.count(axis=1)
0    2
1    1
2    1
dtype: int64
drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) None | DataFrame[source]

Drop rows or columns from the DataFrame.

Parameters:
  • keys (Union[str, int, List[Union[str, int]]]) – Label or list of labels to drop along the specified axis.

  • axis (Union[str, int], default=0) –

    Axis along which to drop.

    • 0 or "index" — drop rows

    • 1 or "columns" — drop columns

  • inplace (bool, default=False) – If True, perform the operation on the calling object. If False, return a new object.

Returns:

DataFrame when inplace=False; otherwise None.

Return type:

Union[None, DataFrame]

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"col1": [1, 2], "col2": [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)

Drop a column:

>>> df.drop("col1", axis=1)
   col2
0     3
1     4 (2 rows x 1 columns)

Drop a row:

>>> df.drop(0, axis=0)
   col1  col2
1     2     4 (1 rows x 2 columns)
drop_duplicates(subset=None, keep='first')[source]

Drop duplicated rows and return the resulting DataFrame.

If a subset of the columns is provided, then only one instance of each duplicated row will be returned (keep determines which row).

Parameters:
  • subset (Iterable) – Iterable of column names to use to dedupe.

  • keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.

Returns:

DataFrame with duplicates removed.

Return type:

DataFrame

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     2     5
3     3     6 (4 rows x 2 columns)
>>> df.drop_duplicates()
   col1  col2
0     1     4
1     2     5
3     3     6 (3 rows x 2 columns)
dropna(axis: int | str = 0, how: str | None = None, thresh: int | None = None, ignore_index: bool = False) DataFrame[source]

Remove missing values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default = 0) –

    Determine if rows or columns which contain missing values are removed.

    0, or ‘index’: Drop rows which contain missing values.

    1, or ‘columns’: Drop columns which contain missing values.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default='any') –

    Determine whether a row or column is removed from the DataFrame when it has at least one NA or all NA values.

    ’any’: If any NA values are present, drop that row or column.

    ’all’: If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:

DataFrame with NA entries dropped from it.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame(
...    {
...        "A": [True, True, True, True],
...        "B": [1, np.nan, 2, np.nan],
...        "C": [1, 2, 3, np.nan],
...        "D": [False, False, False, False],
...        "E": [1, 2, 3, 4],
...        "F": ["a", "b", "c", "d"],
...        "G": [1, 2, 3, 4],
...    }
...   )
>>> df
      A    B    C      D  E  F  G
0  True  1.0  1.0  False  1  a  1
1  True  NaN  2.0  False  2  b  2
2  True  2.0  3.0  False  3  c  3
3  True  NaN  NaN  False  4  d  4 (4 rows x 7 columns)
>>> df.dropna()
      A    B    C      D  E  F  G
0  True  1.0  1.0  False  1  a  1
2  True  2.0  3.0  False  3  c  3 (2 rows x 7 columns)
>>> df.dropna(axis=1)
      A      D  E  F  G
0  True  False  1  a  1
1  True  False  2  b  2
2  True  False  3  c  3
3  True  False  4  d  4 (4 rows x 5 columns)
>>> df.dropna(axis=1, thresh=3)
      A    C      D  E  F  G
0  True  1.0  False  1  a  1
1  True  2.0  False  2  b  2
2  True  3.0  False  3  c  3
3  True  NaN  False  4  d  4 (4 rows x 6 columns)
>>> df.dropna(axis=1, how="all")
      A    B    C      D  E  F  G
0  True  1.0  1.0  False  1  a  1
1  True  NaN  2.0  False  2  b  2
2  True  2.0  3.0  False  3  c  3
3  True  NaN  NaN  False  4  d  4 (4 rows x 7 columns)
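thresh keeps a row or column only when it has at least that many non-NA values; a pandas-only sketch (no server needed) mirroring the thresh=3 example above, where column B is dropped and column C is kept:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "B": [1.0, np.nan, 2.0, np.nan],  # 2 non-NA values -> dropped at thresh=3
    "C": [1.0, 2.0, 3.0, np.nan],     # 3 non-NA values -> kept at thresh=3
})

kept = df.dropna(axis=1, thresh=3)
```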
property dtypes

The dtypes of the dataframe.

Returns:

dtypes – The dtypes of the dataframe.

Return type:

arkouda.pandas.row.Row

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df
   col1 col2
0     1    a
1     2    b (2 rows x 2 columns)
>>> df.dtypes
{'col1': 'int64', 'col2': 'str'}
property empty

Whether the dataframe is empty.

Returns:

True if the dataframe is empty, otherwise False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({})
>>> df
Empty DataFrame
Columns: []
Index: [] (None rows x 0 columns)
>>> df.empty
True
filter_by_range(keys, low=1, high=None)[source]

Filter rows by the size of groups defined on one or more columns.

Group the DataFrame by the specified keys, compute the count of each group, and return a boolean mask indicating which rows belong to groups whose sizes fall within the inclusive range [low, high].

Parameters:
  • keys (str or list of str) – Column name or list of column names to group by.

  • low (int, default=1) – Minimum group size (inclusive). Must be >= 0.

  • high (int or None, default=None) – Maximum group size (inclusive). If None, no upper bound is applied.

Returns:

A boolean mask array of length equal to the number of rows in the DataFrame, where True indicates the row’s group size is between low and high.

Return type:

pdarray of bool

Raises:
  • ValueError – If low is negative, or if high is not None and high < low.

  • TypeError – If keys is not a string or list of strings.

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 2, 2, 3, 3], 'col2': [4, 5, 6, 7, 8, 9]})
>>> df
   col1  col2
0     1     4
1     2     5
2     2     6
3     2     7
4     3     8
5     3     9 (6 rows x 2 columns)
>>> df.filter_by_range("col1", low=1, high=2)
array([True False False False True True])
>>> filtered_df = df[df.filter_by_range("col1", low=1, high=2)]
>>> filtered_df
   col1  col2
0     1     4
4     3     8
5     3     9 (3 rows x 2 columns)
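The same group-size mask can be sketched in pandas with groupby/transform (a pandas analogue, not the Arkouda implementation), using the data from the example above:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 2, 2, 3, 3], "col2": [4, 5, 6, 7, 8, 9]})

# Size of the group each row belongs to, then an inclusive-range mask
# equivalent to filter_by_range("col1", low=1, high=2).
sizes = df.groupby("col1")["col1"].transform("size")
mask = (sizes >= 1) & (sizes <= 2)
```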
classmethod from_pandas(pd_df)[source]

Copy the data from a pandas DataFrame into a new arkouda.pandas.dataframe.DataFrame.

Parameters:

pd_df (pandas.DataFrame) – A pandas DataFrame to convert.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> import pandas as pd
>>> pd_df = pd.DataFrame({"A":[1,2],"B":[3,4]})
>>> type(pd_df)
<class 'pandas....DataFrame'>
>>> pd_df
   A  B
0  1  3
1  2  4
>>> ak_df = ak.DataFrame.from_pandas(pd_df)
>>> type(ak_df)
<class 'arkouda....DataFrame'>
>>> ak_df
   A  B
0  1  3
1  2  4 (2 rows x 2 columns)
classmethod from_return_msg(rep_msg)[source]

Create a DataFrame object from an arkouda server response message.

Parameters:

rep_msg (string) – Server response message used to create a DataFrame.

Return type:

DataFrame

groupby(keys, use_series=True, as_index=True, dropna=True)[source]

Group the dataframe by a column or a list of columns.

Alias for GroupBy.

Parameters:
  • keys (str or list of str) – An (ordered) list of column names or a single string to group by.

  • use_series (bool, default=True) – If True, returns an arkouda.pandas.dataframe.DataFrameGroupBy object. Otherwise an arkouda.pandas.groupbyclass.GroupBy object.

  • as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.

  • dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.pandas.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.pandas.groupbyclass.GroupBy object.

Return type:

arkouda.pandas.dataframe.DataFrameGroupBy or arkouda.pandas.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df
   col1  col2
0   1.0     4
1   1.0     5
2   2.0     6
3   NaN     7 (4 rows x 2 columns)
>>> df.groupby("col1")
<arkouda.pandas.groupbyclass.GroupBy object at 0x795584773f00>
>>> df.groupby("col1").size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.groupby("col1",use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.groupby("col1",use_series=True, as_index = False).size()
   col1  size
0   1.0     2
1   2.0     1 (2 rows x 2 columns)
head(n=5)[source]

Return the first n rows.

This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.

Parameters:

n (int, default = 5) – Number of rows to select.

Returns:

The first n rows of the DataFrame.

Return type:

DataFrame

See also

tail

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9 (10 rows x 2 columns)
>>> df.head()
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4 (5 rows x 2 columns)
>>> df.head(n=2)
   col1  col2
0     0     0
1     1    -1 (2 rows x 2 columns)
property index

The index of the dataframe.

Returns:

The index of the dataframe.

Return type:

arkouda.pandas.index.Index or arkouda.pandas.index.MultiIndex

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4 (2 rows x 2 columns)
>>> df.index
Index(array([0 1]), dtype='int64')
property info

Return a summary string of this dataframe.

Returns:

A summary string of this dataframe.

Return type:

str

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df
   col1 col2
0     1    a
1     2    b (2 rows x 2 columns)
>>> df.info
"DataFrame(['col1', 'col2'], 2 rows, 36.00 B)"
is_registered() bool[source]

Return True if the object is contained in the registry.

Returns:

Indicates if the object is contained in the registry.

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components.

See also

register, unregister, unregister_dataframe_by_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
   col1  col2
0     1     4
1     2     5
2     3     6 (3 rows x 2 columns)
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
isin(values: arkouda.numpy.pdarrayclass.pdarray | Dict | arkouda.pandas.series.Series | DataFrame) DataFrame[source]

Determine whether each element in the DataFrame is contained in values.

Parameters:

values (Union[pdarray, Dict, Series, DataFrame]) – The values to check for in the DataFrame. A Series must have a single index.

Returns:

Arkouda DataFrame of booleans indicating whether each element in the DataFrame is contained in values.

Return type:

DataFrame

See also

ak.Series.isin

Notes

  • Pandas supports values being any iterable type. In Arkouda, this is replaced with pdarray.

  • Pandas supports ~ operations. Currently, ak.DataFrame does not.

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"col_A": ak.array([7, 3]), "col_B": ak.array([1, 9])})
>>> df
   col_A  col_B
0      7      1
1      3      9 (2 rows x 2 columns)

When values is a pdarray, every value in the DataFrame is checked to determine whether it exists in values.

>>> df.isin(ak.array([0, 1]))
   col_A  col_B
0  False   True
1  False  False (2 rows x 2 columns)

When values is a dict, the dictionary values are used to check the column indicated by each key.

>>> df.isin({"col_A": ak.array([0, 3])})
   col_A  col_B
0  False  False
1   True  False (2 rows x 2 columns)

When values is a Series, each column is checked positionally. For True to be returned, the indexes must match.

>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)
   col_A  col_B
0  False  False
1  False   True (2 rows x 2 columns)

When values is a DataFrame, the index and columns must match. Note that 9 is not found because the column name differs.

>>> other_df = ak.DataFrame({"col_A": ak.array([7, 3]), "col_C": ak.array([0, 9])})
>>> df.isin(other_df)
   col_A  col_B
0   True  False
1   True  False (2 rows x 2 columns)
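The dict form matches pandas.DataFrame.isin, where keys name the columns to check; a pandas-only sketch (no server needed) of the dict case above:

```python
import pandas as pd

df = pd.DataFrame({"col_A": [7, 3], "col_B": [1, 9]})

# Keys name columns; columns without a key come back all-False.
result = df.isin({"col_A": [0, 3]})
```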
isna() DataFrame[source]

Detect missing values.

Return a boolean object of the same size indicating whether each value is missing. numpy.nan values are mapped to True. All other values are mapped to False.

Returns:

Boolean mask for each element in the DataFrame indicating whether the value is NA.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame(
...     {
...         "A": [np.nan, 2, 2, 3],
...         "B": [3, np.nan, 5, 6],
...         "C": [1, np.nan, 2, np.nan],
...         "D": ["a", "b", "c", "d"],
...     }
... )
>>> df
     A    B    C  D
0  NaN  3.0  1.0  a
1  2.0  NaN  NaN  b
2  2.0  5.0  2.0  c
3  3.0  6.0  NaN  d (4 rows x 4 columns)
>>> df.isna()
       A      B      C      D
0   True  False  False  False
1  False   True   True  False
2  False  False  False  False
3  False  False   True  False (4 rows x 4 columns)
classmethod load(prefix_path, file_format='INFER')[source]

Load a DataFrame from a file.

The file_format parameter is included for consistency with other Arkouda load functions.

Parameters:
  • prefix_path (str) – The prefix path for the stored data.

  • file_format (str, default="INFER") – File format of the stored data. If "INFER", the format will be inferred automatically.

Returns:

A DataFrame loaded from prefix_path.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path

To store data in <my_dir>/my_data_LOCALE0000, use "<my_dir>/my_data" as the prefix.

>>> my_path = os.path.join(os.getcwd(), "hdf5_output", "my_data")
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.to_parquet(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   B  A
0  0  0
1 -1  1
2 -2  2
3 -3  3
4 -4  4 (5 rows x 2 columns)
memory_usage(index=True, unit='B') arkouda.pandas.series.Series[source]

Return the memory usage of each column.

The memory usage can optionally include the contribution of the index.

Parameters:
  • index (bool, default=True) – Whether to include the memory usage of the DataFrame’s index in the returned Series. If True, the memory usage of the index appears as the first item in the output.

  • unit (str, default="B") – Unit to return. One of {"B", "KB", "MB", "GB"}.

Returns:

A Series whose index contains the original column names and whose values represent the memory usage of each column in the specified unit.

Return type:

Series

Examples

>>> import arkouda as ak
>>> dtypes = {"int64": ak.int64, "float64": ak.float64, "bool": ak.bool_}
>>> data = {t: ak.ones(5000, dtype=dtypes[t]) for t in dtypes}
>>> df = ak.DataFrame(data)
>>> df.head()
   int64  float64  bool
0      1      1.0  True
1      1      1.0  True
2      1      1.0  True
3      1      1.0  True
4      1      1.0  True (5 rows x 3 columns)
>>> df.memory_usage()
Index      40000
int64      40000
float64    40000
bool        5000
dtype: int64
>>> df.memory_usage(index=False)
int64      40000
float64    40000
bool        5000
dtype: int64
>>> df.memory_usage(unit="KB")
Index      39.062500
int64      39.062500
float64    39.062500
bool        4.882812
dtype: float64

To get the approximate total memory usage:

>>> df.memory_usage(index=True).sum()
np.int64(125000)
memory_usage_info(unit='GB')[source]

Return a formatted string representation of the size of this DataFrame.

Parameters:

unit (str, default = "GB") – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.

Returns:

A string representation of the number of bytes used by this DataFrame in [unit]s.

Return type:

str

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': ak.arange(1000), 'col2': ak.arange(1000)})
>>> df.memory_usage_info()
'0.00 GB'
>>> df.memory_usage_info(unit="KB")
'23.44 KB'
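
The '23.44 KB' figure above is consistent with the DataFrame occupying 24000 bytes (two 1000-element int64 columns plus the int64 index), divided by 1024 and rendered with two decimal places. A hedged sketch of that formatting (helper name and the index-inclusion assumption are illustrative):

```python
# Binary unit factors; assumes the reported size includes the index bytes.
_FACTORS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}

def format_usage(nbytes: int, unit: str = "GB") -> str:
    """Render a byte count as a string like '23.44 KB' (hypothetical helper)."""
    return f"{nbytes / _FACTORS[unit]:.2f} {unit}"
```

With 24000 bytes, `format_usage(24000, "KB")` produces "23.44 KB" as in the example.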
merge(right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]

Merge Arkouda DataFrames with a database-style join.

The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).

Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Parameters:
  • right (DataFrame) – The Right DataFrame to be joined.

  • on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.

  • how ({"inner", "left", "right", "outer"}, default = "inner") – The merge condition. Must be one of "inner", "left", "right", or "outer".

  • left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.

  • right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.

  • convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.

  • sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.

Returns:

Joined Arkouda DataFrame.

Return type:

DataFrame

Note

Multiple column joins are only supported for integer columns.

Examples

>>> import arkouda as ak
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> left_df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4 (5 rows x 2 columns)
>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> right_df
   col1  col2
0     0     0
1     2     2
2     4     4
3     6     6
4     8     8 (5 rows x 2 columns)
>>> left_df.merge(right_df, on = "col1")
   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4 (3 rows x 3 columns)
>>> left_df.merge(right_df, on = "col1", how = "left")
   col1  col2_x  col2_y
0     0       0     0.0
1     1      -1     NaN
2     2      -2     2.0
3     3      -3     NaN
4     4      -4     4.0 (5 rows x 3 columns)
>>> left_df.merge(right_df, on = "col1", how = "right")
   col1  col2_x  col2_y
0     0     0.0       0
1     2    -2.0       2
2     4    -4.0       4
3     6     NaN       6
4     8     NaN       8 (5 rows x 3 columns)
>>> left_df.merge(right_df, on = "col1", how = "outer")
   col1  col2_x  col2_y
0     0     0.0     0.0
1     1    -1.0     NaN
2     2    -2.0     2.0
3     3    -3.0     NaN
4     4    -4.0     4.0
5     6     NaN     6.0
6     8     NaN     8.0 (7 rows x 3 columns)
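
For intuition, the inner case above can be modeled in plain Python as a hash join on the key column. This is a sketch of the semantics only, not Arkouda's distributed implementation, and it suffixes every non-key column for brevity (Arkouda only suffixes overlapping names):

```python
def inner_merge(left, right, on):
    """Database-style inner join of two dicts-of-lists on a single key column."""
    # Index the right-hand rows by key value.
    right_rows = {}
    for i, k in enumerate(right[on]):
        right_rows.setdefault(k, []).append(i)
    out = {on: []}
    for col in left:
        if col != on:
            out[col + "_x"] = []
    for col in right:
        if col != on:
            out[col + "_y"] = []
    # Emit one output row per matching (left, right) pair.
    for i, k in enumerate(left[on]):
        for j in right_rows.get(k, []):
            out[on].append(k)
            for col in left:
                if col != on:
                    out[col + "_x"].append(left[col][i])
            for col in right:
                if col != on:
                    out[col + "_y"].append(right[col][j])
    return out
```

Run on the example data (left col1 = 0..4, right col1 = 0,2,4,6,8), this reproduces the inner-merge output rows for keys 0, 2, and 4.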
notna() DataFrame[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. numpy.NaN values get mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...          "C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]})
>>> df
     A    B    C  D
0  NaN  3.0  1.0  a
1  2.0  NaN  NaN  b
2  2.0  5.0  2.0  c
3  3.0  6.0  NaN  d (4 rows x 4 columns)
>>> df.notna()
       A      B      C     D
0  False   True   True  True
1   True  False  False  True
2   True   True   True  True
3   True   True  False  True (4 rows x 4 columns)
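
The element-wise test underlying notna can be illustrated in plain Python: NaN is the only float value that is not equal to itself, which is how missing entries map to False (a sketch, not Arkouda's implementation):

```python
import math

def notna_value(v) -> bool:
    """True for any value that is not NaN; non-floats are never NA here."""
    return not (isinstance(v, float) and math.isnan(v))
```

For instance, `notna_value(float("nan"))` is False while `notna_value(2.0)` and `notna_value("a")` are True, matching the mask above.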
objType = 'DataFrame'
classmethod read_csv(filename: str, col_delim: str = ',')[source]

Read the columns of a CSV file into an Arkouda DataFrame.

If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as Strings objects.

Parameters:
  • filename (str) – Filename to read data from.

  • col_delim (str, default=",") – The delimiter for columns within the data.

Returns:

Arkouda DataFrame containing the columns from the CSV file.

Return type:

DataFrame

Raises:
  • ValueError – Raised if all datasets are not present in all CSV files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

See also

to_csv

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline ("\n") at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Examples

>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output','my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path)
>>> df2 = ak.DataFrame.read_csv(my_path + "_LOCALE0000")
>>> df2
   A  B
0  1  3
1  2  4 (2 rows x 2 columns)
register(user_defined_name: str) DataFrame[source]

Register this DataFrame object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – User defined name the DataFrame is to be registered under. This will be the root name for underlying components.

Returns:

The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name.

Return type:

DataFrame

Raises:
  • TypeError – Raised if user_defined_name is not a str.

  • RegistrationError – If the server was unable to register the DataFrame with the user_defined_name.

See also

unregister, unregister_dataframe_by_name, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
    col1  col2
0     1     4
1     2     5
2     3     6 (3 rows x 2 columns)
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
registered_name = None
rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) DataFrame | None[source]

Rename indexes or columns according to a mapping.

Parameters:
  • mapper (Optional[Union[Callable, Dict]]) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. The value of axis determines whether the mapping is applied to the index or columns.

  • index (Optional[Union[Callable, Dict]]) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • column (Optional[Union[Callable, Dict]]) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • axis (Union[str, int], default=0) –

    Axis to perform the rename operation on.

    • 0 or "index" — rename index values

    • 1 or "column" — rename column names

  • inplace (bool, default=False) – If True, perform the operation on the calling object. If False, return a new object.

Returns:

DataFrame when inplace=False; otherwise None.

Return type:

Optional[DataFrame]

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> df
   A  B
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)

Rename columns using a mapping:

>>> df.rename(column={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)

Rename indexes using a mapping:

>>> df.rename(index={0: 99, 2: 11})
    A  B
99  1  4
1   2  5
11  3  6 (3 rows x 2 columns)

Rename using the axis-style parameter:

>>> df.rename(str.lower, axis="column")
   a  b
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)
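
The mapper semantics above (a callable is applied to every label, while dict keys that match nothing are silently ignored) can be sketched in plain Python:

```python
from typing import Callable, Dict, List, Union

def apply_mapper(labels: List[str], mapper: Union[Callable, Dict]) -> List[str]:
    """Apply a rename mapper: callables hit every label, dicts only matching keys."""
    if callable(mapper):
        return [mapper(lab) for lab in labels]
    # Nonexistent names do not raise; unmatched labels pass through unchanged.
    return [mapper.get(lab, lab) for lab in labels]
```

So `apply_mapper(["A", "B"], {"A": "a", "C": "z"})` yields `["a", "B"]`: the "C" entry is ignored rather than raising, as the parameter description states.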
reset_index(size: int | None = None, inplace: bool = False) None | DataFrame[source]

Set the index to an integer range.

Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.

Parameters:
  • size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DataFrame when inplace=False; otherwise None.

Return type:

DataFrame or None

Note

Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> df
   A  B
0  1  4
1  2  5
2  3  6 (3 rows x 2 columns)
>>> perm_df = df[ak.array([0,2,1])]
>>> perm_df
   A  B
0  1  4
2  3  6
1  2  5 (3 rows x 2 columns)
>>> perm_df.reset_index()
   A  B
0  1  4
1  3  6
2  2  5 (3 rows x 2 columns)
sample(n=5) DataFrame[source]

Return a random sample of n rows.

Parameters:

n (int, default=5) – Number of rows to return.

Returns:

A DataFrame containing n randomly sampled rows.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df
   A  B
0  0  0
1  1 -1
2  2 -2
3  3 -3
4  4 -4 (5 rows x 2 columns)

Random output of size 3:

>>> df.sample(n=3)
   A  B
4  4 -4
3  3 -3
1  1 -1 (3 rows x 2 columns)
property shape

The shape of the dataframe.

Returns:

Tuple of array dimensions.

Return type:

tuple of int

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6 (3 rows x 2 columns)
>>> df.shape
(3, 2)
property size

Return the number of elements in the DataFrame.

Returns:

The number of elements (rows times columns) in the DataFrame.

Return type:

int

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6 (3 rows x 2 columns)
>>> df.size
6
sort_index(ascending=True)[source]

Sort the DataFrame by indexed columns.

Note: Fails on sort order of arkouda.numpy.strings.Strings columns when multiple columns are being sorted.

Parameters:

ascending (bool, default = True) – Sort values in ascending (default) or descending order.

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]},
...          index = ak.Index(ak.array([2,0,1]), name="idx"))
>>> df
     col1  col2
idx
2     1.1     6
0     3.1     5
1     2.1     4 (3 rows x 2 columns)
>>> df.sort_index()
     col1  col2
idx
0     3.1     5
1     2.1     4
2     1.1     6 (3 rows x 2 columns)
sort_values(by=None, ascending=True)[source]

Sort the DataFrame by one or more columns.

If no column is specified, all columns are used.

Note: Fails on sort order of arkouda.numpy.strings.Strings columns when multiple columns are being sorted.

Parameters:
  • by (str or list/tuple of str, default = None) – The name(s) of the column(s) to sort by.

  • ascending (bool, default = True) – Sort values in ascending (default) or descending order.

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> df
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7 (3 rows x 3 columns)
>>> df.sort_values()
   col1  col2  col3
2     1     3     7
0     2     3     5
1     2     4     6 (3 rows x 3 columns)
>>> df.sort_values("col3")
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7 (3 rows x 3 columns)
tail(n=5)[source]

Return the last n rows.

This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.

Parameters:

n (int, default=5) – Number of rows to select.

Returns:

The last n rows of the DataFrame.

Return type:

DataFrame

See also

arkouda.pandas.dataframe.head

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9 (10 rows x 2 columns)
>>> df.tail()
   col1  col2
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9 (5 rows x 2 columns)
>>> df.tail(n=2)
   col1  col2
8     8    -8
9     9    -9 (2 rows x 2 columns)
to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]

Write DataFrame to CSV file(s).

File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • index (bool, default=False) – If True, the index of the DataFrame will be written to the file as a column.

  • columns (list of str (Optional)) – Column names to assign when writing data.

  • col_delim (str, default=",") – Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Raises:
  • ValueError – Raised if all datasets are not present in all CSV files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline ("\n") at this time.

Examples

>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path)
>>> df2 = ak.DataFrame.read_csv(my_path + "_LOCALE0000")
>>> df2
   A  B
0  1  3
1  2  4 (2 rows x 2 columns)
to_hdf(path, index=False, columns=None, file_type: Literal['single', 'distribute'] = 'distribute')[source]

Save DataFrame to disk as hdf5, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default = None) – List of columns to include in the file. If None, writes out all columns.

  • file_type ({"single", "distribute"}) – Whether to save to a single file or distribute across Locales. Default is “distribute”.

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_parquet, load

Examples

>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   A  B
0  1  3
1  2  4 (2 rows x 2 columns)
to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]

Print the DataFrame in a Markdown-friendly format.

Parameters:
  • mode (str, optional) – Mode in which the file is opened, by default "wt".

  • index (bool, optional, default=True) – Whether to include index (row) labels.

  • tablefmt (str, default="grid") – Table format passed to tabulate. See https://pypi.org/project/tabulate/ for available formats.

  • storage_options (dict, optional) – Extra options for a particular storage connection (for example host, port, username, password) when using a URL handled by fsspec such as "s3://", "gcs://". An error will be raised if this argument is provided with a non-fsspec URL. See the fsspec and backend storage implementation documentation for the set of allowed keys and values.

  • **kwargs – Additional keyword arguments passed to tabulate.

Notes

This function should only be used with small DataFrames because it calls pandas.DataFrame.to_markdown internally: https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.to_markdown.html

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
>>> print(df.to_markdown())
+----+------------+------------+
|    | animal_1   | animal_2   |
+====+============+============+
|  0 | elk        | dog        |
+----+------------+------------+
|  1 | pig        | quetzal    |
+----+------------+------------+

Suppress the index:

>>> print(df.to_markdown(index=False))
+------------+------------+
| animal_1   | animal_2   |
+============+============+
| elk        | dog        |
+------------+------------+
| pig        | quetzal    |
+------------+------------+
to_pandas(datalimit=maxTransferBytes, retain_index=False)[source]

Send this DataFrame to a pandas DataFrame.

Parameters:
  • datalimit (int, default=arkouda.core.client.maxTransferBytes) – The maximum size, in megabytes, to transfer. The DataFrame will be converted to a pandas DataFrame only if its estimated size does not exceed this value.

  • retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.

Returns:

The result of converting this DataFrame to a pandas DataFrame.

Return type:

pandas.DataFrame

Examples

>>> import arkouda as ak
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
<class 'arkouda...DataFrame'>
>>> ak_df
   A  B
0  0  0
1  1 -1 (2 rows x 2 columns)
>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
<class 'pandas...DataFrame'>
>>> pd_df
   A  B
0  0  0
1  1 -1
to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)[source]

Save DataFrame to disk as parquet, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (list) – List of columns to include in the file. If None, writes out all columns.

  • compression (str (Optional), default=None) – Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

  • convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_hdf, load

Examples

>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'parquet_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_parquet(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   B  A
0  3  1
1  4  2 (2 rows x 2 columns)
transfer(hostname, port)[source]

Send a DataFrame to a different Arkouda server.

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.

  • port (int_scalars) – The port over which to send the array. This must be an open port (i.e., not one that the Arkouda server is running on). Arkouda opens numLocales ports in succession, one per locale, so ports {port..(port+numLocales-1)} are used (e.g., for a 4-node Arkouda server with port 1234 passed as port, Arkouda uses ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().

Returns:

A message indicating a complete transfer.

Return type:

str

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
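
The port arithmetic described above can be made concrete: per the example, a 4-locale server with base port 1234 uses ports 1234 through 1237. The helper below is illustrative, not part of the Arkouda API:

```python
def transfer_ports(base_port: int, num_locales: int) -> list:
    """Ports opened in succession to send array data, one per locale."""
    return list(range(base_port, base_port + num_locales))
```

For example, `transfer_ports(1234, 4)` gives `[1234, 1235, 1236, 1237]`, the port range the receiving side must have open.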

unregister()[source]

Unregister this DataFrame object in the arkouda server.

Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach().

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister.

See also

register, unregister_dataframe_by_name, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
    col1  col2
0     1     4
1     2     5
2     3     6 (3 rows x 2 columns)
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)[source]

Overwrite the dataset with the name provided with this dataframe.

If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.

  • repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Returns:

Success message if successful.

Return type:

str

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.

If the dataset provided does not exist, it will be added.

Examples

>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   A  B
0  1  3
1  2  4 (2 rows x 2 columns)
>>> df2 = ak.DataFrame({"A":[5,6],"B":[7,8]})
>>> df2.update_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   A  B
0  5  7
1  6  8 (2 rows x 2 columns)
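
The _LOCALE#### convention mentioned in the notes (and visible in the read_csv example's my_data_LOCALE0000 filename) can be checked with a simple pattern; the helper name is hypothetical:

```python
import re

def looks_distributed(filename: str) -> bool:
    """True when a filename carries a per-locale _LOCALE#### suffix."""
    return re.search(r"_LOCALE\d{4}$", filename) is not None
```

So "my_data_LOCALE0000" would be treated as a distributed write, while a bare "my_data" would not.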
update_nrows()[source]

Compute the number of rows on the arkouda server and update the size parameter.

class arkouda.DataFrameGroupBy(gb, df, gb_key_names=None, as_index=True)[source]

A DataFrame that has been grouped by a subset of columns.

Parameters:
  • gb_key_names (str or list(str), default=None) – The column name(s) associated with the aggregated columns.

  • as_index (bool, default=True) – If True, interpret aggregated column as index (only implemented for single dimensional aggregates). Otherwise, treat aggregated column as a dataframe column.

gb

GroupBy object, where the aggregation keys are values of column(s) of a dataframe, usually in preparation for aggregating with respect to the other columns.

Type:

GroupBy

df

The dataframe containing the original data.

Type:

DataFrame

gb_key_names

The column name(s) associated with the aggregated columns.

Type:

Union[str, List[str]]

as_index

If True the grouped values of the aggregation keys will be treated as an index. Defaults to True.

Type:

bool

all(colnames=None)[source]
all_non_nan = False
and_(colnames=None)[source]
any(colnames=None)[source]
argmax(colnames=None)[source]
argmin(colnames=None)[source]
as_index: bool
broadcast(x, permute=True)[source]

Fill each group’s segment with a constant value.

Parameters:
  • x (Series or pdarray) – The values to put in each group’s segment.

  • permute (bool, default=True) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

A Series with the Index of the original frame and the values of the broadcast.

Return type:

arkouda.pandas.series.Series

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.dataframe import DataFrameGroupBy
>>> df = ak.DataFrame({"A":[1,2,2,3],"B":[3,4,5,6]})
>>> df
   A  B
0  1  3
1  2  4
2  2  5
3  3  6 (4 rows x 2 columns)
>>> gb = df.groupby("A")
>>> x = ak.array([10,11,12])
>>> s = DataFrameGroupBy.broadcast(gb, x)
>>> df["C"] = s.values
>>> df
   A  B   C
0  1  3  10
1  2  4  11
2  2  5  11
3  3  6  12 (4 rows x 3 columns)
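
The broadcast above maps one value per group key back onto every row of that group. In plain Python, with the broadcast values aligned to the sorted unique keys, the permute=True behavior can be sketched as:

```python
def broadcast_by_key(keys, group_values):
    """Fill each row with its group's value; group_values align with sorted unique keys."""
    lut = dict(zip(sorted(set(keys)), group_values))
    return [lut[k] for k in keys]
```

With the example's keys [1, 2, 2, 3] and values [10, 11, 12], this reproduces the C column [10, 11, 11, 12].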
count(colnames=None)[source]
df: DataFrame
diff(colname)[source]

Create a difference aggregate for the given column.

For each group, the difference between successive values is calculated. Aggregate operations (mean, min, max, std, var) can be done on the results.

Parameters:

colname (str) – Name of the column to compute the difference on.

Returns:

Object containing the differences, which can be aggregated.

Return type:

DiffAggregate

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"A":[1,2,2,2,3,3],"B":[3,9,11,27,86,100]})
>>> df
   A    B
0  1    3
1  2    9
2  2   11
3  2   27
4  3   86
5  3  100 (6 rows x 2 columns)
>>> gb = df.groupby("A")
>>> gb.diff("B").values
array([nan nan 2.00000000000000000 16.00000000000000000 nan 14.00000000000000000])
dropna
first(colnames=None)[source]
gb: arkouda.pandas.groupbyclass.GroupBy
gb_key_names: str | List[str]
head(n: int = 5, sort_index: bool = True) DataFrame[source]

Return the first n rows from each group.

Parameters:
  • n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the values from that group will be returned.

  • sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"a":ak.arange(10) %3 , "b":ak.arange(10)})
>>> df
   a  b
0  0  0
1  1  1
2  2  2
3  0  3
4  1  4
5  2  5
6  0  6
7  1  7
8  2  8
9  0  9 (10 rows x 2 columns)
>>> df.groupby("a").head(2)
   a  b
0  0  0
1  1  1
2  2  2
3  0  3
4  1  4
5  2  5 (6 rows x 2 columns)
max(colnames=None)[source]
mean(colnames=None)[source]
median(colnames=None)[source]
min(colnames=None)[source]
mode(colnames=None)[source]
nunique(colnames=None)[source]
or_(colnames=None)[source]
prod(colnames=None)[source]
sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]

Return a random sample from each group.

You can specify either the number of elements to sample or the fraction of elements to sample. random_state can be used for reproducibility.

Parameters:
  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Defaults to 1 if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default=False) – Whether to allow sampling of the same row more than once.

  • weights (pdarray, optional) – If None, all rows are given equal probability. If a pdarray is provided, it must have the same length as the underlying DataFrame and will be used as sampling probabilities after normalization within each group. Weights must be non-negative, with at least one positive element in each group.

  • random_state (int or ak.random.Generator, optional) – If an int, it is used as the seed for the random number generator. If an ak.random.Generator, it is used directly.

Returns:

A new DataFrame containing items randomly sampled from each group, sorted according to the grouped columns.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"A": [3, 1, 2, 1, 2, 3], "B": [3, 4, 5, 6, 7, 8]})
>>> df
   A  B
0  3  3
1  1  4
2  2  5
3  1  6
4  2  7
5  3  8 (6 rows x 2 columns)

Sample one row per group:

>>> df.groupby("A").sample(random_state=6)
   A  B
3  1  6
4  2  7
5  3  8 (3 rows x 2 columns)

Sample a fraction of each group with weights:

>>> df.groupby("A").sample(
...     frac=0.5,
...     random_state=3,
...     weights=ak.array([1, 1, 1, 0, 0, 0]),
... )
   A  B
1  1  4
2  2  5
0  3  3 (3 rows x 2 columns)

Sample with replacement:

>>> df.groupby("A").sample(
...     n=3,
...     replace=True,
...     random_state=ak.random.default_rng(7),
... )
   A  B
1  1  4
3  1  6
1  1  4
4  2  7
4  2  7
4  2  7
0  3  3
5  3  8
5  3  8 (9 rows x 2 columns)
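
The weights contract above (non-negative, at least one positive element per group, normalized within each group) can be sketched for a single group as follows; the helper is illustrative only:

```python
def normalize_group_weights(weights):
    """Validate and normalize sampling weights within one group."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("each group needs at least one positive weight")
    return [w / total for w in weights]
```

For example, weights [1, 1, 2] normalize to probabilities [0.25, 0.25, 0.5] within the group.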
size(as_series=None, sort_index=True)[source]

Compute the size of each group as the total number of rows, including NaN values.

Parameters:
  • as_series (bool, default=None) – Indicates whether to return arkouda.pandas.dataframe.DataFrame (if as_series = False) or arkouda.pandas.series.Series (if as_series = True)

  • sort_index (bool, default=True) – If True, results will be returned with index values sorted in ascending order.

Return type:

arkouda.pandas.dataframe.DataFrame or arkouda.pandas.series.Series

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"A":[1,2,2,3],"B":[3,4,5,6]})
>>> df
   A  B
0  1  3
1  2  4
2  2  5
3  3  6 (4 rows x 2 columns)
>>> df.groupby("A").size(as_series = False)
   size
A
1     1
2     2
3     1 (3 rows x 1 columns)
std(colnames=None)[source]
sum(colnames=None)[source]
tail(n: int = 5, sort_index: bool = True) DataFrame[source]

Return the last n rows from each group.

Parameters:
  • n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the rows from that group will be returned.

  • sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> df = ak.DataFrame({"a": ak.arange(10) % 3, "b": ak.arange(10)})
>>> df
   a  b
0  0  0
1  1  1
2  2  2
3  0  3
4  1  4
5  2  5
6  0  6
7  1  7
8  2  8
9  0  9 (10 rows x 2 columns)
>>> df.groupby("a").tail(2)
   a  b
4  1  4
5  2  5
6  0  6
7  1  7
8  2  8
9  0  9 (6 rows x 2 columns)
unique(colnames=None)[source]
var(colnames=None)[source]
where_not_nan = None
xor(colnames=None)[source]
class arkouda.DatetimeAccessor(series)[source]

Bases: Properties

Accessor for datetime-like operations on Arkouda Series.

Provides datetime methods such as .floor(), .ceil(), and .round(), mirroring the .dt accessor in pandas.

This accessor is automatically attached to Series objects that wrap arkouda.Datetime values. It should not be instantiated directly.

Parameters:

series (arkouda.pandas.Series) – The Series object containing Datetime values.

Raises:

AttributeError – If the underlying Series values are not of type arkouda.Datetime.

Examples

>>> import arkouda as ak
>>> from arkouda import Datetime, Series
>>> s = Series(Datetime(ak.array([1_000_000_000_000])))
>>> s.dt.floor("D")
0   1970-01-01
dtype: datetime64[ns]
series
class arkouda.DiffAggregate(gb, series)[source]

A column in a GroupBy that has been differenced.

Aggregation operations can be done on the result.

gb

GroupBy object, where the aggregation keys are values of column(s) of a dataframe.

Type:

GroupBy

values

A column to compute the difference on.

Type:

pdarray

all()[source]
and_()[source]
any()[source]
argmax()[source]
argmin()[source]
count()[source]
first()[source]
gb: arkouda.pandas.groupbyclass.GroupBy
max()[source]
mean()[source]
median()[source]
min()[source]
mode()[source]
nunique()[source]
or_()[source]
prod()[source]
std()[source]
sum()[source]
unique()[source]
values: arkouda.numpy.pdarrayclass.pdarray
var()[source]
xor()[source]
class arkouda.Index(values: List | arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical | pandas.Index | Index | pandas.Categorical, name: str | None = None, allow_list=False, max_list_size=1000)[source]

Sequence used for indexing and alignment.

The basic object storing axis labels for all DataFrame objects.

Parameters:
  • values (List, pdarray, Strings, Categorical, pandas.Categorical, pandas.Index, or Index)

  • name (str, default=None) – Name to be stored in the index.

  • allow_list (bool, default=False) – If False, list values will be converted to a pdarray. If True, list values will remain as a list, provided the data length is less than max_list_size.

  • max_list_size (int, default=1000) – The maximum allowed data length for the values to be stored as a list object.

Raises:

ValueError – Raised if allow_list=True and the size of values is > max_list_size.

See also

MultiIndex

Examples

>>> import arkouda as ak
>>> ak.Index([1, 2, 3])
Index(array([1 2 3]), dtype='int64')
>>> ak.Index(list('abc'))
Index(array(['a', 'b', 'c']), dtype='<U0')
>>> ak.Index([1, 2, 3], allow_list=True)
Index([1, 2, 3], dtype='int64')
argsort(ascending: bool = True) list | arkouda.numpy.pdarrayclass.pdarray[source]

Return the permutation that sorts the Index.

Parameters:

ascending (bool, optional) – If True (default), sort in ascending order. If False, sort in descending order.

Returns:

Indices that would sort the Index.

Return type:

list or pdarray

Examples

>>> import arkouda as ak
>>> idx = ak.Index([10, 3, 5])
>>> idx.argsort()
array([1 2 0])
concat(other)[source]

Concatenate this Index with another Index.

Parameters:

other (Index) – The Index to concatenate with this one.

Returns:

A new Index with values from both indices.

Return type:

Index

Raises:

TypeError – If the types of the two Index objects do not match.
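
Examples

Illustrative usage (assumes a connected arkouda server):

>>> import arkouda as ak
>>> idx1 = ak.Index([1, 2, 3])
>>> idx2 = ak.Index([4, 5])
>>> idx1.concat(idx2)
Index(array([1 2 3 4 5]), dtype='int64')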

equals(other: Index) arkouda.numpy.dtypes.bool_scalars[source]

Whether Indexes are the same size, and all entries are equal.

Parameters:

other (Index) – object to compare.

Returns:

True if the Indexes are the same, otherwise False.

Return type:

bool_scalars

Examples

>>> import arkouda as ak
>>> i = ak.Index([1, 2, 3])
>>> i_cpy = ak.Index([1, 2, 3])
>>> i.equals(i_cpy)
np.True_
>>> i2 = ak.Index([1, 2, 4])
>>> i.equals(i2)
np.False_

MultiIndex case:

>>> arrays = [ak.array([1, 1, 2, 2]), ak.array(["red", "blue", "red", "blue"])]
>>> m = ak.MultiIndex(arrays, names=["numbers2", "colors2"])
>>> m.equals(m)
True
>>> arrays2 = [ak.array([1, 1, 2, 2]), ak.array(["red", "blue", "red", "green"])]
>>> m2 = ak.MultiIndex(arrays2, names=["numbers2", "colors2"])
>>> m.equals(m2)
False
static factory(index)[source]

Construct an Index or MultiIndex based on the input.

Parameters:

index (array-like or tuple of array-like) – If a single array-like, returns an Index. If a tuple of array-like objects, returns a MultiIndex.

Returns:

An Index if input is a single array-like, or a MultiIndex otherwise.

Return type:

Index or MultiIndex
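
Examples

Illustrative usage (assumes a connected arkouda server):

>>> import arkouda as ak
>>> ak.Index.factory(ak.array([1, 2, 3]))
Index(array([1 2 3]), dtype='int64')
>>> mi = ak.Index.factory((ak.array([1, 2]), ak.array(["a", "b"])))
>>> type(mi).__name__
'MultiIndex'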

classmethod from_return_msg(rep_msg)[source]

Reconstruct an Index or MultiIndex from a return message.

Parameters:

rep_msg (str) – A string return message containing encoded index information.

Returns:

The reconstructed Index or MultiIndex instance.

Return type:

Index or MultiIndex

property inferred_type: str

Return a string of the type inferred from the values.

is_registered()[source]

Return whether the object is registered.

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

See also

register, attach, unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property is_unique

Property indicating if all values in the index are unique.

Return type:

bool - True if all values are unique, False otherwise.

lookup(key)[source]

Check for presence of key(s) in the Index.

Parameters:

key (pdarray or scalar) – The value(s) to look up in the Index. If a scalar is provided, it will be converted to a one-element array.

Returns:

A boolean array of length len(self), indicating which entries of the Index are present in key.

Return type:

pdarray

Raises:

TypeError – If key cannot be converted to an arkouda array.
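
Examples

Illustrative usage (assumes a connected arkouda server); a scalar key is converted to a one-element array:

>>> import arkouda as ak
>>> idx = ak.Index([1, 2, 3])
>>> idx.lookup(2)
array([False True False])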

map(arg: dict | arkouda.pandas.series.Series) Index[source]

Map values of Index according to an input mapping.

Parameters:

arg (dict or Series) – The mapping correspondence.

Returns:

A new index with the values transformed by the mapping correspondence.

Return type:

arkouda.pandas.index.Index

Raises:

TypeError – Raised if arg is not of type dict or arkouda.pandas.Series. Raised if index values not of type pdarray, Categorical, or Strings.

Examples

>>> import arkouda as ak
>>> idx = ak.Index(ak.array([2, 3, 2, 3, 4]))
>>> idx
Index(array([2 3 2 3 4]), dtype='int64')
>>> idx.map({4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})
Index(array([30.00000000000000000 5.00000000000000000 30.00000000000000000
5.00000000000000000 25.00000000000000000]), dtype='float64')
>>> s2 = ak.Series(ak.array(["a","b","c","d"]), index = ak.array([4,2,1,3]))
>>> idx.map(s2)
Index(array(['b', 'd', 'b', 'd', 'a']), dtype='<U0')
max_list_size = 1000
memory_usage(unit='B')[source]

Return the memory usage of the Index values.

Parameters:

unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

Bytes of memory consumed.

Return type:

int

Examples

>>> import arkouda as ak
>>> idx = ak.Index(ak.array([1, 2, 3]))
>>> idx.memory_usage()
24
property names

Return Index or MultiIndex names.

property ndim

Number of dimensions of the underlying data, by definition 1.

See also

MultiIndex.ndim

property nlevels

Integer number of levels in this Index.

An Index will always have 1 level.

objType = 'Index'
register(user_defined_name)[source]

Register this Index object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – User-defined name the Index is to be registered under; this will be the root name for underlying components.

Returns:

The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.

Return type:

Index

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Index with the user_defined_name

See also

unregister, attach, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
set_dtype(dtype)[source]

Change the data type of the index.

Currently only aku.ip_address and ak.array are supported.

property shape

Return the shape of the Index.

Returns:

A tuple representing the shape of the Index (size,).

Return type:

tuple

sort_values(return_indexer: bool = False, ascending: bool = True, na_position: str = 'last') Index | Tuple[Index, arkouda.numpy.pdarrayclass.pdarray | list][source]

Return a sorted copy of the index.

Parameters:
  • return_indexer (bool, default False) – If True, also return the integer positions that sort the index.

  • ascending (bool, default True) – Sort in ascending order. Use False for descending.

  • na_position ({'first', 'last'}, default 'last') – Where to position NaNs. ‘first’ puts NaNs at the beginning, ‘last’ at the end.

Returns:

sorted_index : arkouda.Index

A new Index whose values are sorted.

indexer : Union[arkouda.pdarray, list], optional

The indices that would sort the original index. Only returned when return_indexer=True.

Return type:

Union[Index, Tuple[Index, Union[pdarray, list]]]

Examples

>>> import arkouda as ak
>>> idx = ak.Index([10, 100, 1, 1000])
>>> idx
Index(array([10 100 1 1000]), dtype='int64')

Sort in ascending order (default):

>>> idx.sort_values()
Index(array([1 10 100 1000]), dtype='int64')

Sort in descending order and get the sort positions:

>>> idx.sort_values(ascending=False, return_indexer=True)
(Index(array([1000 100 10 1]), dtype='int64'), array([3 1 0 2]))

to_csv(prefix_path: str, dataset: str = 'index', col_delim: str = ',', overwrite: bool = False)[source]

Write Index to CSV file(s).

File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str) – Column name to save the pdarray under. Defaults to "index".

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

str response message

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server. Raised if the Index values are a list.

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline ("\n") at this time.

to_dict(label)[source]

Convert the Index to a dictionary with a specified label.

Parameters:

label (str or list of str) – The key to use in the resulting dictionary. If a list is provided, only the first element is used. If None, defaults to “idx”.

Returns:

A dictionary with the label as the key and the Index as the value.

Return type:

dict
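
Examples

Illustrative usage (assumes a connected arkouda server); the resulting dictionary maps the label to the Index:

>>> import arkouda as ak
>>> idx = ak.Index([1, 2, 3])
>>> d = idx.to_dict("rows")
>>> list(d.keys())
['rows']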

to_hdf(prefix_path: str, dataset: str = 'index', mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute') str[source]

Save the Index to HDF5.

The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale. This is only supported by HDF5 files and has no impact on Parquet files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • TypeError – Raised if the Index values are a list.

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

to_ndarray()[source]

Convert the Index values to a NumPy ndarray.

Returns:

A NumPy array representation of the Index values.

Return type:

numpy.ndarray
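
Examples

Illustrative usage (assumes a connected arkouda server):

>>> import arkouda as ak
>>> idx = ak.Index([1, 2, 3])
>>> idx.to_ndarray()
array([1, 2, 3])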

to_pandas()[source]

Convert this Arkouda-backed index wrapper to an equivalent pandas Index.

This method materializes the underlying values into a local NumPy array (or pandas Categorical, when applicable) and returns the corresponding pandas Index (or CategoricalIndex).

Returns:

A pandas Index representing the same logical values. For categorical data, a pandas.CategoricalIndex is returned.

Return type:

pandas.Index

Notes

  • If the underlying values are categorical, this returns a pandas.CategoricalIndex.

  • For unicode string-like data (or object arrays inferred as strings), this attempts to return a pandas “string” dtype Index to match pandas’ missing-value behavior (e.g., NA handling).

  • Fixed-width bytes data is preserved as bytes (no implicit decoding).

Examples

>>> import arkouda as ak
>>> import pandas
>>> idx = ak.Index(ak.array([1,2,3]))
>>> pidx = idx.to_pandas()
>>> pidx.dtype
dtype('<i8')
to_parquet(prefix_path: str, dataset: str = 'index', mode: Literal['truncate', 'append'] = 'truncate', compression: str | None = None)[source]

Save the Index to Parquet.

The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode ({'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • TypeError – Raised if the Index values are a list.

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales.

  • 'append' write mode is supported, but is not efficient.

  • If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

tolist()[source]

Convert the Index values to a Python list.

Returns:

A list containing the Index values.

Return type:

list
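
Examples

Illustrative usage (assumes a connected arkouda server):

>>> import arkouda as ak
>>> idx = ak.Index([1, 2, 3])
>>> idx.tolist()
[1, 2, 3]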

unregister()[source]

Unregister this Index object in the arkouda server.

Unregister this Index object in the arkouda server, which was previously registered using register() and/or attached to using attach().

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

See also

register, attach, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)[source]

Overwrite the dataset with the name provided with this Index object.

If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Raises:

RuntimeError – Raised if a server-side error is thrown saving the index

Notes

  • If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

class arkouda.LogLevel(*args, **kwds)[source]

Bases: enum.Enum

Enum for defining valid log levels used by ArkoudaLogger.

Members

INFO : str

Confirmation that things are working as expected.

DEBUG : str

Detailed information, typically of interest only when diagnosing problems.

WARN : str

An indication that something unexpected happened, or indicative of some problem.

ERROR : str

A more serious problem; the software has not been able to perform some function.

CRITICAL : str

An extremely serious error, indicating the program itself may be unable to continue.

Notes

This enum provides a controlled vocabulary for setting log levels on ArkoudaLogger instances. These are mapped internally to the standard Python logging levels.

CRITICAL = 'CRITICAL'
DEBUG = 'DEBUG'
ERROR = 'ERROR'
INFO = 'INFO'
WARN = 'WARN'
class arkouda.MultiIndex(data: list | tuple | pandas.MultiIndex | MultiIndex, name: str | None = None, names: Iterable[Hashable | None] | None = None)[source]

Bases: Index

A multi-level, or hierarchical, index object for Arkouda DataFrames and Series.

A MultiIndex allows you to represent multiple dimensions of indexing using a single object, enabling advanced indexing and grouping operations.

This class mirrors the behavior of pandas’ MultiIndex while leveraging Arkouda’s distributed data structures. Internally, it stores a list of Index objects, each representing one level of the hierarchy.

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.index import MultiIndex
>>> a = ak.array([1, 2, 3])
>>> b = ak.array(['a', 'b', 'c'])
>>> mi = MultiIndex([a, b])
>>> mi[1]
MultiIndex([np.int64(2), np.str_('b')])
argsort(ascending=True)[source]

Return the indices that would sort the MultiIndex.

Parameters:

ascending (bool, default True) – If False, the result is in descending order.

Returns:

An array of indices that would sort the MultiIndex.

Return type:

pdarray

concat(other)[source]

Concatenate this MultiIndex with another, preserving duplicates and order.

Parameters:

other (MultiIndex) – The other MultiIndex to concatenate with.

Returns:

A new MultiIndex containing values from both inputs, preserving order.

Return type:

MultiIndex

Raises:

TypeError – If the type of other does not match.

property dtype: numpy.dtype

Return the dtype object of the underlying data.

equal_levels(other: MultiIndex) bool[source]

Return True if the levels of both MultiIndex objects are the same.

get_level_values(level: str | int)[source]

Return the values at a particular level of the MultiIndex.

Parameters:

level (int or str) – The level number or name. If a string is provided, it must match an entry in self.names.

Returns:

An Index object corresponding to the requested level.

Return type:

Index

Raises:
  • RuntimeError – If self.names is None and a string level is provided.

  • ValueError – If the provided string is not in self.names, or if the level index is out of bounds.
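
Examples

Illustrative usage (assumes a connected arkouda server):

>>> import arkouda as ak
>>> m = ak.MultiIndex([ak.array([1, 2]), ak.array(["a", "b"])], names=["num", "let"])
>>> lvl = m.get_level_values("let")
>>> lvl.tolist()
['a', 'b']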

property index

Return the levels of the MultiIndex.

Returns:

A list of Index objects representing the levels of the MultiIndex.

Return type:

list

property inferred_type: str

Return the inferred type of the MultiIndex.

Returns:

The string “mixed”, indicating the MultiIndex may contain multiple types.

Return type:

str

is_registered()[source]

Check if the MultiIndex is registered with the Arkouda server.

Returns:

True if the MultiIndex has a registered name and is recognized by the server, False otherwise.

Return type:

bool

levels: list[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical]
lookup(key: list[Any] | tuple[Any, ...]) arkouda.pandas.groupbyclass.groupable[source]

Perform element-wise lookup on the MultiIndex.

Parameters:

key (list or tuple) –

A sequence of values, one for each level of the MultiIndex.

  • If the elements are scalars (e.g., (1, "red")), they are treated as a single row key: the result is a boolean mask over rows where all levels match the corresponding scalar.

  • If the elements are arkouda arrays (e.g., list of pdarrays / Strings), they must align one-to-one with the levels, and the lookup is delegated to in1d(self.index, key) for multi-column membership.

Returns:

A boolean array indicating which rows in the MultiIndex match the key.

Return type:

groupable

Raises:
  • TypeError – If key is not a list or tuple.

  • ValueError – If the length of key does not match the number of levels.
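
Examples

Scalar-key lookup; illustrative (assumes a connected arkouda server):

>>> import arkouda as ak
>>> m = ak.MultiIndex([ak.array([1, 1, 2]), ak.array(["red", "blue", "red"])])
>>> m.lookup((1, "red"))
array([True False False])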

memory_usage(unit='B')[source]

Return the memory usage of the MultiIndex levels.

Parameters:

unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

Bytes of memory consumed.

Return type:

int

Examples

>>> import arkouda as ak
>>> m = ak.MultiIndex([ak.array([1,2,3]), ak.array([4,5,6])])
>>> m.memory_usage()
48
property name

Return Index or MultiIndex name.

property names

Return Index or MultiIndex names.

property ndim

Number of dimensions of the underlying data, by definition 1.

See also

Index.ndim

property nlevels: int

Integer number of levels in this MultiIndex.

See also

Index.nlevels

objType = 'MultiIndex'
register(user_defined_name)[source]

Register this Index object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – User-defined name the Index is to be registered under; this will be the root name for underlying components.

Returns:

The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.

Return type:

MultiIndex

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Index with the user_defined_name

See also

unregister, attach, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None
set_dtype(dtype)[source]

Change the data type of the index.

Currently only aku.ip_address and ak.array are supported.

size: arkouda.numpy.dtypes.int_scalars
to_dict(labels=None)[source]

Convert the MultiIndex to a dictionary representation.

Parameters:

labels (list of str, optional) – A list of column names for the index levels. If not provided, defaults to [‘idx_0’, ‘idx_1’, …, ‘idx_n’].

Returns:

A dictionary mapping each label to the corresponding Index object.

Return type:

dict

to_hdf(prefix_path: str, dataset: str = 'index', mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute') str[source]

Save the Index to HDF5.

The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode ({'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • file_type ({"single" | "distribute"}) – Default: "distribute". When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale. This is only supported by HDF5 files and has no impact on Parquet files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

to_ndarray()[source]

Convert the MultiIndex to a NumPy ndarray of arrays.

Returns:

A NumPy array where each element is an array corresponding to one level of the MultiIndex. Categorical levels are converted to their underlying arrays.

Return type:

numpy.ndarray

to_pandas()[source]

Convert the MultiIndex to a pandas.MultiIndex object.

Returns:

A pandas MultiIndex with the same levels and names.

Return type:

pandas.MultiIndex

Notes

Categorical levels are converted to pandas categorical arrays, while others are converted to NumPy arrays.

tolist()[source]

Convert the MultiIndex to a list of lists.

Returns:

A list of Python lists, where each inner list corresponds to one level of the MultiIndex.

Return type:

list

unregister()[source]

Unregister this MultiIndex from the Arkouda server.

Raises:

RegistrationError – If the MultiIndex is not currently registered.

update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)[source]

Overwrite the dataset with the name provided with this Index object.

If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the index

  • TypeError – Raised if the Index levels are a list.

Notes

  • If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

class arkouda.Power_divergenceResult[source]

Bases: Power_divergenceResult

The results of a power divergence statistical test.

statistic
Type:

float64

pvalue
Type:

float64

class arkouda.Properties[source]

Base class for accessor implementations in Arkouda.

Provides the _make_op class method to dynamically generate accessor methods that wrap underlying Strings or Datetime operations and return new Series.

Notes

This class is subclassed by StringAccessor and DatetimeAccessor, and is not intended to be used directly.

Examples

Subclasses should define _make_op(“operation_name”), which will generate a method that applies series.values.operation_name(…) and returns a new Series.

arkouda.RegisteredSymbols = '__RegisteredSymbols__'
class arkouda.Row(dict=None, /, **kwargs)[source]

Bases: collections.UserDict

Dictionary-like representation of a single row in an Arkouda DataFrame.

Wraps the column-to-value mapping for one row and provides convenient ASCII and HTML formatting for display.

Parameters:

data (dict) – Mapping of column names to their corresponding values for this row.

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.row import Row
>>> df = ak.DataFrame({"x": ak.array([10, 20]), "y": ak.array(["a", "b"])})

Suppose df[0] returns {"x": 10, "y": "a"}:

>>> row = Row({"x": 10, "y": "a"})
>>> print(row)
keys    values
------  --------
x       10
y       a
class arkouda.Series(data: Tuple | List | arkouda.pandas.groupbyclass.groupable_element_type | Series | arkouda.numpy.segarray.SegArray | pandas.Series | pandas.Categorical, name=None, index: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | Tuple | List | arkouda.pandas.index.Index | None = None)[source]

One-dimensional Arkouda array with axis labels.

Parameters:
  • index (pdarray or Strings, optional) – An array of indices associated with the data array. If not provided (or empty), it defaults to a range of ints whose size matches the size of the data.

  • data (tuple, list, groupable_element_type, Series, or SegArray) – A 1D array-like. Must not be None.

Raises:
  • TypeError – Raised if index is not a pdarray or Strings object. Raised if data is not a supported type.

  • ValueError – Raised if the index size does not match the data size.

Notes

The Series class accepts either positional arguments or keyword arguments.

Positional arguments
  • Series(data): data is provided and an index is generated automatically.

  • Series(data, index): both data and index are provided.

Keyword arguments
  • Series(data=..., index=...): index is optional but must match the size of data when provided.

add(b: Series) Series[source]
argmax()[source]
argmin()[source]
property at: _LocIndexer

Accesses entries of a Series by label.

Returns:

An indexer for label-based access to Series entries.

Return type:

_LocIndexer

static concat(arrays: List, axis: int = 0, index_labels: List[str] | None = None, value_labels: List[str] | None = None, ordered: bool = False) arkouda.pandas.dataframe.DataFrame | Series[source]

Concatenate a list of Arkouda Series or grouped arrays horizontally or vertically.

If a list of grouped Arkouda arrays is passed, they are converted to Series. Each grouping is a 2-tuple where the first item is the key(s) and the second is the value. If concatenating horizontally (axis=1), all series/groupings must have the same length and the same index. The index is converted to a column in the resulting DataFrame; if it’s a MultiIndex, each level is converted to a separate column.

Parameters:
  • arrays (List) – A list of Series or groupings (tuples of index and values) to concatenate.

  • axis (int) – The axis to concatenate along: - 0 = vertical (stack series into one) - 1 = horizontal (align by index and produce a DataFrame) Defaults to 0.

  • index_labels (List[str] or None, optional) – Column name(s) to label the index when axis=1.

  • value_labels (List[str] or None, optional) – Column names to label the values of each Series.

  • ordered (bool) – Unused parameter. Reserved for future support of deterministic vs. performance-optimized concatenation. Defaults to False.

Returns:

  • If axis=0: a new Series

  • If axis=1: a new DataFrame

Return type:

Series or DataFrame
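The axis semantics above can be sketched with plain Python containers. This is a local analogue only, not the Arkouda API (the hypothetical `concat_sketch` and the `col_i` column names are illustrative; the real method aligns by index on the server):

```python
# Sketch of concat semantics: axis=0 stacks, axis=1 aligns into columns.
def concat_sketch(seriess, axis=0):
    if axis == 0:
        # vertical: stack values into one longer sequence
        out = []
        for s in seriess:
            out.extend(s)
        return out
    # horizontal: all inputs must have the same length
    if len({len(s) for s in seriess}) != 1:
        raise ValueError("axis=1 requires equal lengths")
    return {f"col_{i}": list(s) for i, s in enumerate(seriess)}

print(concat_sketch([[1, 2], [3, 4]], axis=0))  # [1, 2, 3, 4]
print(concat_sketch([[1, 2], [3, 4]], axis=1))  # {'col_0': [1, 2], 'col_1': [3, 4]}
```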

diff() Series[source]

Diffs consecutive values of the series.

Returns a new series with the same index and length. First value is set to NaN.
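The behavior can be sketched in plain Python (a local analogue, not Arkouda's server-side implementation):

```python
import math

# Sketch of Series.diff semantics: first value is NaN,
# then each element is the difference from its predecessor.
def diff_sketch(values):
    return [math.nan] + [b - a for a, b in zip(values, values[1:])]

out = diff_sketch([1.0, 4.0, 9.0, 16.0])
print(out)  # [nan, 3.0, 5.0, 7.0]
```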

dt
property dtype: numpy.dtype
fillna(value: supported_scalars | Series | arkouda.numpy.pdarrayclass.pdarray) Series[source]

Fill NA/NaN values using the specified method.

Parameters:

value (supported_scalars, Series, or pdarray) – Value to use to fill holes (e.g. 0), alternately a Series of values specifying which value to use for each index. Values not in the Series will not be filled. This value cannot be a list.

Returns:

Object with missing values filled.

Return type:

Series

Examples

>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> data = ak.Series([1, np.nan, 3, np.nan, 5])
>>> data
0    1.0
1    NaN
2    3.0
3    NaN
4    5.0
dtype: float64
>>> fill_values1 = ak.ones(5)
>>> data.fillna(fill_values1)
0    1.0
1    1.0
2    3.0
3    1.0
4    5.0
dtype: float64
>>> fill_values2 = Series(ak.ones(5))
>>> data.fillna(fill_values2)
0    1.0
1    1.0
2    3.0
3    1.0
4    5.0
dtype: float64
>>> fill_values3 = 100.0
>>> data.fillna(fill_values3)
0      1.0
1    100.0
2      3.0
3    100.0
4      5.0
dtype: float64
classmethod from_return_msg(rep_msg: str) Series[source]

Return a Series instance pointing to components created by the arkouda server.

The user should not call this function directly.

Parameters:

rep_msg (builtin_str) – Delimited string containing the values and indexes.

Returns:

A Series representing a set of pdarray components on the server.

Return type:

Series

Raises:

RuntimeError – Raised if a server-side error is thrown in the process of creating the Series instance.

has_repeat_labels() bool[source]

Return whether the Series has any labels that appear more than once.
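The check amounts to asking whether any index label occurs more than once, which can be sketched locally (illustrative only; the hypothetical `has_repeat_labels_sketch` is not the Arkouda implementation):

```python
from collections import Counter

# Sketch of has_repeat_labels semantics: does any label appear twice?
def has_repeat_labels_sketch(labels):
    return any(count > 1 for count in Counter(labels).values())

print(has_repeat_labels_sketch([1, 2, 4]))     # False
print(has_repeat_labels_sketch([1, 2, 2, 4]))  # True
```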

hasnans() arkouda.numpy.dtypes.bool_scalars[source]

Return True if there are any NaNs.

Return type:

bool

Examples

>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = ak.Series(ak.array([1, 2, 3, np.nan]))
>>> s
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
>>> s.hasnans()
np.True_
head(n: int = 10) Series[source]

Return the first n values of the series.

property iat: _iLocIndexer

Accesses entries of a Series by position.

Returns:

An indexer for position-based access to a single element.

Return type:

_iLocIndexer

property iloc: _iLocIndexer

Accesses entries of a Series by position.

Returns:

An indexer for position-based access to Series entries.

Return type:

_iLocIndexer

is_registered() bool[source]
Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

See also

register, attach, unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isin(lst: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | List) Series[source]

Find Series elements whose values are in the specified list.

Parameters:

lst (pdarray, Strings, or List) – Either a Python list or an Arkouda array to check membership against.

Returns:

A Series of booleans that is True for elements found in the list, and False otherwise.

Return type:

Series
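The membership test can be sketched in plain Python (a local analogue; the actual operation runs on the server against distributed arrays):

```python
# Sketch of Series.isin semantics: elementwise membership against a lookup set.
def isin_sketch(values, lst):
    members = set(lst)  # constant-time lookups
    return [v in members for v in values]

print(isin_sketch([1, 2, 3, 4], [2, 4, 6]))  # [False, True, False, True]
```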

isna() Series[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings ‘’ are not considered NA values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is an NA value.

Return type:

Series

Examples

>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.isna()
1    False
2    False
4     True
dtype: bool
isnull() Series[source]

Series.isnull is an alias for Series.isna.

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings ‘’ are not considered NA values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is an NA value.

Return type:

Series

Examples

>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.isnull()
1    False
2    False
4     True
dtype: bool
property loc: _LocIndexer

Accesses entries of a Series by label.

Returns:

An indexer for label-based access to Series entries.

Return type:

_LocIndexer

locate(key: int | arkouda.numpy.pdarrayclass.pdarray | arkouda.pandas.index.Index | Series | List | Tuple) Series[source]

Lookup values by index label.

Parameters:

key (int, pdarray, Index, Series, List, or Tuple) –

The key or keys to look up. This can be:
  • A scalar
  • A list of scalars
  • A list of lists (for MultiIndex)
  • A Series (in which case labels are preserved, and its values are used as keys)

Keys will be converted to Arkouda arrays as needed.

Returns:

A Series containing the values corresponding to the key.

Return type:

Series

map(arg: dict | arkouda.Series) arkouda.Series[source]

Map values of Series according to an input mapping.

Parameters:

arg (dict or Series) – The mapping correspondence.

Returns:

A new series with the same index as the caller. When the input Series has Categorical values, the return Series will have Strings values. Otherwise, the return type will match the input type.

Return type:

Series

Raises:

TypeError – Raised if arg is not of type dict or arkouda.Series. Raised if series values not of type pdarray, Categorical, or Strings.

Examples

>>> import arkouda as ak
>>> s = ak.Series(ak.array([2, 3, 2, 3, 4]))
>>> s
0    2
1    3
2    2
3    3
4    4
dtype: int64
>>> s.map({4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})
0    30.0
1     5.0
2    30.0
3     5.0
4    25.0
dtype: float64
>>> s2 = ak.Series(ak.array(["a","b","c","d"]), index = ak.array([4,2,1,3]))
>>> s.map(s2)
0    b
1    d
2    b
3    d
4    a
dtype: ...
max()[source]
mean()[source]
memory_usage(index: bool = True, unit: Literal['B', 'KB', 'MB', 'GB'] = 'B') int[source]

Return the memory usage of the Series.

The memory usage can optionally include the contribution of the index.

Parameters:
  • index (bool) – Specifies whether to include the memory usage of the Series index. Defaults to True.

  • unit ({"B", "KB", "MB", "GB"}) – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}. Defaults to “B”.

Returns:

Memory consumed, expressed in the requested unit (bytes by default).

Return type:

int

See also

arkouda.numpy.pdarrayclass.nbytes, arkouda.Index.memory_usage, arkouda.pandas.series.Series.memory_usage, arkouda.pandas.dataframe.DataFrame.memory_usage

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.series import Series
>>> s = ak.Series(ak.arange(3))
>>> s.memory_usage()
48

Not including the index gives the size of the rest of the data, which is necessarily smaller:

>>> s.memory_usage(index=False)
24

Select the units:

>>> s = ak.Series(ak.arange(3000))
>>> s.memory_usage(unit="KB")
46.875
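The 46.875 figure in the last example follows from simple arithmetic: 3000 int64 values plus a 3000-element int64 index, at 8 bytes per element, converted to kibibytes (assuming, as the examples above suggest, that both values and index are 8-byte integers):

```python
# Arithmetic behind the KB example: values + index, 8 bytes per int64 element.
n, bytes_per_element = 3000, 8
total_bytes = n * bytes_per_element * 2  # data array plus index array
print(total_bytes / 1024)                # 46.875
```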
min()[source]
property ndim: int
notna() Series[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings ‘’ are not considered NA values. NA values, such as numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is not an NA value.

Return type:

Series

Examples

>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.notna()
1     True
2     True
4    False
dtype: bool
notnull() Series[source]

Series.notnull is an alias for Series.notna.

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings ‘’ are not considered NA values. NA values, such as numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is not an NA value.

Return type:

Series

Examples

>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.notnull()
1     True
2     True
4    False
dtype: bool
objType = 'Series'
static pdconcat(arrays: List, axis: int = 0, labels: arkouda.numpy.strings.Strings | None = None) pandas.Series | pandas.DataFrame[source]

Concatenate a list of Arkouda Series or grouped arrays, returning a local pandas object.

If a list of grouped Arkouda arrays is passed, they are converted to Series. Each grouping is a 2-tuple with the first item being the key(s) and the second the value.

If axis=1 (horizontal), each Series or grouping must have the same length and the same index. The index is converted to a column in the resulting DataFrame. If it is a MultiIndex, each level is converted to a separate column.

Parameters:
  • arrays (List) – A list of Series or groupings (tuples of index and values) to concatenate.

  • axis (int) – The axis along which to concatenate: - 0 = vertical (stack into a Series) - 1 = horizontal (align by index into a DataFrame) Defaults to 0.

  • labels (Strings or None, optional) – Names to assign to the resulting columns in the DataFrame.

Returns:

  • If axis=0: a local pandas Series

  • If axis=1: a local pandas DataFrame

Return type:

Series or DataFrame

prod()[source]
register(user_defined_name: str)[source]

Register this Series object and underlying components with the Arkouda server.

Parameters:

user_defined_name (builtin_str) – User-defined name the Series is to be registered under. This will be the root name for the underlying components.

Returns:

The same Series which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Series with the same name.

Return type:

Series

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Series with the user_defined_name

See also

unregister, attach, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
property shape: Tuple[int]
size
sort_index(ascending: bool = True) Series[source]

Sort the Series by its index.

Parameters:

ascending (bool) – Whether to sort the index in ascending (default) or descending order. Defaults to True.

Returns:

A new Series sorted by index.

Return type:

Series

sort_values(ascending: bool = True) Series[source]

Sort the Series by its values.

Parameters:

ascending (bool) – Whether to sort values in ascending (default) or descending order. Defaults to True.

Returns:

A new Series sorted by its values.

Return type:

Series

std()[source]
str
sum()[source]
tail(n: int = 10) Series[source]

Return the last n values of the series.

to_dataframe(index_labels: List[str] | None = None, value_label: str | None = None) arkouda.pandas.dataframe.DataFrame[source]

Convert the Series to an Arkouda DataFrame.

Parameters:
  • index_labels (list of str or None, optional) – Column name(s) to label the index.

  • value_label (str or None, optional) – Column name to label the values.

Returns:

An Arkouda DataFrame representing the Series.

Return type:

DataFrame

to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]

Print Series in Markdown-friendly format.

Parameters:
  • mode (str, optional) – Mode in which file is opened, “wt” by default.

  • index (bool, optional, default True) – Add index (row) labels.

  • tablefmt (str = "grid") – Table format to call from tabulate: https://pypi.org/project/tabulate/

  • storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values.

  • **kwargs – These parameters will be passed to tabulate.

Note

This function should only be called on small Series as it calls pandas.Series.to_markdown: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_markdown.html

Examples

>>> import arkouda as ak
>>> s = ak.Series(["elk", "pig", "dog", "quetzal"], name="animal")
>>> print(s.to_markdown())
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+

Output markdown with a tabulate option.

>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+
to_ndarray() numpy.ndarray[source]
to_pandas() pandas.Series[source]

Convert the series to a local pandas Series.

tolist() list[source]
topn(n: int = 10) Series[source]

Return the top values of the Series.

Parameters:

n (int) – Number of values to return. Defaults to 10.

Returns:

A new Series containing the top n values.

Return type:

Series
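The semantics amount to taking the n largest values, which can be sketched locally (illustrative only; the server-side version avoids a full sort transfer):

```python
# Sketch of Series.topn semantics: the n largest values, descending.
def topn_sketch(values, n=10):
    return sorted(values, reverse=True)[:n]

print(topn_sketch([3, 1, 4, 1, 5, 9, 2, 6], n=3))  # [9, 6, 5]
```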

unregister()[source]

Unregister this Series object from the arkouda server, where it was previously registered using register() and/or attached to using attach().

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

See also

register, attach, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

validate_key(key: Series | arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical | List | supported_scalars | arkouda.numpy.segarray.SegArray) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical | supported_scalars | arkouda.numpy.segarray.SegArray[source]

Validate type requirements for keys when reading or writing the Series.

Also converts list and tuple arguments into pdarrays.

Parameters:

key (Series, pdarray, Strings, Categorical, List, supported_scalars, or SegArray) – The key or container of keys that might be used to index into the Series.

Return type:

The validated key(s), with lists and tuples converted to pdarrays

Raises:
  • TypeError – Raised if keys are not boolean values or the type of the labels. Raised if key is not one of the supported types.

  • KeyError – Raised if container of keys has keys not present in the Series

  • IndexError – Raised if the length of a boolean key array is different from the Series

validate_val(val: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | supported_scalars | List) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | supported_scalars[source]

Validate type requirements for values being written into the Series.

Also converts list and tuple arguments into pdarrays.

Parameters:

val (pdarray, Strings, supported_scalars, or List) – The value or container of values that might be assigned into the Series.

Return type:

The validated value, with lists converted to pdarrays

Raises:

TypeError – Raised if val is not the same type, or a container with elements of the same type, as the Series. Raised if val is a string or Strings type. Raised if val is not one of the supported types.

value_counts(sort: bool = True) Series[source]

Return a Series containing counts of unique values.

Parameters:

sort (bool) – Whether to sort the result by count in descending order. If False, the order of the results is not guaranteed. Defaults to True.

Returns:

A Series where the index contains the unique values and the values are their counts in the original Series.

Return type:

Series
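The counting semantics can be sketched with the standard library (a local analogue; the hypothetical `value_counts_sketch` is not the Arkouda implementation, which counts on the server):

```python
from collections import Counter

# Sketch of Series.value_counts semantics: unique values paired with their
# counts, ordered by count descending when sort=True.
def value_counts_sketch(values, sort=True):
    counts = Counter(values)
    return counts.most_common() if sort else list(counts.items())

print(value_counts_sketch([2, 3, 2, 3, 2]))  # [(2, 3), (3, 2)]
```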

var()[source]
class arkouda.StringAccessor(series)[source]

Bases: Properties

Accessor for string operations on Arkouda Series.

Provides string-like methods such as .contains(), .startswith(), and .endswith() via the .str accessor, similar to pandas.

This accessor is automatically attached to Series objects that wrap arkouda.Strings or arkouda.Categorical values. It should not be instantiated directly.

Parameters:

series (arkouda.pandas.Series) – The Series object containing Strings or Categorical values.

Raises:

AttributeError – If the underlying Series values are not Strings or Categorical.

Examples

>>> import arkouda as ak
>>> from arkouda import Series
>>> s = Series(["apple", "banana", "apricot"])
>>> s.str.startswith("a")
0     True
1    False
2     True
dtype: bool
series
arkouda.apply(arr: arkouda.numpy.pdarrayclass.pdarray, func: Callable | str, result_dtype: numpy.dtype | str | None = None) arkouda.numpy.pdarrayclass.pdarray[source]

Apply a python function to a pdarray.

The function should take one argument and return a new value. The function will then be called on each element in the pdarray.

Warning

This function is experimental and may not work as expected.

Known limitations:

  • Any python modules used inside of the function must be installed on the server.

Parameters:
  • arr (pdarray) – The pdarray to which the function is applied.

  • func (Union[Callable, str]) – The function to apply to the array. This can be a callable function or a string, but either way it should take a single argument and return a single value. If a string, it should be a lambda function that takes a single argument, e.g. "lambda x,: x+1". Note the dangling comma after the argument; this is required for string functions.

  • result_dtype (Optional[Union[np.dtype, str]]) – The dtype of the resulting pdarray. If None, the dtype of the resulting pdarray will be the same as the input pdarray. If a string, it should be a valid NumPy dtype string, e.g. "float64". If a NumPy dtype, it should be a valid NumPy dtype object, e.g. np.float64. This is not supported for functions passed as strings.

Returns:

The pdarray resulting from applying the function to the input array.

Return type:

pdarray

Examples

>>> import arkouda as ak
>>> arr = ak.apply(ak.array([1, 2, 3]), lambda x: x + 1)
>>> arr
array([2 3 4])
>>> import arkouda as ak
>>> import math
>>> arr = ak.randint(0, 10, 4, seed=1)
>>> def times_pi(x):
...     return x * math.pi
>>> arr = ak.apply(arr, times_pi, "float64")
>>> arr
array([21.991148575128552 28.274333882308138 15.707963267948966 3.1415926535897931])
arkouda.assert_almost_equal(left, right, rtol: float = 1e-05, atol: float = 1e-08, **kwargs) None[source]

Check that the left and right objects are approximately equal.

By approximately equal, we refer to objects that are numbers or that contain numbers which may be equivalent to specific levels of precision.

Parameters:
  • left (object)

  • right (object)

  • rtol (float, default 1e-5) – Relative tolerance.

  • atol (float, default 1e-8) – Absolute tolerance.

Warning

This function cannot be used on pdarrays of size > ak.core.client.maxTransferBytes because it converts pdarrays to numpy arrays and calls np.allclose.
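Since the comparison bottoms out in np.allclose, the tolerance criterion is the usual element-wise |left - right| <= atol + rtol * |right|, which can be sketched in plain Python (illustrative only; `almost_equal_sketch` is hypothetical):

```python
# Sketch of the np.allclose tolerance criterion used by assert_almost_equal:
#   |a - b| <= atol + rtol * |b|, element-wise.
def almost_equal_sketch(left, right, rtol=1e-05, atol=1e-08):
    return all(abs(a - b) <= atol + rtol * abs(b) for a, b in zip(left, right))

print(almost_equal_sketch([0.123456], [0.123457], rtol=1e-4))  # True
print(almost_equal_sketch([1.0], [2.0]))                       # False
```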

arkouda.assert_almost_equivalent(left, right, rtol: float = 1e-05, atol: float = 1e-08) None[source]

Check that two objects are approximately equal.

By approximately equal, we refer to objects that are numbers or that contain numbers which may be equivalent to specific levels of precision.

If the objects are pandas or numpy objects, they are converted to Arkouda objects. Then assert_almost_equal is applied to the result.

Parameters:
  • left (object) – First object to compare.

  • right (object) – Second object to compare.

  • rtol (float) – Relative tolerance. Default is 1e-5.

  • atol (float) – Absolute tolerance. Default is 1e-8.

Raises:

TypeError – If either input is not a supported numeric-like type.

Warning

This function cannot be used on pdarrays of size > ak.core.client.maxTransferBytes because it converts pdarrays to numpy arrays and calls np.allclose.

Examples

>>> import arkouda as ak
>>> from arkouda.testing import assert_almost_equivalent
>>> assert_almost_equivalent(0.123456, 0.123457, rtol=1e-4)
arkouda.assert_arkouda_array_equal(left: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray, right: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]

Check that two Arkouda arrays are equivalent. Supports pdarray, Strings, Categorical, and SegArray.

Parameters:
  • left (pdarray or Strings or Categorical or SegArray) – The first array to compare.

  • right (pdarray or Strings or Categorical or SegArray) – The second array to compare.

  • check_dtype (bool) – Whether to check dtype if both left and right are ak.pdarray. Defaults to True.

  • err_msg (str or None) – Custom assertion message, if provided. Defaults to None.

  • check_same ({'copy', 'same'} or None) – If not None, assert whether left and right share the same memory. - ‘copy’: assert that they do not share memory. - ‘same’: assert that they do share memory. Defaults to None.

  • obj (str) – Object name used in assertion error messages. Defaults to ‘pdarray’.

  • index_values (Index or pdarray or None) – Optional index shared by both left and right, used to enhance output in error messages. Defaults to None.

arkouda.assert_arkouda_array_equivalent(left: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray | numpy.ndarray | pandas.Categorical, right: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray | numpy.ndarray | pandas.Categorical, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]

Check that two Arkouda-compatible arrays are equal.

Supported types include numpy arrays, pandas Categorical, and Arkouda arrays.

Parameters:
  • left (pdarray, Strings, Categorical, SegArray, np.ndarray, or pd.Categorical) – First array to compare.

  • right (pdarray, Strings, Categorical, SegArray, np.ndarray, or pd.Categorical) – Second array to compare.

  • check_dtype (bool) – Whether to verify that dtypes match. Default is True.

  • err_msg (str or None) – Optional message to display on failure.

  • check_same (None or {"copy", "same"}) – Whether to ensure identity or separation in memory. Default is None.

  • obj (str) – Object label for error messages. Default is “pdarray”.

  • index_values (Index or pdarray, optional) – Shared index used in error output. Default is None.

Raises:

TypeError – If either input is not a supported array type.

Examples

>>> import arkouda as ak
>>> from arkouda import Strings
>>> from arkouda.testing import assert_arkouda_array_equivalent
>>> a = ak.array([1, 2, 3])
>>> b = ak.array([1, 2, 3])
>>> assert_arkouda_array_equivalent(a, b)
>>> s1 = ak.array(['x', 'y'])
>>> s2 = ak.array(['x', 'y'])
>>> assert_arkouda_array_equivalent(s1, s2)
arkouda.assert_arkouda_pdarray_equal(left: arkouda.pdarray, right: arkouda.pdarray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]

Check that two Arkouda pdarray objects are equivalent.

Parameters:
  • left (pdarray) – The first array to compare.

  • right (pdarray) – The second array to compare.

  • check_dtype (bool) – Whether to check dtype if both arrays are pdarrays. Defaults to True.

  • err_msg (str or None) – Custom assertion message to display on failure. Defaults to None.

  • check_same ({'copy', 'same'} or None) – If not None, asserts whether left and right share the same memory: - ‘copy’: assert they do not share memory - ‘same’: assert they do share memory Defaults to None.

  • obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘pdarray’.

  • index_values (Index or pdarray or None) – Optional index shared by both arrays, used to enhance output on failure. Defaults to None.

arkouda.assert_arkouda_segarray_equal(left: arkouda.SegArray, right: arkouda.SegArray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'segarray') None[source]

Check that two Arkouda SegArray objects are equivalent.

Parameters:
  • left (SegArray) – The first SegArray to compare.

  • right (SegArray) – The second SegArray to compare.

  • check_dtype (bool) – Whether to check dtype if both arrays contain pdarrays. Defaults to True.

  • err_msg (str or None) – Custom assertion message. Defaults to None.

  • check_same ({'copy', 'same'} or None) – If not None, asserts whether left and right share the same memory. - ‘copy’: assert that they do not share memory. - ‘same’: assert that they do share memory. Defaults to None.

  • obj (str) – Name of the object being compared (used in assertion messages). Defaults to ‘segarray’.

arkouda.assert_arkouda_strings_equal(left, right, err_msg=None, check_same=None, obj: str = 'Strings', index_values=None) None[source]

Check that two ak.Strings arrays are equivalent.

Parameters:
  • left (Strings) – The first Strings object to compare.

  • right (Strings) – The second Strings object to compare.

  • err_msg (str or None) – Custom assertion message. Defaults to None.

  • check_same ({'copy', 'same'} or None) – If not None, assert whether left and right share the same memory. - ‘copy’: assert that they do not share memory - ‘same’: assert that they do share memory Defaults to None.

  • obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘Strings’.

  • index_values (Index or pdarray or None) – Optional index shared by both arrays, used in output. Defaults to None.

arkouda.assert_attr_equal(attr: str, left, right, obj: str = 'Attributes') None[source]

Check that attributes are equal. Both objects must have the given attribute.

Parameters:
  • attr (str) – The name of the attribute being compared.

  • left (object) – The first object to compare.

  • right (object) – The second object to compare.

  • obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘Attributes’.

arkouda.assert_categorical_equal(left, right, check_dtype: bool = True, check_category_order: bool = True, obj: str = 'Categorical') None[source]

Test that Categoricals are equivalent.

Parameters:
  • left (Categorical) – The first Categorical to compare.

  • right (Categorical) – The second Categorical to compare.

  • check_dtype (bool) – Whether to check that the integer dtype of the codes is the same. Defaults to True.

  • check_category_order (bool) – Whether to compare the order of the categories (which implies identical integer codes). If False, only the resulting values are compared. The ordered attribute is always checked. Defaults to True.

  • obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘Categorical’.

arkouda.assert_class_equal(left, right, exact: bool = True, obj: str = 'Input') None[source]

Check classes are equal.

arkouda.assert_contains_all(iterable, dic) None[source]

Assert that a dictionary contains all the elements of an iterable.

Parameters:
  • iterable (iterable)

  • dic (dict)

arkouda.assert_copy(iter1, iter2, **eql_kwargs) None[source]

Check that the elements are equal, but not the same object.

Does not check that items in sequences are also not the same object.

Parameters:
  • iter1 (iterable) – Iterables that produce elements comparable with assert_almost_equal.

  • iter2 (iterable) – Iterables that produce elements comparable with assert_almost_equal.

arkouda.assert_dict_equal(left, right, compare_keys: bool = True) None[source]

Assert that two dictionaries are equal.

Values must be arkouda objects.

Parameters:
  • left (dict) – The first dictionary to compare.

  • right (dict) – The second dictionary to compare.

  • compare_keys (bool) – Whether to compare the keys. Defaults to True. If False, only the values are compared.

arkouda.assert_equal(left, right, **kwargs) None[source]

Wrap tm.assert_*_equal to dispatch to the appropriate test function.

Parameters:
  • left (object) – First object to compare. Type determines which assertion function is used.

  • right (object) – Second object to compare.

  • **kwargs – Keyword arguments passed through to the dispatched assertion function.
arkouda.assert_equivalent(left, right, **kwargs) None[source]

Dispatch to the appropriate assertion function depending on object types.

Parameters:
  • left (Any) – First object to compare. Type determines which assertion function is used.

  • right (Any) – Second object to compare.

  • **kwargs (dict) – Keyword arguments passed to the specific assertion function.

Raises:

AssertionError – If values are not equivalent.

Examples

>>> import arkouda as ak
>>> import pandas as pd
>>> from arkouda.testing import assert_equivalent
>>> ak_series = ak.Series([1, 2, 3])
>>> pd_series = pd.Series([1, 2, 3])
>>> assert_equivalent(ak_series, pd_series)
arkouda.assert_frame_equal(left: arkouda.DataFrame, right: arkouda.DataFrame, check_dtype: bool = True, check_index_type: bool = True, check_column_type: bool | Literal['equiv'] = 'equiv', check_frame_type: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_like: bool = False, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'DataFrame') None[source]

Check that left and right DataFrame are equal.

This function is intended to compare two DataFrames and output any differences. It is mostly intended for use in unit tests. Additional parameters allow varying the strictness of the equality checks performed.

Parameters:
  • left (DataFrame) – First DataFrame to compare.

  • right (DataFrame) – Second DataFrame to compare.

  • check_dtype (bool) – Whether to check the DataFrame dtype is identical. Defaults to True.

  • check_index_type (bool) – Whether to check the Index class, dtype, and inferred_type are identical. Defaults to True.

  • check_column_type (bool or {'equiv'}) – Whether to check the column class, dtype, and inferred_type are identical. Passed as the exact argument of assert_index_equal(). Defaults to ‘equiv’.

  • check_frame_type (bool) – Whether to check the DataFrame class is identical. Defaults to True.

  • check_names (bool) – Whether to check that the names attribute for both the index and column attributes of the DataFrame is identical. Defaults to True.

  • check_exact (bool) – Whether to compare numbers exactly. Defaults to True.

  • check_categorical (bool) – Whether to compare internal Categoricals exactly. Defaults to True.

  • check_like (bool) – If True, ignore the order of index and columns. Note: index labels must match their respective rows (likewise for columns); identical labels must correspond to the same data. Defaults to False.

  • rtol (float) – Relative tolerance. Only used when check_exact is False. Defaults to 1e-5.

  • atol (float) – Absolute tolerance. Only used when check_exact is False. Defaults to 1e-8.

  • obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘DataFrame’.

See also

assert_series_equal

Equivalent method for asserting Series equality.

Examples

>>> import arkouda as ak

This example shows comparing two DataFrames that are equal but with columns of differing dtypes.

>>> from arkouda.testing import assert_frame_equal
>>> df1 = ak.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df2 = ak.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

df1 equals itself.

>>> assert_frame_equal(df1, df1)

df1 differs from df2 as column 'b' is of a different type.

>>> assert_frame_equal(df1, df2)  # doctest: +SKIP
Traceback (most recent call last):
...
AssertionError: Attributes of DataFrame.iloc[:, 1] (column name="b") are different

Attribute "dtype" are different
[left]:  int64
[right]: float64

Ignore differing dtypes in columns with check_dtype.

>>> assert_frame_equal(df1, df2, check_dtype=False)
arkouda.assert_frame_equivalent(left: arkouda.DataFrame | pandas.DataFrame, right: arkouda.DataFrame | pandas.DataFrame, check_dtype: bool = True, check_index_type: bool = True, check_column_type: bool = True, check_frame_type: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_like: bool = False, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'DataFrame') None[source]

Check that two DataFrames are equal.

This function compares two DataFrames and raises an assertion if they differ. It is intended primarily for use in unit tests. pandas DataFrames are converted to Arkouda equivalents before comparison.

Parameters:
  • left (DataFrame or pd.DataFrame) – First DataFrame to compare.

  • right (DataFrame or pd.DataFrame) – Second DataFrame to compare.

  • check_dtype (bool) – Whether to check that dtypes are identical. Default is True.

  • check_index_type (bool) – Whether to check that index class, dtype, and inferred type are identical. Default is True.

  • check_column_type (bool) – Whether to check that column class, dtype, and inferred type are identical. Default is True.

  • check_frame_type (bool) – Whether to check that the DataFrame class is identical. Default is True.

  • check_names (bool) – Whether to check that the index and column names are identical. Default is True.

  • check_exact (bool) – Whether to compare values exactly. Default is True.

  • check_categorical (bool) – Whether to compare internal categoricals exactly. Default is True.

  • check_like (bool) – Whether to ignore the order of index and columns. Labels must still match their data. Default is False.

  • rtol (float) – Relative tolerance used when check_exact is False. Default is 1e-5.

  • atol (float) – Absolute tolerance used when check_exact is False. Default is 1e-8.

  • obj (str) – Object name used in error messages. Default is “DataFrame”.

Raises:

TypeError – If either input is not a DataFrame or pd.DataFrame.

Examples

>>> import arkouda as ak
>>> import pandas as pd
>>> from arkouda.testing import assert_frame_equivalent
>>> df1 = ak.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

Fails because dtypes are different:

>>> assert_frame_equivalent(df1, df2)  # doctest: +SKIP

arkouda.assert_index_equal(left: arkouda.Index, right: arkouda.Index, exact: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Index') None[source]

Check that left and right Index are equal.

Parameters:
  • left (Index) – The first Index to compare.

  • right (Index) – The second Index to compare.

  • exact (bool) – Whether to check that the Index class, dtype, and inferred_type are identical. Defaults to True.

  • check_names (bool) – Whether to check the name attribute. Defaults to True.

  • check_exact (bool) – Whether to compare numbers exactly. Defaults to True.

  • check_categorical (bool) – Whether to compare internal Categorical values exactly. Defaults to True.

  • check_order (bool) – Whether to compare the order of index entries as well as their values. If True, both indexes must contain the same elements, in the same order. If False, both indexes must contain the same elements, but in any order. Defaults to True.

  • rtol (float) – Relative tolerance. Only used when check_exact is False. Defaults to 1e-5.

  • atol (float) – Absolute tolerance. Only used when check_exact is False. Defaults to 1e-8.

  • obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘Index’.

Examples

>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> a = ak.Index([1, 2, 3])
>>> b = ak.Index([1, 2, 3])
>>> tm.assert_index_equal(a, b)
arkouda.assert_index_equivalent(left: arkouda.Index | pandas.Index, right: arkouda.Index | pandas.Index, exact: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Index') None[source]

Check that two Index objects are equal.

If the objects are pandas Index, they are converted to Arkouda Index. Then assert_index_equal is applied to the result.

Parameters:
  • left (Index or pd.Index) – First Index to compare.

  • right (Index or pd.Index) – Second Index to compare.

  • exact (bool) – Whether to check that class, dtype, and inferred type are identical. Default is True.

  • check_names (bool) – Whether to check the names attribute. Default is True.

  • check_exact (bool) – Whether to compare values exactly. Default is True.

  • check_categorical (bool) – Whether to compare internal Categoricals exactly. Default is True.

  • check_order (bool) – Whether to require identical order in index values. Default is True.

  • rtol (float) – Relative tolerance used when check_exact is False. Default is 1e-5.

  • atol (float) – Absolute tolerance used when check_exact is False. Default is 1e-8.

  • obj (str) – Object name used in error messages. Default is “Index”.

Raises:

TypeError – If either input is not an Index or pd.Index.

Examples

>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> import pandas as pd
>>> a = ak.Index([1, 2, 3])
>>> b = pd.Index([1, 2, 3])
>>> tm.assert_index_equivalent(a, b)
arkouda.assert_is_sorted(seq) None[source]

Assert that the sequence is sorted.
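
For intuition, a minimal pure-Python analogue of the sortedness assertion (arkouda.assert_is_sorted performs the check on server-held arrays instead):

```python
# Pure-Python analogue of an is-sorted assertion, for intuition only;
# arkouda.assert_is_sorted checks server-held arrays instead.
def assert_is_sorted_sketch(seq):
    # A sequence is sorted (non-decreasing) if every adjacent pair is ordered.
    assert all(x <= y for x, y in zip(seq, seq[1:])), "sequence is not sorted"

assert_is_sorted_sketch([1, 2, 2, 5])  # passes: non-decreasing
```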

arkouda.assert_series_equal(left, right, check_dtype: bool = True, check_index_type: bool = True, check_series_type: bool = True, check_names: bool = True, check_exact: bool = False, check_categorical: bool = True, check_category_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Series', *, check_index: bool = True, check_like: bool = False) None[source]

Check that left and right Series are equal.

Parameters:
  • left (Series) – First Series to compare.

  • right (Series) – Second Series to compare.

  • check_dtype (bool) – Whether to check the Series dtype is identical. Defaults to True.

  • check_index_type (bool) – Whether to check the Index class, dtype, and inferred_type are identical. Defaults to True.

  • check_series_type (bool) – Whether to check that the Series class is identical. Defaults to True.

  • check_names (bool) – Whether to check the Series and Index name attribute. Defaults to True.

  • check_exact (bool) – Whether to compare numbers exactly. Defaults to False.

  • check_categorical (bool) – Whether to compare internal Categoricals exactly. Defaults to True.

  • check_category_order (bool) – Whether to compare the category order of internal Categoricals. Defaults to True.

  • rtol (float) – Relative tolerance. Only used when check_exact is False. Defaults to 1e-5.

  • atol (float) – Absolute tolerance. Only used when check_exact is False. Defaults to 1e-8.

  • obj (str) – Name of the object being compared, used in assertion messages. Defaults to ‘Series’.

  • check_index (bool) – Whether to check index equivalence. If False, only the values are compared. Defaults to True.

  • check_like (bool) – If True, ignore the order of the index. Must be False if check_index is False. Note: identical labels must still correspond to the same data. Defaults to False.

Examples

>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> a = ak.Series([1, 2, 3, 4])
>>> b = ak.Series([1, 2, 3, 4])
>>> tm.assert_series_equal(a, b)
arkouda.assert_series_equivalent(left: arkouda.Series | pandas.Series, right: arkouda.Series | pandas.Series, check_dtype: bool = True, check_index_type: bool = True, check_series_type: bool = True, check_names: bool = True, check_exact: bool = False, check_categorical: bool = True, check_category_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Series', *, check_index: bool = True, check_like: bool = False) None[source]

Check that two Series are equal.

This function compares two Series and raises an assertion if they differ. pandas Series are converted to Arkouda equivalents before comparison. The comparison can be customized using the provided keyword arguments.

Parameters:
  • left (Series or pd.Series) – First Series to compare.

  • right (Series or pd.Series) – Second Series to compare.

  • check_dtype (bool) – Whether to check that dtypes are identical. Default is True.

  • check_index_type (bool) – Whether to check that index class, dtype, and inferred type are identical. Default is True.

  • check_series_type (bool) – Whether to check that the Series class is identical. Default is True.

  • check_names (bool) – Whether to check that the Series and Index name attributes are identical. Default is True.

  • check_exact (bool) – Whether to compare numbers exactly. Default is False.

  • check_categorical (bool) – Whether to compare internal Categoricals exactly. Default is True.

  • check_category_order (bool) – Whether to compare category order in internal Categoricals. Default is True.

  • rtol (float) – Relative tolerance used when check_exact is False. Default is 1e-5.

  • atol (float) – Absolute tolerance used when check_exact is False. Default is 1e-8.

  • obj (str) – Object name used in error messages. Default is “Series”.

  • check_index (bool) – Whether to check index equivalence. If False, only values are compared. Default is True.

  • check_like (bool) – If True, ignore the order of the index. Must be False if check_index is False. Note: identical labels must still correspond to the same data. Default is False.

Raises:

TypeError – If either input is not a Series or pd.Series.

Examples

>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> import pandas as pd
>>> a = ak.Series([1, 2, 3, 4])
>>> b = pd.Series([1, 2, 3, 4])
>>> tm.assert_series_equivalent(a, b)
arkouda.chisquare(f_obs, f_exp=None, ddof=0)[source]

Compute the chi-square statistic and p-value.

Parameters:
  • f_obs (pdarray) – The observed frequency.

  • f_exp (pdarray, default = None) – The expected frequency.

  • ddof (int) – The delta degrees of freedom.

Return type:

arkouda.akstats.Power_divergenceResult

Examples

>>> import arkouda as ak
>>> from arkouda.scipy import chisquare
>>> chisquare(ak.array([10, 20, 30, 10]), ak.array([10, 30, 20, 10]))
Power_divergenceResult(statistic=np.float64(8.333333333333334), pvalue=np.float64(0.03960235520...))
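
The statistic in the example can be reproduced by hand; this pure-Python sketch shows the formula being evaluated, for intuition only:

```python
# Reproducing the chi-square statistic from the example above by hand:
# chi2 = sum((observed - expected)**2 / expected). Pure Python, for intuition.
f_obs = [10, 20, 30, 10]
f_exp = [10, 30, 20, 10]
chi2 = sum((o - e) ** 2 / e for o, e in zip(f_obs, f_exp))
print(chi2)  # 8.333333333333334
```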

See also

scipy.stats.chisquare, arkouda.akstats.power_divergence

References

[1] “Chi-squared test”, https://en.wikipedia.org/wiki/Chi-squared_test

[2] “scipy.stats.chisquare”, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html

arkouda.compute_join_size(a: arkouda.numpy.pdarrayclass.pdarray, b: arkouda.numpy.pdarrayclass.pdarray) Tuple[int, int][source]

Compute the internal size of a hypothetical join between a and b. Returns both the number of elements and number of bytes required for the join.
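
A hedged sketch of the counting involved, assuming the hypothetical join is an inner join on equal values: the number of joined elements is the sum, over keys present in both arrays, of count_in_a * count_in_b. This is pure Python for intuition, not Arkouda's server-side implementation.

```python
# Hedged sketch: assuming an inner join on equal values, the join's element
# count is sum over shared keys of count_in_a * count_in_b.
from collections import Counter

def join_size_sketch(a, b):
    ca, cb = Counter(a), Counter(b)
    # Only keys that appear in both arrays contribute matched pairs.
    return sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())

print(join_size_sketch([1, 1, 2, 3], [1, 2, 2]))  # 4: two 1-pairs + two 2-pairs
```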

arkouda.date_operators(cls)[source]

Add common datetime operation methods to a DatetimeAccessor class.

This class decorator dynamically attaches datetime operations (floor, ceil, round) to the given class using the _make_op helper.

Parameters:

cls (type) – The accessor class to decorate.

Returns:

The accessor class with datetime methods added.

Return type:

type

Notes

Used internally to implement the .dt accessor API.
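
The decorator pattern described above can be sketched as follows; the _make_op helper here is a hypothetical stand-in for Arkouda's internal one, and the returned strings are placeholders for real datetime operations.

```python
# Illustrative sketch of a class decorator that dynamically attaches
# datetime-style methods. _make_op is a hypothetical stand-in, not
# Arkouda's internal helper.
def _make_op(name):
    def op(self, freq):
        # Placeholder body; a real accessor would round/floor/ceil timestamps.
        return f"{name}({self.value!r}, freq={freq!r})"
    op.__name__ = name
    return op

def date_operators(cls):
    # Attach floor, ceil, and round to the decorated accessor class.
    for name in ("floor", "ceil", "round"):
        setattr(cls, name, _make_op(name))
    return cls

@date_operators
class DatetimeAccessorSketch:
    def __init__(self, value):
        self.value = value

acc = DatetimeAccessorSketch("2024-01-01T12:34:56")
print(acc.floor("D"))  # floor('2024-01-01T12:34:56', freq='D')
```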

arkouda.disable_verbose(logLevel: LogLevel = LogLevel.INFO) None[source]

Disables verbose logging.

Disables verbose logging (DEBUG log level) for all ArkoudaLoggers, setting the log level for each to the logLevel parameter.

Parameters:

logLevel (LogLevel) – The new log level, defaults to LogLevel.INFO

Raises:

TypeError – Raised if logLevel is not a LogLevel enum

arkouda.enable_verbose() None[source]

Enable verbose logging (DEBUG log level) for all ArkoudaLoggers.

arkouda.export(read_path: str, dataset_name: str = 'ak_data', write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]

Export data from arkouda to pandas.

Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas.

Parameters:
  • read_path (str) – path to file where arkouda data is stored.

  • dataset_name (str) – name to store dataset under

  • write_file (str) – path to file to write pandas formatted data to. Only write the file if this is set. Default is None.

  • return_obj (bool) – When True (default) return the Pandas DataFrame object, otherwise return None.

  • index (bool) – When True, maintain the indexes loaded from the pandas file. Default is False.

Raises:

RuntimeError – Raised if the file type is unsupported.

Returns:

The pandas DataFrame when return_obj=True; otherwise None.

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.import_data

Notes

  • If an Arkouda file is exported for pandas, the format will not change. This means Parquet files will remain Parquet and HDF5 files will remain HDF5.

  • Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.

arkouda.from_series(series: pandas.Series, dtype: type | str | None = None) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings[source]

Convert a pandas Series to an Arkouda pdarray or Strings.

If dtype is not provided, the dtype is inferred from the pandas Series (using pandas dtype metadata). If dtype is provided, it is used as an override and normalized via Arkouda’s dtype resolution rules.

In addition to the core numeric and boolean types, this function supports datetime and timedelta Series of any resolution (ns, us, ms, etc.) by converting them to an int64 pdarray of nanoseconds.

Parameters:
  • series (pd.Series) – The pandas Series to convert.

  • dtype (Optional[Union[type, str]], optional) –

    Optional dtype override. This may be a Python type (e.g. bool), a NumPy scalar type (e.g. np.int64), or a dtype string.

    String-like spellings are normalized to Arkouda string dtype, including "object", "str", "string", "string[python]", and "string[pyarrow]".

Returns:

An Arkouda pdarray for numeric, boolean, datetime, or timedelta inputs, or an Arkouda Strings for string inputs.

Return type:

Union[pdarray, Strings]

Raises:

ValueError – Raised if the dtype cannot be interpreted or is unsupported for conversion.

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> import pandas as pd

Integers:

>>> np.random.seed(1701)
>>> ak.from_series(pd.Series(np.random.randint(0, 10, 5)))
array([4 3 3 5 0])
>>> ak.from_series(pd.Series(['1', '2', '3', '4', '5']), dtype=np.int64)
array([1 2 3 4 5])

Floats:

>>> np.random.seed(1701)
>>> ak.from_series(pd.Series(np.random.uniform(low=0.0, high=1.0, size=3)))
array([0.089433234324597599 0.1153776854774361 0.51874393620990389])

Booleans:

>>> np.random.seed(1864)
>>> ak.from_series(pd.Series(np.random.choice([True, False], size=5)))
array([True True True False False])

Strings (pandas dtype spellings normalized to Arkouda Strings):

>>> ak.from_series(pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="string"))
array(['a', 'b', 'c', 'd', 'e'])
>>> ak.from_series(pd.Series(['a', 'b', 'c'], dtype="string[pyarrow]"))
array(['a', 'b', 'c'])

Datetime (any resolution is accepted and returned as int64 nanoseconds):

>>> ak.from_series(pd.Series(pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01')])))
array([1514764800000000000 1514764800000000000])

Notes

Datetime and timedelta Series are converted to int64 nanoseconds.

String-like pandas dtypes (including object) are treated as string and converted to Arkouda Strings.

arkouda.gen_ranges(starts, ends, stride=1, return_lengths=False)[source]

Generate a segmented array of variable-length, contiguous ranges between pairs of start- and end-points.

Parameters:
  • starts (pdarray, int64) – The start value of each range

  • ends (pdarray, int64) – The end value (exclusive) of each range

  • stride (int) – Difference between successive elements of each range

  • return_lengths (bool, optional) – Whether or not to return the lengths of each segment. Default False.

Returns:

segments : pdarray, int64

The starting index of each range in the resulting array

ranges : pdarray, int64

The actual ranges, flattened into a single array

lengths : pdarray, int64

The lengths of each segment. Only returned if return_lengths=True.

Return type:

Tuple of int64 pdarrays: (segments, ranges), or (segments, ranges, lengths) when return_lengths=True.
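
A NumPy analogue of what gen_ranges computes, for intuition (Arkouda builds the segmented array server-side):

```python
# NumPy analogue of gen_ranges: materialize each range, flatten them, and
# record where each segment starts in the flat result. For intuition only.
import numpy as np

def gen_ranges_sketch(starts, ends, stride=1):
    pieces = [np.arange(s, e, stride) for s, e in zip(starts, ends)]
    lengths = np.array([len(p) for p in pieces])
    # Each segment starts where the cumulative length of prior pieces ends.
    segments = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    return segments, np.concatenate(pieces)

seg, rng = gen_ranges_sketch([0, 10], [3, 13])
print(seg.tolist(), rng.tolist())  # [0, 3] [0, 1, 2, 10, 11, 12]
```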

arkouda.get_columns(filenames: str | List[str], col_delim: str = ',', allow_errors: bool = False) List[str][source]

Get a list of column names from CSV file(s).

arkouda.get_datasets(filenames: str | List[str], allow_errors: bool = False, column_delim: str = ',', read_nested: bool = True) List[str][source]

Get the names of the datasets in the provided files.

Parameters:
  • filenames (str or List[str]) – Name of the file/s from which to return datasets

  • allow_errors (bool) – Whether or not to allow errors while accessing datasets. Default is False.

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.

Return type:

List[str] of names of the datasets

Raises:

RuntimeError

  • If no datasets are returned

Notes

  • This function currently supports HDF5 and Parquet formats.

  • Future updates to Parquet will deprecate this functionality on that format, but similar support will be added for Parquet at that time.

  • If a list of files is provided, only the datasets in the first file will be returned.

See also

ls

arkouda.get_filetype(filenames: str | List[str]) str[source]

Get the type of a file accessible to the server.

Supported file types and possible return strings are ‘HDF5’, ‘Parquet’, and ‘CSV’.

Parameters:

filenames (Union[str, List[str]]) – A file or list of files visible to the arkouda server

Returns:

Type of the file returned as a string: ‘HDF5’, ‘Parquet’, or ‘CSV’

Return type:

str

Raises:

ValueError – Raised if filename is empty or contains only whitespace

Notes

  • When a list is provided, it is assumed that all files are the same type

  • CSV Files without the Arkouda Header are not supported

arkouda.get_null_indices(filenames: str | List[str], datasets: str | List[str] | None = None) arkouda.numpy.pdarrayclass.pdarray | Mapping[str, arkouda.numpy.pdarrayclass.pdarray][source]

Get null indices of a string column in a Parquet file.

Parameters:
  • filenames (list or str) – Either a list of filenames or shell expression

  • datasets (list or str or None) – (List of) name(s) of dataset(s) to read. Each dataset must be a string column. There is no default value for this function, the datasets to be read must be specified.

Returns:

A pdarray of null indices for a single string dataset, or a dictionary of {datasetName: pdarray} when multiple datasets are requested.

Return type:

pdarray or Mapping[str, pdarray]

Raises:
  • RuntimeError – Raised if one or more of the specified files cannot be opened.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

get_datasets, ls

arkouda.import_data(read_path: str, write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]

Import data from a file saved by Pandas (HDF5/Parquet).

Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.

Parameters:
  • read_path (str) – path to file where pandas data is stored. This can be glob expression for parquet formats.

  • write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided

  • return_obj (bool) – If True (default), return the Arkouda DataFrame object. If False, return None.

  • index (bool) – If True, maintain the indexes loaded from the pandas file. Default is False.

Raises:
  • RuntimeWarning – Export attempted on a Parquet file; Arkouda-formatted Parquet files are readable by pandas.

  • RuntimeError – Raised if the file type is unsupported.

Returns:

The Arkouda DataFrame when return_obj=True; otherwise None.

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.export

Notes

  • Import can only be performed from hdf5 or parquet files written by pandas.

arkouda.information(names: List[str] | str = RegisteredSymbols) str[source]

Return a JSON formatted string containing information about the objects in names.

Parameters:

names (Union[List[str], str]) – The name of an object, or a list of object names, for which to retrieve info. If names is ak.AllSymbols, retrieves info for all symbols in the symbol table. If names is ak.RegisteredSymbols, retrieves info for all symbols in the registry.

Returns:

JSON formatted string containing a list of information for each object in names

Return type:

str

Raises:

RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names

arkouda.intersect(a, b, positions=True, unique=False)[source]

Find the intersection of two arkouda arrays.

This function can be especially useful when positions=True so that the caller gets the indices of values present in both arrays.

Parameters:
  • a (Strings or pdarray) – An array of strings.

  • b (Strings or pdarray) – An array of strings.

  • positions (bool, default=True) – Return tuple of boolean pdarrays that indicate positions in a and b of the intersection values.

  • unique (bool, default=False) – If the number of distinct values in a (and b) is equal to the size of a (and b), there is a more efficient method to compute the intersection.

Returns:

  • (arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray) or

  • arkouda.numpy.pdarrayclass.pdarray – The indices of a and b where any element occurs at least once in both arrays.

Examples

>>> import arkouda as ak
>>> a = ak.arange(10)
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
>>> b = 2 * ak.arange(10)
>>> print(b)
[0 2 4 6 8 10 12 14 16 18]
>>> intersect(a,b, positions=True)
(array([True False True False True False True False True False]),
array([True True True True True False False False False False]))
>>> intersect(a,b, positions=False)
array([0 2 4 6 8])
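
A NumPy analogue of the positions=True behavior in the example, for intuition (np.isin plays the role of the membership test Arkouda performs server-side):

```python
# NumPy analogue of intersect's positions=True: boolean masks marking where
# each array's values also occur in the other array. For intuition only.
import numpy as np

a = np.arange(10)
b = 2 * np.arange(10)
in_b = np.isin(a, b)  # positions in a whose value also appears in b
in_a = np.isin(b, a)  # positions in b whose value also appears in a
print(a[in_b].tolist())  # [0, 2, 4, 6, 8]
```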
arkouda.intx(a, b)[source]

Find all the rows that are in both dataframes.

Columns should be in identical order.

Note: this does not work for columns of floating-point values, but it does work for Strings, pdarrays of int64 type, and Categorical columns.

Examples

>>> import arkouda as ak
>>> a = ak.DataFrame({'a':ak.arange(5),'b': 2* ak.arange(5)})
>>> a
   a  b
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8 (5 rows x 2 columns)
>>> b = ak.DataFrame({'a':ak.arange(5),'b':ak.array([0,3,4,7,8])})
>>> b
   a  b
0  0  0
1  1  3
2  2  4
3  3  7
4  4  8 (5 rows x 2 columns)
>>> intx(a,b)
array([True False True False True])
>>> intersect_df = a[intx(a,b)]
>>> intersect_df
   a  b
0  0  0
2  2  4
4  4  8 (3 rows x 2 columns)
arkouda.invert_permutation(perm)[source]

Find the inverse of a permutation array.

Parameters:

perm (pdarray) – The permutation array.

Returns:

The inverse of the permutation array.

Return type:

arkouda.numpy.pdarrayclass.pdarray

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.index import Index
>>> i = Index(ak.array([1,2,0,5,4]))
>>> perm = i.argsort()
>>> print(perm)
[2 0 1 4 3]
>>> invert_permutation(perm)
array([1 2 0 4 3])
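
The inversion identity behind this operation can be sketched in NumPy: if inv is the inverse of perm, then inv[perm[i]] == i for every i, so scattering arange(n) into perm's positions builds the inverse.

```python
# NumPy sketch of permutation inversion via scatter: inv[perm] = arange(n).
# Mirrors the example above; Arkouda performs this on distributed arrays.
import numpy as np

perm = np.array([2, 0, 1, 4, 3])
inv = np.empty_like(perm)
inv[perm] = np.arange(len(perm))
print(inv.tolist())  # [1, 2, 0, 4, 3]
```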
arkouda.join_on_eq_with_dt(a1: arkouda.numpy.pdarrayclass.pdarray, a2: arkouda.numpy.pdarrayclass.pdarray, t1: arkouda.numpy.pdarrayclass.pdarray, t2: arkouda.numpy.pdarrayclass.pdarray, dt: int | numpy.int64, pred: str, result_limit: int | numpy.int64 = 1000) Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]

Inner-join on equality between two integer arrays where the time-window predicate is also true.

Parameters:
  • a1 (pdarray) – Values to join (must be int64 dtype).

  • a2 (pdarray) – Values to join (must be int64 dtype).

  • t1 (pdarray) – timestamps in millis corresponding to the a1 pdarray

  • t2 (pdarray) – timestamps in millis corresponding to the a2 pdarray

  • dt (Union[int,np.int64]) – time delta

  • pred (str) – time window predicate

  • result_limit (Union[int,np.int64]) – size limit for returned result

Returns:

result_array_one : pdarray, int64

a1 indices where a1 == a2

result_array_two : pdarray, int64

a2 indices where a2 == a1

Return type:

Tuple[pdarray, pdarray]

Raises:
  • TypeError – Raised if a1, a2, t1, or t2 is not a pdarray, or if dt or result_limit is not an int

  • ValueError – if a1, a2, t1, or t2 dtype is not int64, pred is not ‘true_dt’, ‘abs_dt’, or ‘pos_dt’, or result_limit is < 0
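
A naive O(n*m) sketch of the join's semantics, for intuition. The predicate interpretation is an assumption not confirmed by the docs above: this sketch treats ‘abs_dt’ as accepting a pair when abs(t1[i] - t2[j]) <= dt.

```python
# Naive sketch of an equality join gated by a time-window predicate.
# Assumption (hypothetical, not Arkouda's confirmed semantics): 'abs_dt'
# accepts a pair when abs(t1[i] - t2[j]) <= dt.
def join_eq_abs_dt_sketch(a1, t1, a2, t2, dt):
    left, right = [], []
    for i, (v1, s1) in enumerate(zip(a1, t1)):
        for j, (v2, s2) in enumerate(zip(a2, t2)):
            if v1 == v2 and abs(s1 - s2) <= dt:
                left.append(i)
                right.append(j)
    return left, right

# Only the pair (a1[1], a2[0]) matches on value AND falls within dt=10 millis.
print(join_eq_abs_dt_sketch([1, 2], [100, 200], [2, 1], [205, 50], dt=10))  # ([1], [0])
```

Arkouda's implementation runs server-side and enforces the result_limit cap; this sketch omits both.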

arkouda.list_registry(detailed: bool = False)[source]

Return a list containing the names of all registered objects.

Parameters:

detailed (bool) – Whether to return details of registry objects; currently includes the object type. Defaults to False.

Returns:

Dict containing keys “Components” and “Objects”.

Return type:

dict

Raises:

RuntimeError – Raised if there’s a server-side error thrown

arkouda.list_symbol_table() List[str][source]

Return a list containing the names of all objects in the symbol table.

Returns:

List of all object names in the symbol table

Return type:

list

Raises:

RuntimeError – Raised if there’s a server-side error thrown

arkouda.load(path_prefix: str, file_format: str = 'INFER', dataset: str = 'array', calc_string_offsets: bool = False, column_delim: str = ',') Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Load objects previously saved with pdarray.save().

Parameters:
  • path_prefix (str) – Filename prefix used when saving the original object.

  • file_format (str, default="INFER") – File format to load. One of "INFER", "HDF5", or "Parquet". If "INFER", the format will be detected automatically.

  • dataset (str, default="array") – Dataset name where the object was saved.

  • calc_string_offsets (bool, default=False) – If True, the server ignores the segmented Strings offsets array and derives offsets from null-byte terminators.

  • column_delim (str, default=",") – Column delimiter used if the dataset is CSV. Otherwise unused.

Returns:

Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]

Raises:
  • TypeError – Raised if either path_prefix or dataset is not a str.

  • ValueError – Raised if an invalid file_format is given, if the dataset is not present in all HDF5 files, or if path_prefix does not correspond to files accessible to Arkouda.

  • RuntimeError – Raised if the HDF5 files are present but an error occurs while opening one or more of them.

Notes

If a previously saved Parquet file raises a FileNotFoundError, try loading it with .parquet appended to path_prefix. Older versions of Arkouda always stored Parquet files with a .parquet extension.

ak.load does not support loading a single file. To load a single HDF5 file without the _LOCALE#### suffix, use ak.read().

CSV files without the Arkouda header are not supported.

Examples

>>> import arkouda as ak

Loading from file without extension:

>>> obj = ak.load("path/prefix")

This loads the array from numLocales files with the name cwd/path/name_prefix_LOCALE####. The file type is inferred automatically.

Loading with an extension (HDF5):

>>> obj = ak.load("path/prefix.test")

This loads the object from numLocales files with the name cwd/path/name_prefix_LOCALE####.test where #### corresponds to each locale number. Because the file type is inferred, the extension does not need to correspond to a specific format.

arkouda.load_all(path_prefix: str, file_format: str = 'INFER', column_delim: str = ',', read_nested: bool = True) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().

Parameters:
  • path_prefix (str) – Filename prefix used to save the original pdarray

  • file_format (str) – ‘INFER’, ‘HDF5’, ‘Parquet’, or ‘CSV’. Indicates the format being loaded. When ‘INFER’, the format is detected automatically. Defaults to ‘INFER’.

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Parquet files only

Returns:

Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]

Raises:
  • TypeError – Raised if path_prefix is not a str

  • ValueError – Raised if file_format/extension is encountered that is not hdf5 or parquet or if all datasets are not present in all hdf5/parquet files or if the path_prefix does not correspond to files accessible to Arkouda

  • RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them

See also

to_parquet, to_hdf, load, read

Notes

This function determines the file extension based on the file_format variable.

This function will be deprecated when glob flags are added to read_* methods

CSV files without the Arkouda Header are not supported.

arkouda.load_checkpoint(name, path='.akdata')[source]

Load server’s state.

The server metadata must match the current configuration (e.g. same number of locales must be used).

Parameters:
  • name (str) – Name of the checkpoint. <path>/<name> must be a directory.

  • path (str) – The directory containing the checkpoint.

Returns:

The checkpoint name, which will be the same as the name argument.

Return type:

str

Examples

>>> import arkouda as ak
>>> arr = ak.zeros(10, int)
>>> arr[2] = 2
>>> arr[2]
np.int64(2)
>>> cp_name = ak.save_checkpoint()
>>> arr[2] = 3
>>> arr[2]
np.int64(3)
>>> ak.load_checkpoint(cp_name)
>>> arr[2]
np.int64(2)

See also

save_checkpoint

arkouda.ls(filename: str, col_delim: str = ',', read_nested: bool = True) List[str][source]

List the contents of an HDF5 or Parquet file on the Arkouda server.

This function invokes the HDF5 h5ls utility on a file visible to the Arkouda server, or simulates a similar listing for Parquet files. For CSV files without headers, see ls_csv.

Parameters:
  • filename (str) – Path to the file on the Arkouda server. Must be a non-empty string.

  • col_delim (str) – Delimiter to use when interpreting CSV files. Default is “,”.

  • read_nested (bool) – If True, include nested Parquet columns (e.g., SegArray). If False, nested columns are ignored. Only applies to Parquet files. Default is True.

Returns:

A list of lines describing each dataset or column in the file.

Return type:

List[str]

Raises:
  • TypeError – If filename is not a string.

  • ValueError – If filename is empty or contains only whitespace.

  • RuntimeError – If an error occurs when running h5ls or simulating the Parquet listing.

Notes

  • Parquet support is limited and may change in future releases.

  • Output lines mirror the format of the HDF5 h5ls output.

  • For CSV files lacking headers, use ls_csv.

See also

ls_csv

List the contents of CSV files without headers.

arkouda.ls_csv(filename: str, col_delim: str = ',') List[str][source]

List the datasets within a file when a CSV does not have a header.

Parameters:
  • filename (str) – The name of the file to pass to the server

  • col_delim (str) – The delimiter used to separate columns if the file is a csv

Returns:

A list of lines describing the datasets in the file, as reported by the server

Return type:

List[str]

See also

ls

arkouda.merge(left: DataFrame, right: DataFrame, on: str | List[str] | None = None, left_on: str | List[str] | None = None, right_on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]

Merge Arkouda DataFrames with a database-style join.

The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).

Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Parameters:
  • left (DataFrame) – The Left DataFrame to be joined.

  • right (DataFrame) – The Right DataFrame to be joined.

  • on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.

  • left_on (str or List of str, optional) – Column name or names to join on in the left DataFrame. If this is not None, then right_on must also not be None, and this will override on.

  • right_on (str or List of str, optional) – Column name or names to join on in the right DataFrame. If this is not None, then left_on must also not be None, and this will override on.

  • how (str, default = "inner") – The merge condition. Must be one of “inner”, “left”, “right”, or “outer”.

  • left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right.

  • right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right.

  • convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.

  • sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.

Returns:

Joined Arkouda DataFrame.

Return type:

DataFrame

Note

Multiple column joins are only supported for integer columns.

Examples

>>> import arkouda as ak
>>> from arkouda import merge
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> left_df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4 (5 rows x 2 columns)
>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> right_df
   col1  col2
0     0     0
1     2     2
2     4     4
3     6     6
4     8     8 (5 rows x 2 columns)
>>> merge(left_df, right_df, on = "col1")
   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4 (3 rows x 3 columns)
>>> merge(left_df, right_df, on = "col1", how = "left")
   col1  col2_x  col2_y
0     0       0     0.0
1     1      -1     NaN
2     2      -2     2.0
3     3      -3     NaN
4     4      -4     4.0 (5 rows x 3 columns)
>>> merge(left_df, right_df, on = "col1", how = "right")
   col1  col2_x  col2_y
0     0     0.0       0
1     2    -2.0       2
2     4    -4.0       4
3     6     NaN       6
4     8     NaN       8 (5 rows x 3 columns)
>>> merge(left_df, right_df, on = "col1", how = "outer")
   col1  col2_x  col2_y
0     0     0.0     0.0
1     1    -1.0     NaN
2     2    -2.0     2.0
3     3    -3.0     NaN
4     4    -4.0     4.0
5     6     NaN     6.0
6     8     NaN     8.0 (7 rows x 3 columns)
arkouda.power_divergence(f_obs, f_exp=None, ddof=0, lambda_=None)[source]

Compute the power divergence statistic and p-value.

Parameters:
  • f_obs (pdarray) – The observed frequency.

  • f_exp (pdarray, default = None) – The expected frequency.

  • ddof (int) – The delta degrees of freedom.

  • lambda_ (str, default = “pearson”) –

    The power in the Cressie-Read power divergence statistic. Allowed values: “pearson”, “log-likelihood”, “freeman-tukey”, “mod-log-likelihood”, “neyman”, “cressie-read”

    Powers correspond as follows:

    • “pearson”: 1

    • “log-likelihood”: 0

    • “freeman-tukey”: -0.5

    • “mod-log-likelihood”: -1

    • “neyman”: -2

    • “cressie-read”: 2/3

Return type:

arkouda.akstats.Power_divergenceResult

Examples

>>> import arkouda as ak
>>> from arkouda.scipy import power_divergence
>>> x = ak.array([10, 20, 30, 10])
>>> y = ak.array([10, 30, 20, 10])
>>> power_divergence(x, y, lambda_="pearson")
Power_divergenceResult(statistic=np.float64(8.333333333333334), pvalue=np.float64(0.03960235520...))
>>> power_divergence(x, y, lambda_="log-likelihood")
Power_divergenceResult(statistic=np.float64(8.109302162163285), pvalue=np.float64(0.04380595350...))

See also

scipy.stats.power_divergence, arkouda.akstats.chisquare

Notes

This is a modified version of scipy.stats.power_divergence [2] in order to scale using arkouda pdarrays.

References

[1] “scipy.stats.power_divergence”, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.power_divergence.html

[2] Scipy contributors (2024) scipy (Version v1.12.0) [Source code]. https://github.com/scipy/scipy
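The statistics in the examples above can be reproduced in plain Python using the standard Cressie-Read power divergence formula; this sketch computes only the statistic, not the p-value, and operates on ordinary lists rather than pdarrays:

```python
import math

def cressie_read(f_obs, f_exp, lambda_: float) -> float:
    """Cressie-Read power divergence statistic (statistic only, no p-value)."""
    if lambda_ == 0:  # log-likelihood ratio (G-test)
        return 2.0 * sum(o * math.log(o / e) for o, e in zip(f_obs, f_exp))
    if lambda_ == -1:  # modified log-likelihood
        return 2.0 * sum(e * math.log(e / o) for o, e in zip(f_obs, f_exp))
    # General case: 2 / (lambda * (lambda + 1)) * sum(obs * ((obs/exp)^lambda - 1))
    c = 2.0 / (lambda_ * (lambda_ + 1.0))
    return c * sum(o * ((o / e) ** lambda_ - 1.0) for o, e in zip(f_obs, f_exp))

obs, exp = [10, 20, 30, 10], [10, 30, 20, 10]
print(round(cressie_read(obs, exp, 1.0), 6))  # "pearson" -> 8.333333
print(round(cressie_read(obs, exp, 0.0), 6))  # "log-likelihood" -> 8.109302
```

These match the statistics shown in the doctest examples above.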

arkouda.pretty_print_information(names: List[str] | str = RegisteredSymbols) None[source]

Print verbose information for each object in names in a human readable format.

Parameters:

names (Union[List[str], str]) – Either the name of an object or a list of names of objects for which to retrieve info. If names is ak.AllSymbols, info is retrieved for all symbols in the symbol table; if names is ak.RegisteredSymbols, info is retrieved for all symbols in the registry

Raises:

RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names

arkouda.read(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, column_delim: str = ',', read_nested: bool = True, has_non_float_nulls: bool = False, fixed_len: int = -1) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read datasets from files.

The file type is determined automatically.

Parameters:
  • filenames (Union[str, List[str]]) – Either a list of filenames or a shell expression.

  • datasets (Optional[Union[str, List[str]]], default=None) – Name or list of names of datasets to read. If None, all available datasets are read.

  • iterative (bool, default=False) – If True, make iterative function calls to the server. If False, make a single function call to the server.

  • strictTypes (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool, default=False) – If True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files.

  • column_delim (str, default=",") – Column delimiter to use if the dataset is CSV. Otherwise unused.

  • read_nested (bool, default=True) – If True, SegArray objects are read from the file. If False, SegArray objects and other nested Parquet columns are ignored. Ignored if datasets is not None. Parquet only.

  • has_non_float_nulls (bool, default=False) – Must be set to True to read non-float Parquet columns that contain null values.

  • fixed_len (int, default=-1) – Fixed string length to use when reading Parquet string columns if the length of each string is known at runtime. This can avoid byte calculation and may improve performance.

Returns:

Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]

Raises:

RuntimeError – Raised if an invalid file type is detected.

Notes

If filenames is a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded with glob to read all matching files.
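The filename-expansion semantics described above can be illustrated with Python's standard glob module. Arkouda performs the expansion server-side; this sketch, using hypothetical locale-suffixed filenames, only demonstrates that a glob pattern matches many files while a plain filename matches itself:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create two files that mimic Arkouda's per-locale output naming.
    for name in ("name_prefix_LOCALE0000.h5", "name_prefix_LOCALE0001.h5"):
        open(os.path.join(d, name), "w").close()

    matches = sorted(glob.glob(os.path.join(d, "name_prefix*")))
    exact = glob.glob(os.path.join(d, "name_prefix_LOCALE0000.h5"))
    print(len(matches))  # 2: both locale files match the expression
    print(len(exact))    # 1: a single filename is a valid expression
```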

If iterative=True, each dataset name and filename is passed to the server independently in sequence. If iterative=False, all dataset names and filenames are passed to the server in a single string.

If datasets is None, dataset names are inferred from the first file and all datasets are read. Use get_datasets to show the names of datasets in HDF5 or Parquet files.

CSV files without the Arkouda header are not supported.

Examples

>>> import arkouda as ak

Read a file with an extension:

>>> x = ak.read("path/name_prefix.h5")

The file type is determined from file contents, not the extension.

Read a Parquet file:

>>> x = ak.read("path/name_prefix.parquet")

Read files matching a glob expression:

>>> x = ak.read("path/name_prefix*")
arkouda.read_csv(filenames: str | List[str], datasets: str | List[str] | None = None, column_delim: str = ',', allow_errors: bool = False) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read CSV file(s) into Arkouda objects.

If more than one dataset is found, the objects will be returned in a dictionary mapping the dataset name to the Arkouda object containing the data. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings object.

Parameters:
  • filenames (str or List[str]) – The filenames to read data from

  • datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.

  • column_delim (str) – The delimiter for column names and data. Defaults to “,”.

  • allow_errors (bool) – Default False. If True, files with read errors will be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

Returns:

Dictionary of {datasetName: pdarray, Strings, or SegArray}

Return type:

Mapping[str, Union[pdarray, Strings, SegArray]]

Raises:
  • ValueError – Raised if not all datasets are present in all CSV files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

to_csv

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

arkouda.read_hdf(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, tag_data: bool = False) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read Arkouda objects from HDF5 files.

Parameters:
  • filenames (Union[str, List[str]]) – Filename or list of filenames to read objects from.

  • datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read from the provided files. If None, all datasets are read.

  • iterative (bool, default=False) – If True, make iterative function calls to the server. If False, make a single function call to the server.

  • strict_types (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool, default=False) – If True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files. In the future, this option may become the default.

  • tag_data (bool, default=False) – If True, tag the returned data with the code associated with the filename from which it was read.

Returns:

Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]

Raises:
  • ValueError – Raised if not all datasets are present in all HDF5 files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is True, this may also be raised if no values are returned from the server.

  • TypeError – Raised if an unknown Arkouda type is returned from the server.

Notes

If filenames is a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded with glob to read all matching files.

If iterative=True, each dataset name and filename is passed to the server independently in sequence. If iterative=False, all dataset names and filenames are passed to the server in a single string.

If datasets is None, dataset names are inferred from the first file and all datasets are read. Use get_datasets to show dataset names in HDF5 files.

See also

read_tagged_data

Examples

>>> import arkouda as ak

Read a file with an extension:

>>> x = ak.read_hdf("path/name_prefix.h5")

Read files matching a glob expression:

>>> x = ak.read_hdf("path/name_prefix*")
arkouda.read_parquet(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, tag_data: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False, null_handling: str | None = None, fixed_len: int = -1) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]

Read Arkouda objects from Parquet files.

Parameters:
  • filenames (Union[str, List[str]]) – Filename or list of filenames to read objects from.

  • datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read from the provided files. If None, all datasets are read.

  • iterative (bool, default=False) – If True, make iterative function calls to the server. If False, make a single function call to the server.

  • strict_types (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • tag_data (bool, default=False) – If True, tag the data with the code associated with the filename from which the data was read.

  • read_nested (bool, default=True) – If True, SegArray objects are read from the file. If False, SegArray objects and other nested Parquet columns are ignored. If datasets is not None, this parameter is ignored.

  • has_non_float_nulls (bool, default=False) –

    Deprecated. Use null_handling instead.

    This flag must be set to True to read non-float Parquet columns that contain null values.

  • null_handling (Optional[str], default=None) –

    Null-handling mode. Supported values are "none", "only floats", and "all". If None, the default is "only floats".

    If "none", the data is assumed to contain no nulls. This gives the best performance, but behavior is undefined if nulls are present.

    If "only floats", only floating-point columns may contain nulls. This improves performance for other data types.

    If "all", any column may contain nulls. This is the most general mode, but it is slower overall.

  • fixed_len (int, default=-1) – Fixed string length to use when reading Parquet string columns if the length of each string is known at runtime. This can avoid byte calculation and may improve performance.

Returns:

Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]

Raises:
  • ValueError – Raised if not all datasets are present in all Parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is True, this may also be raised if no values are returned from the server.

  • TypeError – Raised if an unknown Arkouda type is returned from the server.

Notes

If filenames is a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded with glob to read all matching files.

If iterative=True, each dataset name and filename is passed to the server independently in sequence. If iterative=False, all dataset names and filenames are passed to the server in a single string.

If datasets is None, dataset names are inferred from the first file and all datasets are read. Use get_datasets to show the names of datasets in Parquet files.

Parquet currently always recomputes offsets. This note should be updated when the Parquet workflow changes.

See also

read_tagged_data

Examples

>>> import arkouda as ak

Read a Parquet file:

>>> x = ak.read_parquet("path/name_prefix.parquet")

Read files matching a glob expression:

>>> x = ak.read_parquet("path/name_prefix*")
arkouda.read_tagged_data(filenames: str | List[str], datasets: str | List[str] | None = None, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False)[source]

Read datasets from files and tag each record with the file it was read from.

The file type is determined automatically.

Parameters:
  • filenames (Union[str, List[str]]) – Either a list of filenames or a shell expression.

  • datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read. If None, all available datasets are read.

  • strictTypes (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool, default=False) – If True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files. In the future, this option may become the default.

  • read_nested (bool, default=True) – If True, SegArray objects are read from the file. If False, SegArray objects and other nested Parquet columns are ignored. Ignored if datasets is not None. Parquet only.

  • has_non_float_nulls (bool, default=False) – Must be set to True to read non-float Parquet columns that contain null values.

Notes

This function is not currently supported for Categorical or GroupBy datasets.

Examples

>>> import arkouda as ak

Read files and return the data along with tagging information:

>>> data, cat = ak.read_tagged_data("path/name")

The codes in cat map each record in data to the file it came from. The returned data includes a "Filename_Codes" array.

>>> data
{"Filename_Codes": array([0 3 6 9 12]), "col_name": array([0 0 0 1])}
arkouda.read_zarr(store_path: str, ndim: int, dtype)[source]

Read a Zarr store from disk into a pdarray.

Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.

Parameters:
  • store_path (str) – The path to the Zarr store. The path must be to a directory that contains a .zarray file containing the Zarr store metadata.

  • ndim (int) – The number of dimensions in the array

  • dtype (str) – The data type of the array

Returns:

The pdarray read from the Zarr store.

Return type:

pdarray

arkouda.receive(hostname: str, port)[source]

Receive a pdarray sent by pdarray.transfer().

Parameters:
  • hostname (str) – The hostname of the server that sent the array

  • port (int_scalars) – The base port over which to receive the array. This must be an open port (i.e., not one that the Arkouda server is running on). Arkouda will open numLocales consecutive ports starting at port, i.e., the range {port..(port+numLocales-1)}. For example, when receiving from an Arkouda server of 4 nodes with port 1234, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data. This port must match the port passed to the call to pdarray.transfer().

Returns:

The pdarray sent from the sending server to the current receiving server.

Return type:

pdarray

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

arkouda.receive_dataframe(hostname: str, port)[source]

Receive a DataFrame sent by DataFrame.transfer().

Parameters:
  • hostname (str) – The hostname of the server that sent the dataframe

  • port (int_scalars) – The base port over which to receive the dataframe. This must be an open port (i.e., not one that the Arkouda server is running on). Arkouda will open numLocales consecutive ports starting at port, i.e., the range {port..(port+numLocales-1)}. For example, when receiving from an Arkouda server of 4 nodes with port 1234, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the dataframe data. This port must match the port passed to the call to DataFrame.transfer().

Returns:

The dataframe sent from the sending server to the current receiving server.

Return type:

DataFrame

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

arkouda.restore(filename)[source]

Return data saved using ak.snapshot.

Parameters:

filename (str) – Name used to create snapshot to be read

Return type:

Dict

Notes

Unlike other save/load methods, snapshot and restore save DataFrames alongside other objects in HDF5, so DataFrames are returned within the dictionary as DataFrame objects.

arkouda.save_checkpoint(name='', path='.akdata', mode: Literal['overwrite', 'preserve_previous', 'error'] = 'overwrite')[source]

Save the server’s state.

Records some metadata about the server, and saves all pdarrays into parquet files.

Parameters:
  • name (str) – Name of the checkpoint. The default will be the server session ID, which is typically in format id_<hash>_. A directory will be created in path with this name.

  • path (str) – The directory to save the checkpoint. If the directory doesn’t exist, it will be created. If it exists, a new directory for the checkpoint instance will be created inside this directory.

  • mode ({'overwrite', 'preserve_previous', 'error'}) –

    How to handle an existing checkpoint with the same name.

    • 'overwrite' (default): overwrite the checkpoint files.

    • 'preserve_previous': rename the existing checkpoint to <name>.prev, overwriting that if it exists.

    • 'error': raise an error if the checkpoint exists.

Returns:

The checkpoint name, which will be the same as the name argument if it was passed.

Return type:

str

Notes

Only pdarrays are saved. Other data structures will not be recorded. We expect to expand the coverage in the future.

Examples

>>> import arkouda as ak
>>> arr = ak.zeros(10, int)
>>> arr[2] = 2
>>> arr[2]
np.int64(2)
>>> cp_name = ak.save_checkpoint()
>>> arr[2] = 3
>>> arr[2]
np.int64(3)
>>> ak.load_checkpoint(cp_name)
>>> arr[2]
np.int64(2)

See also

load_checkpoint
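The three mode values can be sketched with ordinary filesystem operations. The helper name and logic below are illustrative assumptions for a checkpoint directory <path>/<name>, not Arkouda's implementation:

```python
import os
import shutil
import tempfile

def resolve_checkpoint_dir(path: str, name: str, mode: str) -> str:
    """Illustrative handling of an existing checkpoint directory <path>/<name>."""
    target = os.path.join(path, name)
    if os.path.exists(target):
        if mode == "error":
            raise RuntimeError(f"checkpoint {name!r} already exists")
        if mode == "preserve_previous":
            prev = target + ".prev"
            if os.path.exists(prev):
                shutil.rmtree(prev)  # overwrite any earlier backup
            os.rename(target, prev)
        # mode == "overwrite": fall through and reuse the directory
    os.makedirs(target, exist_ok=True)
    return target

with tempfile.TemporaryDirectory() as root:
    first = resolve_checkpoint_dir(root, "cp1", "overwrite")
    second = resolve_checkpoint_dir(root, "cp1", "preserve_previous")
    has_prev = os.path.isdir(first + ".prev")
    print(has_prev)  # True: the old checkpoint was renamed to cp1.prev
```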

arkouda.snapshot(filename)[source]

Create a snapshot of the current Arkouda namespace.

All currently accessible variables containing Arkouda objects will be written to an HDF5 file.

Unlike other save/load functions, this maintains the integrity of dataframes.

Current variable names are used as the dataset names when saving.

Parameters:

filename (str) – Name to use when storing file

See also

ak.restore

arkouda.string_operators(cls)[source]

Add common string operation methods to a StringAccessor class.

This class decorator dynamically attaches string operations (contains, startswith, endswith) to the given class using the _make_op helper.

Parameters:

cls (type) – The accessor class to decorate.

Returns:

The accessor class with string methods added.

Return type:

type

Notes

Used internally to implement the .str accessor API.

arkouda.to_csv(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings], prefix_path: str, names: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]

Write Arkouda object(s) to CSV file(s).

All CSV Files written by Arkouda include a header denoting data types of the columns.

Parameters:
  • columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None the keys of the mapping will be used as the dataset names.

  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • names (List[str] (Optional)) – names of dataset to be written. Order should correspond to the order of data provided in columns.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Raises:
  • ValueError – Raised if overwrite is False and files matching the provided prefix_path already exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

read_csv

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
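The delimiter caution in the notes above can be enforced with a simple pre-write check. The helper below is an illustrative sketch operating on plain Python strings, not a to_csv feature:

```python
def delimiter_is_safe(values, col_delim: str = ",") -> bool:
    """Return True if col_delim never occurs inside any string value."""
    return not any(col_delim in v for v in values)

print(delimiter_is_safe(["alpha", "beta", "gamma"]))  # True: safe to write
print(delimiter_is_safe(["alpha", "beta,gamma"]))     # False: pick another delimiter
```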

arkouda.to_hdf(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute') None[source]

Save multiple named pdarrays to HDF5 files.

Parameters:
  • columns (dict or list of pdarrays) – Collection of arrays to save

  • prefix_path (str) – Directory and filename prefix for output files

  • names (list of str) – Dataset names for the pdarrays

  • mode ({"truncate", "append"}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new datasets in existing files.

  • file_type ({"single", "distribute"}) – Defaults to "distribute". "single" writes the dataset to a single file; "distribute" writes one file per locale.

Raises:
  • ValueError – Raised if (1) the lengths of columns and names differ or (2) the mode is not ‘truncate’ or ‘append’

  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at prefix_path will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.

Examples

>>> import arkouda as ak
>>> a = ak.arange(25)
>>> b = ak.arange(25)

Save with mapping defining dataset names

>>> ak.to_hdf({'a': a, 'b': b}, 'path/name_prefix')

Save using names instead of mapping

>>> ak.to_hdf([a, b], 'path/name_prefix', names=['a', 'b'])
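The effect of mode on an existing file's datasets can be modeled in plain Python, with a dict standing in for the HDF5 file. This is an illustrative sketch of the rule described in the Notes, not arkouda's implementation, and the function name is hypothetical.

```python
# Hedged sketch: "truncate" replaces the file's datasets outright, while
# "append" adds new datasets and fails on a name clash.
def write_mode_sketch(existing, new, mode="truncate"):
    if mode == "truncate":
        return dict(new)  # overwrite: only the newly written datasets remain
    if mode == "append":
        clash = set(existing) & set(new)
        if clash:
            # mirrors the RuntimeError raised when dataset names already exist
            raise RuntimeError(f"datasets already exist: {sorted(clash)}")
        return {**existing, **new}  # add the new datasets alongside the old
    raise ValueError("mode must be 'truncate' or 'append'")

print(write_mode_sketch({"a": [1]}, {"b": [2]}, mode="append"))
# {'a': [1], 'b': [2]}
```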
arkouda.to_parquet(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: Literal['truncate', 'append'] = 'truncate', compression: str | None = None, convert_categoricals: bool = False) None[source]

Save multiple named arrays to Parquet files.

Parameters:
  • columns (Mapping[str, Union[pdarray, Strings, SegArray]] or List[Union[pdarray, Strings, SegArray]]) – Collection of arrays to save.

  • prefix_path (str) – Directory and filename prefix for the output files.

  • names (Optional[List[str]], default=None) – Dataset names for the arrays when columns is provided as a list.

  • mode (Literal["truncate", "append"], default="truncate") –

    If "truncate", overwrite any existing output files. If "append", attempt to create a new dataset in existing files.

    "append" is deprecated. Use the multi-column write instead.

  • compression (Optional[str], default=None) – Compression type to use when writing the file. Supported values include "snappy", "gzip", "brotli", "zstd", and "lz4".

  • convert_categoricals (bool, default=False) – Parquet requires all columns to have the same size, and Categorical objects do not satisfy that requirement. If set to True, write the equivalent Strings in place of any Categorical columns.

Raises:
  • ValueError – Raised if the lengths of columns and names differ, or if mode is not "truncate" or "append".

  • RuntimeError – Raised if a server-side error occurs while saving the arrays.

See also

to_hdf, load, load_all, read

Notes

Creates one file per locale containing that locale’s chunk of each array.

If columns is a dictionary, its keys are used as the Parquet column names. Otherwise, if no names are supplied, integer names starting at 0 are used.

By default, any existing files at prefix_path are deleted regardless of whether they would be overwritten. If mode="append", Arkouda attempts to add columns as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
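The name-resolution rule described above (mapping keys first, then explicit names, then 0-up integers) can be sketched as follows. This is an illustrative model only; the function name is hypothetical and the real resolution happens inside arkouda.

```python
# Hedged sketch of how Parquet column names are chosen for a columns argument.
def resolve_names(columns, names=None):
    if isinstance(columns, dict):
        return list(columns)          # mapping keys become column names
    if names is not None:
        return list(names)            # explicit names for a list of arrays
    return [str(i) for i in range(len(columns))]  # fall back to 0-up integers

print(resolve_names(["arr0", "arr1"]))  # ['0', '1']
```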

Examples

>>> import arkouda as ak
>>> a = ak.arange(25)
>>> b = ak.arange(25)

Save with a mapping defining dataset names:

>>> ak.to_parquet({"a": a, "b": b}, "path/name_prefix")

Save using names instead of a mapping:

>>> ak.to_parquet([a, b], "path/name_prefix", names=["a", "b"])
arkouda.to_zarr(store_path: str, arr: arkouda.numpy.pdarrayclass.pdarray, chunk_shape)[source]

Write a pdarray to disk as a Zarr store.

Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.

Parameters:
  • store_path (str) – The path at which the Zarr store should be written

  • arr (pdarray) – The pdarray to be written to disk

  • chunk_shape (tuple) – The shape of the chunks to be used in the Zarr store

Raises:

ValueError – Raised if the number of dimensions in the chunk shape does not match the number of dimensions in the array, or if the array is not a 32- or 64-bit numeric type
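The dimensionality condition behind that ValueError can be sketched in plain Python. This is an illustrative check only; the function name is hypothetical and the real validation happens server-side.

```python
# Hedged sketch: a chunk shape must supply one extent per array dimension.
def check_chunk_shape(arr_shape, chunk_shape):
    if len(chunk_shape) != len(arr_shape):
        raise ValueError(
            f"chunk shape has {len(chunk_shape)} dimensions; "
            f"array has {len(arr_shape)}"
        )
    return True

print(check_chunk_shape((100, 100), (10, 10)))  # True
```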

arkouda.update_hdf(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, repack: bool = True)[source]

Overwrite the datasets whose names appear in names, or the keys of columns if columns is a dictionary.

Parameters:
  • columns (dict or list of pdarrays) – Collection of arrays to save

  • prefix_path (str) – Directory and filename prefix for output files

  • names (list of str) – Dataset names for the pdarrays

  • repack (bool) – Defaults to True. HDF5 does not release memory on delete, so when True, the inaccessible (overwritten) data is removed by repacking the file. When False, the old data remains in the file but is inaccessible. Setting repack to False yields better performance, but causes file sizes to grow.

Raises:

RuntimeError – Raised if a server-side error is thrown saving the datasets

Notes

  • If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.

  • If the datasets provided do not exist, they will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

  • This workflow is slightly different from to_hdf to prevent reading and creating a copy of the file for each dataset
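update_hdf's overwrite-or-add behavior can be modeled in plain Python, with a dict standing in for the HDF5 file. This is an illustrative sketch of the semantics in the Notes above, not arkouda's implementation; the function name is hypothetical.

```python
# Hedged sketch: named datasets are overwritten if present and added if
# missing, while untouched datasets are preserved.
def update_sketch(file_datasets, columns):
    updated = dict(file_datasets)
    for name, data in columns.items():
        updated[name] = data  # overwrite if present, add if missing
    return updated

f = {"a": [1, 2, 3], "b": [4, 5, 6]}
print(update_sketch(f, {"a": [9, 9, 9], "c": [0]}))
# {'a': [9, 9, 9], 'b': [4, 5, 6], 'c': [0]}
```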

arkouda.write_log(log_msg: str, tag: str = 'ClientGeneratedLog', log_lvl: LogLevel = LogLevel.INFO)[source]

Allow the user to write custom logs.

Parameters:
  • log_msg (str) – The message to be added to the server log

  • tag (str) – The tag to use in the log, taking the place of the server function name and allowing easy identification of custom logs. Defaults to “ClientGeneratedLog”

  • log_lvl (LogLevel) – The level at which the log is written. Defaults to LogLevel.INFO

See also

LogLevel

arkouda.xlogy(x: arkouda.numpy.pdarrayclass.pdarray | numpy.float64, y: arkouda.numpy.pdarrayclass.pdarray)[source]

Compute x * log(y) elementwise.

Parameters:
  • x (pdarray or np.float64) – x must have a datatype that is castable to float64

  • y (pdarray)

Return type:

arkouda.numpy.pdarrayclass.pdarray

Examples

>>> import arkouda as ak
>>> from arkouda.scipy.special import xlogy
>>> xlogy(ak.array([1, 2, 3, 4]), ak.array([5, 6, 7, 8]))
array([1.6094379124341003 3.5835189384561099 5.8377304471659395 8.317766166719343])
>>> xlogy(5.0, ak.array([1, 2, 3, 4]))
array([0.00000000000000000 3.4657359027997265 5.4930614433405491 6.9314718055994531])
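The outputs above are simply elementwise x * ln(y); the first example can be checked with plain Python:

```python
import math

# Elementwise x * ln(y) for the first xlogy example above.
x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
vals = [xi * math.log(yi) for xi, yi in zip(x, y)]
print([round(v, 4) for v in vals])  # [1.6094, 3.5835, 5.8377, 8.3178]
```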