arkouda¶
Arkouda: Exploratory data science at scale.
Arkouda is a Python API for exploratory data analysis on massive datasets. It leverages a Chapel-based backend to enable high-performance computing on distributed systems, while exposing a familiar NumPy- and Pandas-like interface to Python users.
Key Features¶
pdarray and Strings types for working with large numeric and string arrays.
Categorical, Series, DataFrame, and Index for labeled data analysis.
High-performance GroupBy, reductions, and broadcasting operations.
Interoperability with NumPy and Pandas for ease of use.
A scalable architecture suitable for HPC and cloud environments.
Example:¶
>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> b = a + 5
>>> print(b)
array([6 7 8])
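Because the interface mirrors NumPy, the same expression works unchanged on a local NumPy array — arkouda only changes where the computation runs:

```python
import numpy as np

# The same scalar-broadcast addition as the arkouda example above,
# executed locally with NumPy instead of on the arkouda server.
a = np.array([1, 2, 3])
b = a + 5
print(b)  # [6 7 8]
```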
For full documentation, visit: https://bears-r-us.github.io/arkouda/
Submodules¶
Attributes¶
Classes¶
- Arkouda-backed numeric/bool pandas ExtensionArray.
- Arkouda-backed arbitrary-precision integer dtype.
- Arkouda-backed boolean dtype.
- Arkouda-backed categorical pandas ExtensionArray.
- Arkouda-backed categorical dtype.
- Arkouda-backed 64-bit floating-point dtype.
- Extension dtype for Arkouda-backed 64-bit integers.
- Arkouda-backed string pandas ExtensionArray.
- Arkouda-backed string dtype.
- Arkouda-backed unsigned 64-bit integer dtype.
- Arkouda-backed unsigned 8-bit integer dtype.
- Descriptor for caching namespace-based accessors.
- Represents an array of values belonging to named categories.
- A DataFrame structure based on Arkouda arrays.
- A DataFrame that has been grouped by a subset of columns.
- Accessor for datetime-like operations on Arkouda Series.
- A column in a GroupBy that has been differenced.
- Sequence used for indexing and alignment.
- Enum for defining valid log levels used by ArkoudaLogger.
- A multi-level, or hierarchical, index object for Arkouda DataFrames and Series.
- The results of a power divergence statistical test.
- Base class for accessor implementations in Arkouda.
- Dictionary-like representation of a single row in an Arkouda DataFrame.
- One-dimensional Arkouda array with axis labels.
- Accessor for string operations on Arkouda Series.
Functions¶
- Apply a Python function to a pdarray.
- Check that the left and right objects are approximately equal.
- Check that two objects are approximately equal.
- Check that two Arkouda arrays are equivalent. Supports pdarray, Strings, …
- Check that two Arkouda-compatible arrays are equal.
- Check that two Arkouda pdarray objects are equivalent.
- Check that two Arkouda SegArray objects are equivalent.
- Check that two ak.Strings arrays are equivalent.
- Check that attributes are equal. Both objects must have the given attribute.
- Test that Categoricals are equivalent.
- Check classes are equal.
- Assert that a dictionary contains all the elements of an iterable.
- Check that the elements are equal, but not the same object.
- Assert that two dictionaries are equal.
- Wrap tm.assert_*_equal to dispatch to the appropriate test function.
- Dispatch to the appropriate assertion function depending on object types.
- Check that left and right DataFrame are equal.
- Check that two DataFrames are equal.
- Check that left and right Index are equal.
- Check that two Index objects are equal.
- Assert that the sequence is sorted.
- Check that left and right Series are equal.
- Check that two Series are equal.
- Compute the chi-square statistic and p-value.
- Compute the internal size of a hypothetical join between a and b. Returns …
- Add common datetime operation methods to a DatetimeAccessor class.
- Disable verbose logging.
- Enable verbose logging (DEBUG log level) for all ArkoudaLoggers.
- Export data from Arkouda to pandas.
- Convert a pandas …
- Generate a segmented array of variable-length, contiguous ranges between pairs of …
- Get a list of column names from CSV file(s).
- Get the names of the datasets in the provided files.
- Get the type of a file accessible to the server.
- Get null indices of a string column in a Parquet file.
- Import data from a file saved by pandas (HDF5/Parquet).
- Return a JSON formatted string containing information about the objects in names.
- Find the intersection of two Arkouda arrays.
- Find all the rows that are in both dataframes.
- Find the inverse of a permutation array.
- Inner-join on equality between two integer arrays where the time-window predicate is also true.
- Return a list containing the names of all registered objects.
- Return a list containing the names of all objects in the symbol table.
- Load objects previously saved with …
- Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with …
- Load the server's state.
- List the contents of an HDF5 or Parquet file on the Arkouda server.
- List the datasets within a file when a CSV does not have a header.
- Merge Arkouda DataFrames with a database-style join.
- Compute the power divergence statistic and p-value.
- Print verbose information for each object in names in a human-readable format.
- Read datasets from files.
- Read CSV file(s) into Arkouda objects.
- Read Arkouda objects from HDF5 files.
- Read Arkouda objects from Parquet files.
- Read datasets from files and tag each record with the file it was read from.
- Read a Zarr store from disk into a pdarray.
- Receive a pdarray sent by pdarray.transfer().
- Receive a pdarray sent by dataframe.transfer().
- Return data saved using ak.snapshot.
- Save the server's state.
- Create a snapshot of the current Arkouda namespace.
- Add common string operation methods to a StringAccessor class.
- Write Arkouda object(s) to CSV file(s).
- Save multiple named pdarrays to HDF5 files.
- Save multiple named arrays to Parquet files.
- Write a pdarray to disk as a Zarr store.
- Overwrite the datasets with name appearing in names or keys in columns if columns is a dictionary.
- Allow the user to write custom logs.
- Compute x * log(y).
Package Contents¶
- arkouda.AllSymbols = '__AllSymbols__'¶
- class arkouda.ArkoudaArray(data: arkouda.numpy.pdarrayclass.pdarray | numpy.ndarray | Sequence[Any] | ArkoudaArray, dtype: Any = None, copy: bool = False)[source]¶
Bases: arkouda.pandas.extension._arkouda_extension_array.ArkoudaExtensionArray, pandas.api.extensions.ExtensionArray

Arkouda-backed numeric/bool pandas ExtensionArray.

Wraps or converts supported inputs into an Arkouda pdarray to serve as the backing store. Ensures the underlying array is 1-D and lives on the Arkouda server.

- Parameters:
data (pdarray | ndarray | Sequence[Any] | ArkoudaArray) – Input to wrap or convert.
  - If an Arkouda pdarray, it is used directly unless dtype is given or copy=True, in which case a new array is created via ak.array.
  - If a NumPy array, it is transferred to Arkouda via ak.array.
  - If a Python sequence, it is converted to NumPy then to Arkouda.
  - If another ArkoudaArray, its underlying pdarray is reused.

dtype (Any, optional) – Desired dtype to cast to (NumPy dtype or Arkouda dtype string). If omitted, dtype is inferred from data.

copy (bool) – If True, attempt to copy the underlying data when converting/wrapping. Default is False.
- Raises:
TypeError – If data cannot be interpreted as an Arkouda array-like object.

ValueError – If the resulting array is not one-dimensional.
- default_fill_value¶
Sentinel used when filling missing values (default: -1).
- Type:
int
Examples
>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaArray
>>> ArkoudaArray(ak.arange(5))
ArkoudaArray([0 1 2 3 4])
>>> ArkoudaArray([10, 20, 30])
ArkoudaArray([10 20 30])
- all(axis=0, skipna=True, **kwargs)[source]¶
Return whether all elements are True.
This is mainly to support pandas’ BaseExtensionArray.equals, which calls .all() on the result of a boolean expression.
- any(axis=0, skipna=True, **kwargs)[source]¶
Return whether any element is True.
Added for symmetry with .all() and to support potential pandas boolean-reduction calls.
- astype(dtype: numpy.dtype[Any], copy: bool = True) → numpy.typing.NDArray[Any][source]¶
- astype(dtype: pandas.core.dtypes.dtypes.ExtensionDtype, copy: bool = True) → pandas.api.extensions.ExtensionArray
- astype(dtype: Any, copy: bool = True) → pandas.api.extensions.ExtensionArray | numpy.typing.NDArray[Any]
Cast the array to a specified dtype.
Casting rules:

- If dtype requests object, returns a NumPy NDArray[Any] of dtype object containing the array values.
- Otherwise, the target dtype is normalized using Arkouda’s dtype resolution rules.
- If the normalized dtype matches the current dtype and copy=False, returns self.
- In all other cases, casts the underlying Arkouda array to the target dtype and returns an Arkouda-backed ArkoudaExtensionArray.
- Parameters:
dtype (Any) – Target dtype. May be a NumPy dtype, pandas dtype, Arkouda dtype, or any dtype-like object accepted by Arkouda.
copy (bool) – Whether to force a copy when the target dtype matches the current dtype. Default is True.
- Returns:
The cast result. Returns a NumPy array only when casting to object; otherwise returns an Arkouda-backed ExtensionArray.
- Return type:
Union[ExtensionArray, NDArray[Any]]
Examples
Basic numeric casting returns an Arkouda-backed array:

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaArray
>>> a = ArkoudaArray(ak.array([1, 2, 3], dtype="int64"))
>>> a.astype("float64").to_ndarray()
array([1., 2., 3.])

Casting to the same dtype with copy=False returns the original object:

>>> b = a.astype("int64", copy=False)
>>> b is a
True

Forcing a copy when the dtype is unchanged returns a new array:

>>> c = a.astype("int64", copy=True)
>>> c is a
False
>>> c.to_ndarray()
array([1, 2, 3])

Casting to object materializes the data to a NumPy array:

>>> a.astype(object)
array([1, 2, 3], dtype=object)

NumPy and pandas dtype objects are also accepted:

>>> import numpy as np
>>> a.astype(np.dtype("bool")).to_ndarray()
array([ True,  True,  True])
- default_fill_value: int = -1¶
- property dtype¶
An instance of ExtensionDtype.
See also
api.extensions.ExtensionDtype – Base class for extension dtypes.
api.extensions.ExtensionArray – Base class for extension array types.
api.extensions.ExtensionArray.dtype – The dtype of an ExtensionArray.
Series.dtype – The dtype of a Series.
DataFrame.dtype – The dtype of a DataFrame.

Examples

>>> import pandas as pd
>>> pd.array([1, 2, 3]).dtype
Int64Dtype()
- equals(other)[source]¶
Return if another array is equivalent to this array.
Equivalent means that both arrays have the same shape and dtype, and all values compare equal. Missing values in the same location are considered equal (in contrast with normal equality).
- Parameters:
other (ExtensionArray) – Array to compare to this Array.
- Returns:
Whether the arrays are equivalent.
- Return type:
boolean
See also

numpy.array_equal – Equivalent method for numpy array.
Series.equals – Equivalent method for Series.
DataFrame.equals – Equivalent method for DataFrame.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> arr1 = pd.array([1, 2, np.nan])
>>> arr2 = pd.array([1, 2, np.nan])
>>> arr1.equals(arr2)
True

>>> arr1 = pd.array([1, 3, np.nan])
>>> arr2 = pd.array([1, 2, np.nan])
>>> arr1.equals(arr2)
False
- isna() → numpy.ndarray[source]¶
Return a boolean mask indicating missing values.

This method implements the pandas ExtensionArray.isna contract and always returns a NumPy ndarray of dtype bool with the same length as the array.

- Returns:
A boolean mask where True marks elements considered missing.
- Return type:
np.ndarray
- Raises:
TypeError – If the underlying data buffer does not support missing-value detection or cannot produce a boolean mask.
- property nbytes¶
The number of bytes needed to store this object in memory.
See also
ExtensionArray.shape – Return a tuple of the array dimensions.
ExtensionArray.size – The number of elements in the array.

Examples

>>> import pandas as pd
>>> pd.array([1, 2, 3]).nbytes
27
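For reference, the 27 bytes in the pandas example above come from the masked Int64 layout: 3 × 8 data bytes plus 3 × 1 mask bytes. A plain NumPy int64 array of the same length reports only the data buffer (a general NumPy fact, not Arkouda-specific):

```python
import numpy as np

# nbytes for a plain NumPy array is element count * itemsize.
a = np.array([1, 2, 3], dtype="int64")
print(a.itemsize)  # 8
print(a.nbytes)    # 24
```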
- value_counts(dropna: bool = True) → pandas.Series[source]¶
Return counts of unique values as a pandas Series.

This method computes the frequency of each distinct value in the underlying Arkouda array and returns the result as a pandas Series, with the unique values as the index and their counts as the data.

- Parameters:
dropna (bool) – Whether to exclude missing values. Currently, missing-value handling is supported only for floating-point data, where NaN values are treated as missing. Default is True.
- Returns:
A Series containing the counts of unique values. The index is an ArkoudaArray of unique values, and the values are an ArkoudaArray of counts.
- Return type:
pd.Series
Notes

- Only dropna=True is supported.
- The following pandas options are not yet implemented: normalize, sort, and bins.
- Counting is performed server-side in Arkouda; only the small result (unique values and counts) is materialized on the client.

Examples

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaArray
>>> a = ArkoudaArray(ak.array([1, 2, 1, 3, 2, 1]))
>>> a.value_counts()
1    3
2    2
3    1
dtype: int64

Floating-point data with NaN values:

>>> b = ArkoudaArray(ak.array([1.0, 2.0, float("nan"), 1.0]))
>>> b.value_counts()
1.0    2
2.0    1
dtype: int64
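Semantically, value_counts is a frequency count computed server-side; the pure-Python equivalent of the integer example above (no server needed) is:

```python
from collections import Counter

# value_counts() is a (server-side, distributed) frequency count;
# Counter produces the same value-to-count mapping locally.
counts = Counter([1, 2, 1, 3, 2, 1])
print(counts)  # Counter({1: 3, 2: 2, 3: 1})
```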
- arkouda.ArkoudaArrayLike¶
- class arkouda.ArkoudaBigintDtype[source]¶
Bases: _ArkoudaBaseDtype

Arkouda-backed arbitrary-precision integer dtype.

This dtype integrates Arkouda’s server-backed pdarray<bigint> with the pandas ExtensionArray interface via ArkoudaArray. It enables pandas objects (Series, DataFrame) to hold and operate on very large integers that exceed 64-bit precision, while keeping the data distributed on the Arkouda server.

- construct_array_type()[source]¶
Returns the ArkoudaArray class used for storage.

- classmethod construct_array_type()[source]¶
Return the ExtensionArray subclass that handles storage for this dtype.

- Returns:
The ArkoudaArray class associated with this dtype.
- Return type:
type[ArkoudaArray]
- kind = 'O'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
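The 'biufcmMOSUV' codes referenced above are NumPy's one-character type-kind classification; a few examples:

```python
import numpy as np

# A few of the 'biufcmMOSUV' kind codes used by the dtypes in this module.
print(np.dtype("bool").kind)     # b
print(np.dtype("int64").kind)    # i
print(np.dtype("uint64").kind)   # u
print(np.dtype("float64").kind)  # f
print(np.dtype(object).kind)     # O  (when no native NumPy type fits, e.g. bigint)
```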
- na_value = -1¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'bigint'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g. int.

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
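A quick illustration of why a dedicated bigint dtype exists: Python integers are arbitrary precision, but a NumPy int64 cannot represent values at or above 2**63:

```python
import numpy as np

big = 2**64 + 1          # fine as a Python int (arbitrary precision)
print(big.bit_length())  # 65

try:
    np.int64(big)        # does not fit in a signed 64-bit integer
except OverflowError as e:
    print("OverflowError:", e)
```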
- class arkouda.ArkoudaBoolDtype[source]¶
Bases: _ArkoudaBaseDtype

Arkouda-backed boolean dtype.

This dtype integrates Arkouda’s server-backed pdarray<bool> with the pandas ExtensionArray interface via ArkoudaArray. It allows pandas objects (Series, DataFrame) to store and manipulate distributed boolean arrays without materializing them on the client.

- construct_array_type()[source]¶
Returns the ArkoudaArray class used for storage.

- classmethod construct_array_type()[source]¶
Return the ExtensionArray subclass that handles storage for this dtype.

- Returns:
The ArkoudaArray class associated with this dtype.
- Return type:
type[ArkoudaArray]
- kind = 'b'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- na_value = False¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'bool_'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g. int.

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class arkouda.ArkoudaCategorical(data: arkouda.pandas.categorical.Categorical | ArkoudaCategorical | numpy.ndarray | Sequence[Any])[source]¶
Bases: arkouda.pandas.extension._arkouda_extension_array.ArkoudaExtensionArray, pandas.api.extensions.ExtensionArray

Arkouda-backed categorical pandas ExtensionArray.

Ensures the underlying data is an Arkouda Categorical. Accepts an existing Categorical or converts from Python/NumPy sequences of labels.

- Parameters:
data (Categorical | ArkoudaCategorical | ndarray | Sequence[Any]) – Input to wrap or convert.
  - If Categorical, used directly.
  - If another ArkoudaCategorical, its backing object is reused.
  - If list/tuple/ndarray, converted via ak.Categorical(ak.array(data)).
- Raises:
TypeError – If data cannot be converted to Arkouda Categorical.
- astype(dtype: numpy.dtype[Any], copy: bool = True) → numpy.typing.NDArray[Any][source]¶
- astype(dtype: pandas.core.dtypes.dtypes.ExtensionDtype, copy: bool = True) → pandas.api.extensions.ExtensionArray
- astype(dtype: Any, copy: bool = True) → pandas.api.extensions.ExtensionArray | numpy.typing.NDArray[Any]

Cast to a specified dtype.

- If dtype is categorical (pandas category / CategoricalDtype / ArkoudaCategoricalDtype), returns an Arkouda-backed ArkoudaCategorical (optionally copied).
- If dtype requests object, returns a NumPy ndarray of dtype object containing the category labels (materialized to the client).
- If dtype requests a string dtype, returns an Arkouda-backed ArkoudaStringArray containing the labels as strings.
- Otherwise, casts the labels (as strings) to the requested dtype and returns an Arkouda-backed ExtensionArray.

- Parameters:
dtype (Any) – Target dtype.

copy (bool) – Whether to force a copy when possible. If categorical-to-categorical and copy=True, attempts to copy the underlying Arkouda Categorical (if supported). Default is True.
- Returns:
The cast result. Returns a NumPy array only when casting to object; otherwise returns an Arkouda-backed ExtensionArray.
- Return type:
Union[ExtensionArray, NDArray[Any]]
Examples
Casting to category returns an Arkouda-backed categorical array:

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaCategorical
>>> c = ArkoudaCategorical(ak.Categorical(ak.array(["x", "y", "x"])))
>>> out = c.astype("category")
>>> out is c
False

Forcing a copy when casting to the same categorical dtype returns a new array:

>>> out2 = c.astype("category", copy=True)
>>> out2 is c
False
>>> out2.to_ndarray()
array(['x', 'y', 'x'], dtype='<U...')

Casting to object materializes the category labels to a NumPy object array:

>>> c.astype(object)
array(['x', 'y', 'x'], dtype=object)

Casting to a string dtype returns an Arkouda-backed string array of labels:

>>> s = c.astype("string")
>>> s.to_ndarray()
array(['x', 'y', 'x'], dtype='<U1')

Casting to another dtype casts the labels-as-strings and returns an Arkouda-backed array:

>>> c_num = ArkoudaCategorical(ak.Categorical(ak.array(["1", "2", "3"])))
>>> a = c_num.astype("int64")
>>> a.to_ndarray()
array([1, 2, 3])
- property dtype¶
An instance of ExtensionDtype.
See also
api.extensions.ExtensionDtype – Base class for extension dtypes.
api.extensions.ExtensionArray – Base class for extension array types.
api.extensions.ExtensionArray.dtype – The dtype of an ExtensionArray.
Series.dtype – The dtype of a Series.
DataFrame.dtype – The dtype of a DataFrame.

Examples

>>> import pandas as pd
>>> pd.array([1, 2, 3]).dtype
Int64Dtype()
- isna() → numpy.ndarray[source]¶
Return a boolean mask indicating missing values.

This implements the pandas ExtensionArray.isna contract and returns a NumPy ndarray of dtype bool with the same length as this categorical array.

- Returns:
Boolean mask where True indicates a missing value.
- Return type:
np.ndarray
- Raises:
TypeError – If the underlying categorical cannot expose its codes or if missing detection is unsupported.
- value_counts(dropna: bool = True) → pandas.Series[source]¶
Return counts of categories as a pandas Series.

This method computes category frequencies from the underlying Arkouda Categorical and returns them as a pandas Series, where the index contains the category labels and the values contain the corresponding counts.

- Parameters:
dropna (bool) – Whether to drop missing values from the result. When True, the result is filtered using the categorical’s na_value. When False, all categories returned by the underlying computation are included. Default is True.
- Returns:
A Series containing category counts. The index is an ArkoudaStringArray of category labels and the values are an ArkoudaArray of counts.
- Return type:
pd.Series
Notes
- The result is computed server-side in Arkouda; only the (typically small) output of categories and counts is materialized for the pandas Series.
- This method does not yet support pandas options such as normalize, sort, or bins.
- The handling of missing values depends on the Arkouda Categorical definition of na_value.
Examples
>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaCategorical
>>> a = ArkoudaCategorical(["a", "b", "a", "c", "b", "a"])
>>> a.value_counts()
a    3
b    2
c    1
dtype: int64
- class arkouda.ArkoudaCategoricalDtype[source]¶
Bases: _ArkoudaBaseDtype

Arkouda-backed categorical dtype.

This dtype integrates Arkouda’s distributed Categorical type with the pandas ExtensionArray interface via ArkoudaCategorical. It enables pandas objects (Series, DataFrame) to hold categorical data stored and processed on the Arkouda server, while exposing familiar pandas APIs.

- construct_array_type()[source]¶
Returns the ArkoudaCategorical used as the storage class.

- classmethod construct_array_type()[source]¶
Return the ExtensionArray subclass that handles storage for this dtype.

- Returns:
The ArkoudaCategorical class associated with this dtype.
- Return type:
type[ArkoudaCategorical]
- kind = 'O'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- na_value = -1¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'category'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g. int.

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class arkouda.ArkoudaFloat64Dtype[source]¶
Bases: _ArkoudaBaseDtype

Arkouda-backed 64-bit floating-point dtype.

This dtype integrates Arkouda’s server-backed pdarray<float64> with the pandas ExtensionArray interface via ArkoudaArray. It allows pandas objects (Series, DataFrame) to store and manipulate large distributed float64 arrays without materializing them on the client.

- construct_array_type()[source]¶
Returns the ArkoudaArray class used for storage.

- classmethod construct_array_type()[source]¶
Return the ExtensionArray subclass that handles storage for this dtype.

- Returns:
The ArkoudaArray class associated with this dtype.
- Return type:
type[ArkoudaArray]
- kind = 'f'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- na_value¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'float64'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g. int.

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class arkouda.ArkoudaInt64Dtype[source]¶
Bases: _ArkoudaBaseDtype

Extension dtype for Arkouda-backed 64-bit integers.

This dtype allows seamless use of Arkouda’s distributed int64 arrays inside pandas objects (Series, Index, DataFrame). It is backed by arkouda.pdarray with dtype='int64' and integrates with pandas via the ArkoudaArray extension array.

- construct_array_type()[source]¶
Return the associated extension array class (ArkoudaArray).
- classmethod construct_array_type()[source]¶
Return the associated pandas ExtensionArray type.
This is part of the pandas ExtensionDtype interface and is used internally by pandas when constructing arrays of this dtype. It ensures that operations like Series(..., dtype=ArkoudaInt64Dtype()) produce the correct Arkouda-backed extension array.

- Returns:
The ArkoudaArray class that implements the storage and behavior for this dtype.
- Return type:
type[ArkoudaArray]
Notes
This hook tells pandas which ExtensionArray to instantiate whenever this dtype is requested.
All Arkouda dtypes defined in this module will return ArkoudaArray (or a subclass thereof).
Examples
>>> from arkouda.pandas.extension import ArkoudaInt64Dtype
>>> ArkoudaInt64Dtype.construct_array_type()
<class 'arkouda.pandas.extension._arkouda_array.ArkoudaArray'>
- kind = 'i'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- na_value = -1¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'int64'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g. int.

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class arkouda.ArkoudaStringArray(data: arkouda.numpy.strings.Strings | numpy.ndarray | Sequence[Any] | ArkoudaStringArray)[source]¶
Bases: arkouda.pandas.extension._arkouda_extension_array.ArkoudaExtensionArray, pandas.api.extensions.ExtensionArray

Arkouda-backed string pandas ExtensionArray.

Ensures the underlying data is an Arkouda Strings object. Accepts existing Strings or converts from NumPy arrays and Python sequences of strings.

- Parameters:
data (Strings | ndarray | Sequence[Any] | ArkoudaStringArray) – Input to wrap or convert.
  - If Strings, used directly.
  - If NumPy/sequence, converted via ak.array.
  - If another ArkoudaStringArray, its backing Strings is reused.
- Raises:
TypeError – If data cannot be converted to Arkouda Strings.
- astype(dtype: numpy.dtype[Any], copy: bool = True) → numpy.typing.NDArray[Any][source]¶
- astype(dtype: pandas.core.dtypes.dtypes.ExtensionDtype, copy: bool = True) → pandas.api.extensions.ExtensionArray
- astype(dtype: Any, copy: bool = True) → pandas.api.extensions.ExtensionArray | numpy.typing.NDArray[Any]

Cast to a specified dtype.

Casting rules:

- If dtype requests object, returns a NumPy NDArray[Any] of dtype object containing the string values.
- If dtype is a string dtype (e.g. pandas StringDtype, NumPy unicode, or Arkouda string dtype), returns an ArkoudaStringArray. If copy=True, attempts to copy the underlying Arkouda Strings data.
- For all other dtypes, casts the underlying Arkouda Strings using Strings.astype and returns an Arkouda-backed ArkoudaExtensionArray constructed from the result.

- Parameters:
dtype (Any) – Target dtype. May be a NumPy dtype, pandas dtype, or Arkouda dtype.

copy (bool) – Whether to force a copy when the result is an ArkoudaStringArray. Default is True.
- Returns:
The cast result. Returns a NumPy array only when casting to object; otherwise returns an Arkouda-backed ExtensionArray.
- Return type:
Union[ExtensionArray, NDArray[Any]]
Examples
Casting to a string dtype returns an Arkouda-backed string array:

>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaStringArray
>>> s = ArkoudaStringArray(ak.array(["a", "b", "c"]))
>>> out = s.astype("string")
>>> out is s
False

Forcing a copy when casting to a string dtype returns a new array:

>>> out2 = s.astype("string", copy=True)
>>> out2 is s
False
>>> out2.to_ndarray()
array(['a', 'b', 'c'], dtype='<U1')

Casting to object materializes the data to a NumPy array:

>>> s.astype(object)
array(['a', 'b', 'c'], dtype=object)

Casting to a non-string dtype uses Arkouda to cast the underlying strings and returns an Arkouda-backed ExtensionArray:

>>> s_num = ArkoudaStringArray(ak.array(["1", "2", "3"]))
>>> a = s_num.astype("int64")
>>> a.to_ndarray()
array([1, 2, 3])

NumPy and pandas dtype objects are also accepted:

>>> import numpy as np
>>> a = s_num.astype(np.dtype("float64"))
>>> a.to_ndarray()
array([1., 2., 3.])
- property dtype¶
An instance of ExtensionDtype.
See also
api.extensions.ExtensionDtype – Base class for extension dtypes.
api.extensions.ExtensionArray – Base class for extension array types.
api.extensions.ExtensionArray.dtype – The dtype of an ExtensionArray.
Series.dtype – The dtype of a Series.
DataFrame.dtype – The dtype of a DataFrame.

Examples

>>> import pandas as pd
>>> pd.array([1, 2, 3]).dtype
Int64Dtype()
- isna()[source]¶
A 1-D array indicating if each value is missing.
- Returns:
In most cases, this should return a NumPy ndarray. For exceptional cases like SparseArray, where returning an ndarray would be expensive, an ExtensionArray may be returned.
- Return type:
numpy.ndarray or pandas.api.extensions.ExtensionArray

See also

ExtensionArray.dropna – Return ExtensionArray without NA values.
ExtensionArray.fillna – Fill NA/NaN values using the specified method.

Notes

If returning an ExtensionArray, then:

- na_values._is_boolean should be True
- na_values should implement ExtensionArray._reduce()
- na_values should implement ExtensionArray._accumulate()
- na_values.any and na_values.all should be implemented

Examples

>>> import numpy as np
>>> import pandas as pd
>>> arr = pd.array([1, 2, np.nan, np.nan])
>>> arr.isna()
array([False, False,  True,  True])
- item(*args, **kwargs)[source]¶
Return the array element at the specified position as a Python scalar.
- Parameters:
index (int, optional) – Position of the element. If not provided, the array must contain exactly one element.
- Returns:
The element at the specified position.
- Return type:
scalar
- Raises:
ValueError – If no index is provided and the array does not have exactly one element.
IndexError – If the specified position is out of bounds.
See also
numpy.ndarray.item – Return the item of an array as a scalar.

Examples

>>> import pandas as pd
>>> arr = pd.array([1], dtype="Int64")
>>> arr.item()
np.int64(1)

>>> arr = pd.array([1, 2, 3], dtype="Int64")
>>> arr.item(0)
np.int64(1)
>>> arr.item(2)
np.int64(3)
- value_counts(dropna: bool = True) → pandas.Series[source]¶
Return counts of unique strings as a pandas Series.

This method computes the frequency of each distinct string value in the underlying Arkouda Strings object and returns the result as a pandas Series, with the unique string values as the index and their counts as the data.

- Parameters:
dropna (bool) – Whether to exclude missing values. Missing-value handling for Arkouda string arrays is not yet implemented, so this parameter is accepted for pandas compatibility but currently has no effect. Default is True.
- Returns:
A Series containing the counts of unique string values. The index is an ArkoudaStringArray of unique values, and the values are an ArkoudaArray of counts.
- Return type:
pd.Series
Notes
- The following pandas options are not yet implemented: normalize, sort, and bins.
- Counting is performed server-side in Arkouda; only the small result (unique values and counts) is materialized on the client.
Examples
Basic usage:
>>> import arkouda as ak
>>> from arkouda.pandas.extension import ArkoudaStringArray
>>>
>>> s = ArkoudaStringArray(["red", "blue", "red", "green", "blue", "red"])
>>> s.value_counts()
red      3
blue     2
green    1
dtype: int64
Empty input:
>>> empty = ArkoudaStringArray([])
>>> empty.value_counts()
Series([], dtype: int64)
- class arkouda.ArkoudaStringDtype[source]¶
Bases:
_ArkoudaBaseDtype
Arkouda-backed string dtype.
This dtype integrates Arkouda’s distributed
Strings type with the pandas ExtensionArray interface via ArkoudaStringArray. It enables pandas objects (Series, DataFrame) to hold large, server-backed string columns without converting to NumPy or Python objects.
- construct_array_type()[source]¶
Returns the
ArkoudaStringArray used as the storage class.
- classmethod construct_array_type()[source]¶
Return the ExtensionArray subclass that handles storage for this dtype.
- Returns:
The
ArkoudaStringArray class associated with this dtype.
- Return type:
- kind = 'O'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- na_value = ''¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'string'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g.
int
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class arkouda.ArkoudaUint64Dtype[source]¶
Bases:
_ArkoudaBaseDtype
Arkouda-backed unsigned 64-bit integer dtype.
This dtype integrates Arkouda’s
uint64 arrays with pandas, allowing users to create pandas.Series or pandas.DataFrame objects that store their data on the Arkouda server while still conforming to the pandas ExtensionArray API.
- construct_array_type()[source]¶
Return the
ArkoudaArray class used as the storage container for this dtype.
Examples
>>> import arkouda as ak
>>> import pandas as pd
>>> from arkouda.pandas.extension import ArkoudaUint64Dtype, ArkoudaArray
>>> arr = ArkoudaArray(ak.array([1, 2, 3], dtype="uint64"))
>>> s = pd.Series(arr, dtype=ArkoudaUint64Dtype())
>>> s
0    1
1    2
2    3
dtype: uint64
- classmethod construct_array_type()[source]¶
Return the ExtensionArray class associated with this dtype.
This is required by the pandas ExtensionDtype API. It tells pandas which
ExtensionArray subclass should be used to hold data of this dtype inside a pandas.Series or pandas.DataFrame.
- Returns:
The
ArkoudaArray class, which implements the storage and operations for Arkouda-backed arrays.
- Return type:
- kind = 'u'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- na_value = -1¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'uint64'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g.
int
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class arkouda.ArkoudaUint8Dtype[source]¶
Bases:
_ArkoudaBaseDtype
Arkouda-backed unsigned 8-bit integer dtype.
This dtype integrates Arkouda’s
uint8 arrays with the pandas ExtensionArray API, allowing pandas Series and DataFrame objects to store and operate on Arkouda-backed unsigned 8-bit integers. The underlying storage is an Arkouda pdarray<uint8>, exposed through the ArkoudaArray extension array.
- construct_array_type()[source]¶
Returns the
ArkoudaArray type that provides the storage and behavior for this dtype.
- classmethod construct_array_type()[source]¶
Return the ExtensionArray subclass that handles storage for this dtype.
This method is required by the pandas ExtensionDtype interface. It tells pandas which ExtensionArray class to use when creating arrays of this dtype (for example, when calling
Series(..., dtype="arkouda.uint8")).
- Returns:
The
ArkoudaArray class associated with this dtype.
- Return type:
- kind = 'u'¶
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also
numpy.dtype.kind
- na_value = -1¶
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
- name = 'uint8'¶
A string identifying the data type.
Will be used for display in, e.g.
Series.dtype
- type¶
The scalar type for the array, e.g.
int
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.
- class arkouda.CachedAccessor(name: str, accessor)[source]¶
Descriptor for caching namespace-based accessors.
This custom property-like object enables lazy initialization of accessors (e.g., .str, .dt) on Series-like objects, similar to pandas-style extension accessors.
- Parameters:
Notes
The accessor class’s
__init__ method must accept a single positional argument, which should be one of Series, DataFrame, or Index.
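The lazy-initialization pattern described above can be sketched in plain Python. All names here (CachedAccessorSketch, StrAccessor, MySeries) are hypothetical and for illustration only; this is not Arkouda's actual implementation.

```python
class CachedAccessorSketch:
    """Lazily create and cache an accessor instance per host object."""

    def __init__(self, name, accessor):
        self._name = name          # attribute name the descriptor is bound to, e.g. "str"
        self._accessor = accessor  # accessor class, instantiated on first access

    def __get__(self, obj, cls):
        if obj is None:
            # Accessed on the class itself: return the accessor class.
            return self._accessor
        accessor_obj = self._accessor(obj)  # accessor receives the host object
        # Cache on the instance; since this is a non-data descriptor, the
        # cached instance attribute shadows it on subsequent accesses.
        object.__setattr__(obj, self._name, accessor_obj)
        return accessor_obj


class StrAccessor:
    def __init__(self, series):
        self._series = series

    def upper(self):
        return [s.upper() for s in self._series.data]


class MySeries:
    str = CachedAccessorSketch("str", StrAccessor)

    def __init__(self, data):
        self.data = data


s = MySeries(["a", "b"])
print(s.str.upper())   # ['A', 'B']
print(s.str is s.str)  # True: initialized once, then cached
```

Note that the accessor's __init__ runs only on the first access per object, which is why the docstring requires it to accept the host object as its single positional argument.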
- class arkouda.Categorical(values, **kwargs)[source]¶
Represents an array of values belonging to named categories.
Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.
- Parameters:
values (Strings, Categorical, pd.Categorical) – Values to convert to categories
na_value (str scalar) – The value to use to represent missing/null data
- permutation¶
The permutation that groups the values in the same order as categories
- size¶
The number of items in the array
- Type:
- nlevels¶
The number of distinct categories
- Type:
- ndim¶
The rank of the array (currently only rank 1 arrays supported)
- Type:
- shape¶
The sizes of each dimension of the array
- Type:
tuple
- BinOps¶
- RegisterablePieces¶
- RequiredPieces¶
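The memory savings described above come from factorizing repeated strings into small integer codes plus a single table of unique categories. A minimal NumPy sketch of that idea (illustrative only; Arkouda performs the equivalent factorization server-side on distributed data):

```python
import numpy as np

values = np.array(["dog", "cat", "dog", "bird", "dog"])

# Factorize: one small table of unique labels, one integer code per element.
categories, codes = np.unique(values, return_inverse=True)

print(categories)  # ['bird' 'cat' 'dog']
print(codes)       # [2 1 2 0 2]

# Decoding the codes through the category table restores the values.
assert (categories[codes] == values).all()
```

With many repeated values, storing one integer per element plus a small label table is far cheaper than storing every string, and comparisons reduce to integer operations.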
- argsort(algorithm: arkouda.numpy.sorting.SortingAlgorithm = SortingAlgorithm.RadixSortLSD, ascending: bool = True) arkouda.numpy.pdarrayclass.pdarray[source]¶
Return the permutation of indices that would sort the Categorical.
Sorting is based on the order of the Categorical’s categories, not on the underlying codes.
- Parameters:
algorithm (SortingAlgorithm, default SortingAlgorithm.RadixSortLSD) – The sorting algorithm to use.
ascending (bool, default True) – Whether to return indices that would sort the Categorical in ascending category order. If False, returns indices for descending order.
- Returns:
An array of indices such that self[index] is sorted by category order.
- Return type:
Examples
>>> import arkouda as ak
>>> cat = ak.Categorical(ak.array(['dog', 'cat', 'dog', 'bird']))
>>> cat.argsort()
array([3 1 0 2])
>>> cat.argsort(ascending=False)
array([2 0 1 3])
The result can be used to reorder the Categorical:
>>> sorted_cat = cat[cat.argsort()]
>>> sorted_cat
array(['bird', 'cat', 'dog', 'dog'])
- categories: arkouda.numpy.strings.Strings¶
- concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical[source]¶
Merge this Categorical with other Categorical objects.
Concatenate the arrays and synchronize the categories across all inputs.
- Parameters:
others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
The merged Categorical object
- Return type:
- Raises:
TypeError – Raised if any others array objects are not Categorical objects
Notes
This operation can be expensive – slower than concatenating Strings.
- contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]¶
Check whether each element contains the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- copy() Categorical[source]¶
Return a copy of the given Categorical.
- Returns:
A deep copy of the Categorical.
- Return type:
- dtype¶
- endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]¶
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- equals(other) arkouda.numpy.dtypes.bool_scalars[source]¶
Whether Categoricals are the same size and all entries are equal.
- Parameters:
other (object) – object to compare.
- Returns:
True if the Categoricals are the same, otherwise False.
- Return type:
Examples
>>> import arkouda as ak
>>> c = Categorical(ak.array(["a", "b", "c"]))
>>> c_cpy = Categorical(ak.array(["a", "b", "c"]))
>>> c.equals(c_cpy)
np.True_
>>> c2 = Categorical(ak.array(["a", "x", "c"]))
>>> c.equals(c2)
np.False_
- classmethod from_codes(codes: arkouda.numpy.pdarrayclass.pdarray, categories: arkouda.numpy.strings.Strings, permutation=None, segments=None, **kwargs) Categorical[source]¶
Make a Categorical from codes and categories arrays.
If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.
- Parameters:
- Returns:
The Categorical object created from the input parameters
- Return type:
- Raises:
TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
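The codes/categories relationship that from_codes relies on can be sketched client-side with NumPy (names illustrative; the real constructor stores both pieces on the Arkouda server):

```python
import numpy as np

categories = np.array(["low", "mid", "high"])
codes = np.array([0, 2, 1, 0], dtype=np.int64)

# from_codes-style sanity checks: codes must be integers within range.
assert np.issubdtype(codes.dtype, np.integer)
assert codes.min() >= 0 and codes.max() < len(categories)

# Each code indexes into the categories array to recover the values.
print(categories[codes])  # ['low' 'high' 'mid' 'low']
```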
- classmethod from_return_msg(rep_msg) Categorical[source]¶
Create categorical from return message from server.
Notes
This is currently only used when reading a Categorical from HDF5 files.
- group() arkouda.numpy.pdarrayclass.pdarray[source]¶
Return the permutation that groups the array, placing equivalent categories together.
All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
See also
GroupBy, unique
Notes
This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.
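The contiguous-block guarantee described above can be illustrated with a sort permutation over dense integer codes in NumPy (a sketch of the idea, not Arkouda's distributed algorithm):

```python
import numpy as np

codes = np.array([2, 0, 2, 1, 0, 2])

# A sort permutation over the codes places equal values together.
perm = np.argsort(codes, kind="stable")
grouped = codes[perm]
print(grouped)  # [0 0 1 2 2 2] -- each category forms one contiguous block
```

Sorting small dense integers like these is what makes Categorical.group() faster than hashing and grouping full strings.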
- hash() Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]¶
Compute a 128-bit hash of each element of the Categorical.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- in1d(test: arkouda.numpy.strings.Strings | Categorical) arkouda.numpy.pdarrayclass.pdarray[source]¶
Whether each element is also present in the test Strings or Categorical object.
Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.
- Parameters:
test (Union[Strings, Categorical]) – The values against which to test each value of self.
- Returns:
The values self[in1d] are in the test Strings or Categorical object.
- Return type:
- Raises:
TypeError – Raised if test is not a Strings or Categorical object
See also
unique, intersect1d, union1d
Notes
in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences.
in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.
Examples
>>> import arkouda as ak
>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True True True True True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False False False False False])
- info() str[source]¶
Return a JSON formatted string containing information about all components of self.
- Returns:
JSON string containing information about all components of self
- Return type:
- is_registered() numpy.bool_[source]¶
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
register, attach, unregister, unregister_categorical_by_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- logger¶
- property nbytes¶
The size of the Categorical in bytes.
- Returns:
The size of the Categorical in bytes.
- Return type:
int
- nlevels: arkouda.numpy.dtypes.int_scalars¶
- objType = 'Categorical'¶
- permutation: arkouda.numpy.pdarrayclass.pdarray | None¶
- pretty_print_info() None[source]¶
Print information about all components of self in a human-readable format.
- register(user_defined_name: str) Categorical[source]¶
Register this Categorical object and underlying components with the Arkouda server.
- Parameters:
user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components
- Returns:
The same Categorical, which is now registered with the arkouda server and has an updated name. The modification is in place; the original object is returned to support a fluent programming style. Note that you cannot register two different Categoricals under the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Categorical with the user_defined_name
See also
unregister, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- reset_categories() Categorical[source]¶
Recompute the category labels, discarding any unused labels.
This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.
- Returns:
A Categorical object generated from the current instance
- Return type:
- segments: arkouda.numpy.pdarrayclass.pdarray | None¶
- set_categories(new_categories, na_value=None)[source]¶
Set categories to user-defined values.
- Parameters:
new_categories (Strings) – The array of new categories to use. Must be unique.
na_value (str scalar) – The value to use to represent missing/null data
- Returns:
A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.
- Return type:
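The remapping rule described above (values kept if present in the new categories, otherwise replaced by the NA value) can be sketched in NumPy; this is illustrative only, not the server-side implementation:

```python
import numpy as np

values = np.array(["dog", "cat", "fish", "dog"])
new_categories = np.array(["cat", "dog"])
na_value = "N/A"

# Values absent from the new categories map to the NA value.
remapped = np.where(np.isin(values, new_categories), values, na_value)
print(remapped)  # ['dog' 'cat' 'N/A' 'dog']
```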
- shape: tuple¶
- sort_values()[source]¶
Return a sorted Categorical by category labels.
- Returns:
A new Categorical with values sorted by category.
- Return type:
- classmethod standardize_categories(arrays, na_value='N/A')[source]¶
Standardize an array of Categoricals so that they share the same categories.
- Parameters:
arrays (sequence of Categoricals) – The Categoricals to standardize
na_value (str scalar) – The value to use to represent missing/null data
- Returns:
A list of the original Categoricals remapped to the shared categories.
- Return type:
List of Categoricals
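A sketch of the standardization idea in NumPy: the shared category set is the union of every input's labels, and each array's codes are recomputed against it (illustrative only; Arkouda does this on server-side arrays):

```python
import numpy as np

a = np.array(["red", "blue", "red"])
b = np.array(["blue", "green"])

# Shared categories: the union of all labels, here in sorted order.
shared = np.union1d(a, b)
print(shared)  # ['blue' 'green' 'red']

# Recompute each array's codes against the shared, sorted categories.
codes_a = np.searchsorted(shared, a)
codes_b = np.searchsorted(shared, b)

# Decoding through the shared table reproduces the original values.
assert (shared[codes_a] == a).all() and (shared[codes_b] == b).all()
```

After standardization, all arrays index into the same category table, which is what makes them comparable and concatenable without further remapping.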
- startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]¶
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- to_hdf(prefix_path: str, dataset: str = 'categorical_array', mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute')[source]¶
Save the Categorical to HDF5.
The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode ({'truncate', 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type ({"single", "distribute"}) – Default: "distribute". When set to "single", the dataset is written to a single file. When "distribute", the dataset is written to one file per locale.
See also
- to_ndarray() numpy.ndarray[source]¶
Convert the array to a np.ndarray.
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray of strings corresponding to the values in this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed
ak.core.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.core.client.maxTransferBytes to a larger value, but proceed with caution.
- to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str[source]¶
[Not Yet Implemented] Save the Categorical to a Parquet dataset.
This method is currently not supported and will raise a RuntimeError. Parquet support for Categorical is under development.
When implemented, this method will write the Categorical to a set of Parquet files, one file per locale on the Arkouda server. Each file will be named using the prefix_path with locale-specific suffixes.
- Parameters:
prefix_path (str) – The directory and filename prefix shared by all output files.
dataset (str, default="categorical_array") – The dataset name to use to create the Parquet files.
mode ({'truncate', 'append'}, default='truncate') – Specifies write behavior. Use ‘truncate’ to overwrite existing files or ‘append’ to add to them. (Appending is not yet efficient.)
compression (str, optional) – Compression algorithm to use when writing the file. Supported values include: ‘snappy’, ‘gzip’, ‘brotli’, ‘zstd’, ‘lz4’. Default is None (no compression).
- Returns:
A message indicating the result of the operation.
- Return type:
- Raises:
RuntimeError – Always raised. Parquet export for Categorical is not yet supported.
Notes
The specified prefix_path must be writable and accessible to the Arkouda server.
The user must have write permission.
Output files will be named as
<prefix_path>_LOCALE<i> for each locale i.
Appending mode is supported, but is not efficient.
File extensions are not used to determine file type.
If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a
RuntimeError will result.
See also
to_hdf : Save the Categorical to HDF5 format (currently supported).
- to_strings() arkouda.numpy.strings.Strings[source]¶
Convert the Categorical to Strings.
- Returns:
A Strings object corresponding to the values in this Categorical.
- Return type:
Examples
>>> import arkouda as ak
>>> a = ak.array(["a","b","c"])
>>> a
array(['a', 'b', 'c'])
>>> c = ak.Categorical(a)
>>> c.to_strings()
array(['a', 'b', 'c'])
>>> isinstance(c.to_strings(), ak.Strings)
True
- tolist() List[str][source]¶
Convert the Categorical to a list.
Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list of strings corresponding to the values in this Categorical
- Return type:
List[str]
Notes
The number of bytes in the Categorical cannot exceed
ak.core.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.core.client.maxTransferBytes to a larger value, but proceed with caution.
- transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]¶
Send a Categorical object to a different Arkouda server.
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, using the range {port..(port+numLocales-1)}; e.g., for an Arkouda server of 4 nodes with port 1234 passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data. This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
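The port usage described above can be sketched with a small helper. This helper (transfer_ports) is hypothetical and not part of the Arkouda API; it only illustrates the one-port-per-locale layout from the docstring's example.

```python
def transfer_ports(base_port, num_locales):
    """Ports a transfer would occupy: one per locale, starting at base_port."""
    return list(range(base_port, base_port + num_locales))

# Matches the docstring's example: a 4-node server with port 1234 passed.
print(transfer_ports(1234, 4))  # [1234, 1235, 1236, 1237]
```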
- unique() Categorical[source]¶
Return the unique category values in the Categorical.
- Returns:
A new Categorical containing only the unique category labels in use.
- Return type:
- unregister() None[source]¶
Unregister this Categorical object.
Unregister this Categorical object from the arkouda server, where it was previously registered using register() and/or attached to using attach().
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- update_hdf(prefix_path, dataset='categorical_array', repack=True)[source]¶
Overwrite the dataset with the provided name using this Categorical object.
If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Categorical
Notes
If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.
- class arkouda.DataFrame(initialdata=None, index=None, columns=None)[source]¶
Bases:
collections.UserDict
A DataFrame structure based on Arkouda arrays.
- Parameters:
initialdata (list or dict of lists, tuples, or pdarrays) – Each list or dictionary entry corresponds to one column of data and should be a homogeneous type. Different columns may have different types. If using a dictionary, keys should be strings.
index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.
columns (list, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to a stringified integer range.
Examples
Create an empty
DataFrame and add a column of data:
>>> import arkouda as ak
>>> df = ak.DataFrame()
>>> df["a"] = ak.array([1, 2, 3])
>>> df
   a
0  1
1  2
2  3
(3 rows x 1 columns)
Create a new
DataFrame using a dictionary of data:
>>> userName = ak.array(["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame(
...     {
...         "userName": userName,
...         "userID": userID,
...         "item": item,
...         "day": day,
...         "amount": amount,
...     }
... )
>>> df
  userName  userID  item  day  amount
0    Alice     111     0    5     0.5
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3
5    Alice     111     0    6     0.6
(6 rows x 5 columns)
Indexing works slightly differently than with pandas:
>>> df[0]
{'userName': np.str_('Alice'), 'userID': np.int64(111), 'item': np.int64(0), 'day': np.int64(5), 'amount': np.float64(0.5)}
>>> df["userID"]
array([111 222 111 333 222 111])
>>> df["userName"]
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
Select rows by integer array:
>>> df[ak.array([1, 3, 5])]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
3    Carol     333     1    5     1.2
5    Alice     111     0    6     0.6
(3 rows x 5 columns)
Slice rows:
>>> df[1:5:1]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
4      Bob     222     2    6     4.3
(4 rows x 5 columns)
>>> df[ak.array([1, 2, 3])]
  userName  userID  item  day  amount
1      Bob     222     0    5     0.6
2    Alice     111     1    6     1.1
3    Carol     333     1    5     1.2
(3 rows x 5 columns)
Select columns by name:
>>> df[["userID", "day"]]
   userID  day
0     111    5
1     222    5
2     111    6
3     333    5
4     222    6
5     111    6
(6 rows x 2 columns)
- all(axis=0) arkouda.pandas.series.Series | bool[source]¶
Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element along a DataFrame axis that is False.
Currently ignores any columns that are not of type bool. This is equivalent to the pandas option bool_only=True.
- Parameters:
axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –
Indicate which axis or axes should be reduced.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
- Return type:
- Raises:
ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...                    "C":[True,False,True,False],"D":[True,True,True,True]})
>>> df
       A      B      C     D
0   True   True   True  True
1   True   True  False  True
2   True   True   True  True
3  False  False  False  True
(4 rows x 4 columns)
>>> df.all(axis=0)
A    False
B    False
C    False
D     True
dtype: bool
>>> df.all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool
>>> df.all(axis=None)
False
- any(axis=0) arkouda.pandas.series.Series | bool[source]¶
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element along a Dataframe axis that is True.
Currently ignores any columns that are not of type bool. This is equivalent to the pandas option bool_only=True.
- Parameters:
axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –
Indicate which axis or axes should be reduced.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
- Return type:
- Raises:
ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...                    "C":[True,False,True,False],"D":[False,False,False,False]})
>>> df
       A      B      C      D
0   True   True   True  False
1   True   True  False  False
2   True   True   True  False
3  False  False  False  False
(4 rows x 4 columns)
>>> df.any(axis=0)
A     True
B     True
C     True
D    False
dtype: bool
>>> df.any(axis=1)
0     True
1     True
2     True
3    False
dtype: bool
>>> df.any(axis=None)
True
- append(other, ordered=True)[source]¶
Concatenate data from ‘other’ onto the end of this DataFrame, in place.
Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result.
- Parameters:
- Returns:
Appending occurs in-place, but result is returned for compatibility.
- Return type:
self
Examples
>>> import arkouda as ak
>>> df1 = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df1
   col1  col2
0     1     3
1     2     4
(2 rows x 2 columns)
>>> df2 = ak.DataFrame({'col1': [3], 'col2': [5]})
>>> df2
   col1  col2
0     3     5
(1 rows x 2 columns)
>>> df1.append(df2) col1 col2 0 1 3 1 2 4 2 3 5 (3 rows x 2 columns)
>>> df1 col1 col2 0 1 3 1 2 4 2 3 5 (3 rows x 2 columns)
- apply_permutation(perm)[source]¶
Apply a permutation to an entire DataFrame.
The operation is done in place and the original DataFrame will be modified.
This may be useful if you want to unsort a DataFrame, or to apply an arbitrary permutation such as the inverse of a sorting permutation.
- Parameters:
perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.
See also
sortExamples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6
(3 rows x 2 columns)
>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> df
   col1  col2
0     1     4
2     3     6
1     2     5
(3 rows x 2 columns)
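The row reordering above can be sketched in pure Python, independent of arkouda (this is an illustration of the semantics, not arkouda's distributed implementation): output position i takes the row at index perm[i].

```python
def apply_permutation(values, perm):
    """Reorder values so that output position i holds values[perm[i]]."""
    # perm must contain each index 0..len(values)-1 exactly once
    assert sorted(perm) == list(range(len(values)))
    return [values[p] for p in perm]

col1 = [1, 2, 3]
print(apply_permutation(col1, [0, 2, 1]))  # [1, 3, 2]
```

Applying the inverse permutation restores the original ordering, which is why this is useful for "unsorting".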
- argsort(key, ascending=True)[source]¶
Return the permutation that sorts the dataframe by key.
- Parameters:
- Returns:
The permutation array that sorts the data on key.
- Return type:
See also
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> df
   col1  col2
0   1.1     6
1   3.1     5
2   2.1     4
(3 rows x 2 columns)
>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> sorted_df1
   col1  col2
0   1.1     6
2   2.1     4
1   3.1     5
(3 rows x 2 columns)
>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> sorted_df2
   col1  col2
2   2.1     4
1   3.1     5
0   1.1     6
(3 rows x 2 columns)
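A minimal pure-Python sketch of what argsort computes (not arkouda's parallel sort): the permutation of row indices that would put the key column in sorted order, which can then index the whole frame.

```python
def argsort(key):
    """Return the index permutation that sorts the key column."""
    return sorted(range(len(key)), key=lambda i: key[i])

col1 = [1.1, 3.1, 2.1]
perm = argsort(col1)
print(perm)                     # [0, 2, 1]
print([col1[i] for i in perm])  # [1.1, 2.1, 3.1]
```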
- assign(**kwargs) DataFrame[source]¶
Assign new columns to a DataFrame.
Return a new object with all original columns in addition to the newly assigned ones. Existing columns that are reassigned will be overwritten.
- Parameters:
**kwargs (dict of {str: callable or Series}) –
The column names are the keyword arguments. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not modify the input DataFrame.
If the values are not callable (for example a Series, scalar, or array), they are directly assigned.
- Returns:
A new DataFrame with the new columns added alongside the existing columns.
- Return type:
Notes
Assigning multiple columns within the same assign call is supported. Later items in **kwargs may refer to newly created or modified columns in df. Items are computed and assigned to df in order.
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame(
...     {"temp_c": [17.0, 25.0]},
...     index=["Portland", "Berkeley"],
... )
>>> df
          temp_c
Portland    17.0
Berkeley    25.0
(2 rows x 1 columns)
When the value is a callable, it is evaluated on df:
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
(2 rows x 2 columns)
The same behavior can be achieved by referencing an existing column directly:
>>> df.assign(temp_f=df["temp_c"] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
(2 rows x 2 columns)
Multiple columns can be created in a single call where one depends on another defined within the same assign:
>>> df.assign(
...     temp_f=lambda x: x["temp_c"] * 9 / 5 + 32,
...     temp_k=lambda x: (x["temp_f"] + 459.67) * 5 / 9,
... )
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
(2 rows x 3 columns)
- coargsort(keys, ascending=True)[source]¶
Return the permutation that sorts the dataframe by keys.
Note: Sorting using Strings may not yield correct sort order.
- Parameters:
keys (list of str) – The keys to sort on.
- Returns:
The permutation array that sorts the data on keys.
- Return type:
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3': [5, 6, 7]})
>>> df
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7
(3 rows x 3 columns)
>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
- property columns¶
An Index where the values are the column names of the dataframe.
- Returns:
The values of the index are the column names of the dataframe.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
(2 rows x 2 columns)
>>> df.columns
Index(['col1', 'col2'], dtype='<U0')
- copy(deep=True)[source]¶
Make a copy of this object’s data.
When deep = True (default), a new object will be created with a copy of the calling object’s data. Modifications to the data of the copy will not be reflected in the original object.
When deep = False a new object will be created without copying the calling object’s data. Any changes to the data of the original object will be reflected in the shallow copy, and vice versa.
- Parameters:
deep (bool, default=True) – When True, return a deep copy. Otherwise, return a shallow copy.
- Returns:
A deep or shallow copy according to caller specification.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
(2 rows x 2 columns)
>>> df_deep = df.copy(deep=True)
>>> df_deep['col1'] += 1
>>> df
   col1  col2
0     1     3
1     2     4
(2 rows x 2 columns)
>>> df_shallow = df.copy(deep=False)
>>> df_shallow['col1'] += 1
>>> df
   col1  col2
0     2     3
1     3     4
(2 rows x 2 columns)
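The deep-versus-shallow distinction is the same one plain Python containers exhibit; this hedged sketch uses lists in place of server-side pdarrays:

```python
data = {"col1": [1, 2], "col2": [3, 4]}

shallow = dict(data)                           # new dict, same inner lists
deep = {k: list(v) for k, v in data.items()}   # new dict, new lists

shallow["col1"][0] += 1    # mutates the original's list too
deep["col2"][0] += 100     # original is unaffected

print(data)  # {'col1': [2, 2], 'col2': [3, 4]}
```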
- corr() DataFrame[source]¶
Return new DataFrame with pairwise correlation of columns.
- Returns:
Arkouda DataFrame containing correlation matrix of all columns.
- Return type:
- Raises:
RuntimeError – Raised if there’s a server-side error thrown.
See also
pdarray.corrNotes
Generate the correlation matrix using Pearson R for all columns.
Attempts to convert to numeric values where possible for inclusion in the matrix.
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [-1, -2]})
>>> df
   col1  col2
0     1    -1
1     2    -2
(2 rows x 2 columns)
>>> corr = df.corr()
>>> corr
      col1  col2
col1   1.0  -1.0
col2  -1.0   1.0
(2 rows x 2 columns)
- count(axis: int | str = 0, numeric_only=False) arkouda.pandas.series.Series[source]¶
Count non-NA cells for each column or row.
The values np.NaN are considered NA.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
numeric_only (bool = False) – Include only float, int or boolean data.
- Returns:
For each column/row the number of non-NA/null entries.
- Return type:
- Raises:
ValueError – Raised if axis is not 0, 1, ‘index’, or ‘columns’.
See also
GroupBy.countExamples
>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame({'col_A': ak.array([7, np.nan]), 'col_B': ak.array([1, 9])})
>>> df
   col_A  col_B
0    7.0      1
1    NaN      9
(2 rows x 2 columns)
>>> df.count()
col_A    1
col_B    2
dtype: int64
>>> df = ak.DataFrame({'col_A': ak.array(["a","b","c"]), 'col_B': ak.array([1, np.nan, np.nan])})
>>> df
  col_A  col_B
0     a    1.0
1     b    NaN
2     c    NaN
(3 rows x 2 columns)
>>> df.count()
col_A    3
col_B    1
dtype: int64
>>> df.count(numeric_only=True)
col_B    1
dtype: int64
>>> df.count(axis=1)
0    2
1    1
2    1
dtype: int64
- drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) None | DataFrame[source]¶
Drop rows or columns from the DataFrame.
- Parameters:
keys (Union[str, int, List[Union[str, int]]]) – Label or list of labels to drop along the specified axis.
axis (Union[str, int], default=0) –
Axis along which to drop.
0 or "index" — drop rows
1 or "columns" — drop columns
inplace (bool, default=False) – If True, perform the operation on the calling object. If False, return a new object.
- Returns:
DataFrame when inplace=False; otherwise None.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"col1": [1, 2], "col2": [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
(2 rows x 2 columns)
Drop a column:
>>> df.drop("col1", axis=1)
   col2
0     3
1     4
(2 rows x 1 columns)
Drop a row:
>>> df.drop(0, axis=0)
   col1  col2
1     2     4
(1 rows x 2 columns)
- drop_duplicates(subset=None, keep='first')[source]¶
Drop duplicated rows and return the resulting DataFrame.
If a subset of the columns are provided then only one instance of each duplicated row will be returned (keep determines which row).
- Parameters:
subset (Iterable) – Iterable of column names to use to dedupe.
keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.
- Returns:
DataFrame with duplicates removed.
- Return type:
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     2     5
3     3     6
(4 rows x 2 columns)
>>> df.drop_duplicates()
   col1  col2
0     1     4
1     2     5
3     3     6
(3 rows x 2 columns)
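The effect of keep='first' versus keep='last' is easiest to see on the surviving row indices. This pure-Python sketch (kept_indices is an illustrative helper, not an arkouda API) mirrors the example above:

```python
def kept_indices(rows, keep="first"):
    """Return the original row indices that survive deduplication."""
    seen = {}
    order = range(len(rows)) if keep == "first" else range(len(rows) - 1, -1, -1)
    for i in order:
        # record only the first instance encountered in traversal order
        seen.setdefault(tuple(rows[i]), i)
    return sorted(seen.values())

rows = [[1, 4], [2, 5], [2, 5], [3, 6]]
print(kept_indices(rows, keep="first"))  # [0, 1, 3]
print(kept_indices(rows, keep="last"))   # [0, 2, 3]
```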
- dropna(axis: int | str = 0, how: str | None = None, thresh: int | None = None, ignore_index: bool = False) DataFrame[source]¶
Remove missing values.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default = 0) –
Determine if rows or columns which contain missing values are removed.
0, or ‘index’: Drop rows which contain missing values.
1, or ‘columns’: Drop columns which contain missing values.
Only a single axis is allowed.
how ({'any', 'all'}, default='any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
’any’: If any NA values are present, drop that row or column.
’all’: If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
- Returns:
DataFrame with NA entries dropped from it.
- Return type:
Examples
>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame(
...     {
...         "A": [True, True, True, True],
...         "B": [1, np.nan, 2, np.nan],
...         "C": [1, 2, 3, np.nan],
...         "D": [False, False, False, False],
...         "E": [1, 2, 3, 4],
...         "F": ["a", "b", "c", "d"],
...         "G": [1, 2, 3, 4],
...     }
... )
>>> df
      A    B    C      D  E  F  G
0  True  1.0  1.0  False  1  a  1
1  True  NaN  2.0  False  2  b  2
2  True  2.0  3.0  False  3  c  3
3  True  NaN  NaN  False  4  d  4
(4 rows x 7 columns)
>>> df.dropna()
      A    B    C      D  E  F  G
0  True  1.0  1.0  False  1  a  1
2  True  2.0  3.0  False  3  c  3
(2 rows x 7 columns)
>>> df.dropna(axis=1)
      A      D  E  F  G
0  True  False  1  a  1
1  True  False  2  b  2
2  True  False  3  c  3
3  True  False  4  d  4
(4 rows x 5 columns)
>>> df.dropna(axis=1, thresh=3)
      A    C      D  E  F  G
0  True  1.0  False  1  a  1
1  True  2.0  False  2  b  2
2  True  3.0  False  3  c  3
3  True  NaN  False  4  d  4
(4 rows x 6 columns)
>>> df.dropna(axis=1, how="all")
      A    B    C      D  E  F  G
0  True  1.0  1.0  False  1  a  1
1  True  NaN  2.0  False  2  b  2
2  True  2.0  3.0  False  3  c  3
3  True  NaN  NaN  False  4  d  4
(4 rows x 7 columns)
- property dtypes¶
The dtypes of the dataframe.
- Returns:
dtypes – The dtypes of the dataframe.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df
   col1 col2
0     1    a
1     2    b
(2 rows x 2 columns)
>>> df.dtypes
{'col1': 'int64', 'col2': 'str'}
- property empty¶
Whether the dataframe is empty.
- Returns:
True if the dataframe is empty, otherwise False.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({})
>>> df
Empty DataFrame
Columns: []
Index: []
(None rows x 0 columns)
>>> df.empty
True
- filter_by_range(keys, low=1, high=None)[source]¶
Filter rows by the size of groups defined on one or more columns.
Group the DataFrame by the specified keys, compute the count of each group, and return a boolean mask indicating which rows belong to groups whose sizes fall within the inclusive range [low, high].
- Parameters:
- Returns:
A boolean mask array of length equal to the number of rows in the DataFrame, where True indicates the row’s group size is between low and high.
- Return type:
- Raises:
ValueError – If low is negative, or if high is not None and high < low.
TypeError – If keys is not a string or list of strings.
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 2, 2, 3, 3], 'col2': [4, 5, 6, 7, 8, 9]})
>>> df
   col1  col2
0     1     4
1     2     5
2     2     6
3     2     7
4     3     8
5     3     9
(6 rows x 2 columns)
>>> df.filter_by_range("col1", low=1, high=2)
array([True False False False True True])
>>> filtered_df = df[df.filter_by_range("col1", low=1, high=2)]
>>> filtered_df
   col1  col2
0     1     4
4     3     8
5     3     9
(3 rows x 2 columns)
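The group-size mask can be sketched in pure Python (an illustration of the semantics, not arkouda's GroupBy-based implementation): count how often each key occurs, then mark each row by whether its key's count falls in [low, high].

```python
from collections import Counter

def filter_by_range(keys, low=1, high=None):
    """Boolean mask: True where the row's group size is in [low, high]."""
    counts = Counter(keys)
    return [low <= counts[k] and (high is None or counts[k] <= high)
            for k in keys]

col1 = [1, 2, 2, 2, 3, 3]
print(filter_by_range(col1, low=1, high=2))
# [True, False, False, False, True, True]
```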
- classmethod from_pandas(pd_df)[source]¶
Copy the data from a pandas DataFrame into a new arkouda.pandas.dataframe.DataFrame.
- Parameters:
pd_df (pandas.DataFrame) – A pandas DataFrame to convert.
- Return type:
Examples
>>> import arkouda as ak
>>> import pandas as pd
>>> pd_df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> type(pd_df)
<class 'pandas....DataFrame'>
>>> pd_df
   A  B
0  1  3
1  2  4
>>> ak_df = DataFrame.from_pandas(pd_df)
>>> type(ak_df)
<class 'arkouda....DataFrame'>
>>> ak_df
   A  B
0  1  3
1  2  4
(2 rows x 2 columns)
- classmethod from_return_msg(rep_msg)[source]¶
Create a DataFrame object from an arkouda server response message.
- Parameters:
rep_msg (string) – Server response message used to create a DataFrame.
- Return type:
- groupby(keys, use_series=True, as_index=True, dropna=True)[source]¶
Group the dataframe by a column or a list of columns.
Alias for GroupBy.
- Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=True) – If True, returns an arkouda.pandas.dataframe.DataFrameGroupBy object. Otherwise an arkouda.pandas.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.
- Returns:
If use_series = True, returns an arkouda.pandas.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.pandas.groupbyclass.GroupBy object.
- Return type:
arkouda.pandas.dataframe.DataFrameGroupBy or arkouda.pandas.groupbyclass.GroupBy
See also
arkouda.GroupByExamples
>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df
   col1  col2
0   1.0     4
1   1.0     5
2   2.0     6
3   NaN     7
(4 rows x 2 columns)
>>> df.groupby("col1")
<arkouda.pandas.groupbyclass.GroupBy object at 0x795584773f00>
>>> df.groupby("col1").size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.groupby("col1", use_series=True).size()
col1
1.0    2
2.0    1
dtype: int64
>>> df.groupby("col1", use_series=True, as_index=False).size()
   col1  size
0   1.0     2
1   2.0     1
(2 rows x 2 columns)
- head(n=5)[source]¶
Return the first n rows.
This function returns the first n rows of the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.
- Parameters:
n (int, default = 5) – Number of rows to select.
- Returns:
The first n rows of the DataFrame.
- Return type:
See also
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9
(10 rows x 2 columns)
>>> df.head()
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
(5 rows x 2 columns)
>>> df.head(n=2)
   col1  col2
0     0     0
1     1    -1
(2 rows x 2 columns)
- property index¶
The index of the dataframe.
- Returns:
The index of the dataframe.
- Return type:
arkouda.pandas.index.Index or arkouda.pandas.index.MultiIndex
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
(2 rows x 2 columns)
>>> df.index
Index(array([0 1]), dtype='int64')
- property info¶
Return a summary string of this dataframe.
- Returns:
A summary string of this dataframe.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df
   col1 col2
0     1    a
1     2    b
(2 rows x 2 columns)
>>> df.info
"DataFrame(['col1', 'col2'], 2 rows, 36.00 B)"
- is_registered() bool[source]¶
Return True if the object is contained in the registry.
- Returns:
Indicates if the object is contained in the registry.
- Return type:
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components.
See also
register,unregister,unregister_dataframe_by_nameNotes
Objects registered with the server are immune to deletion until they are unregistered.
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
   col1  col2
0     1     4
1     2     5
2     3     6
(3 rows x 2 columns)
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
- isin(values: arkouda.numpy.pdarrayclass.pdarray | Dict | arkouda.pandas.series.Series | DataFrame) DataFrame[source]¶
Determine whether each element in the DataFrame is contained in values.
- Parameters:
values (Union[pdarray, Dict, Series, DataFrame]) – The values to check for in the DataFrame. A Series must have a single index.
- Returns:
Arkouda DataFrame of booleans indicating whether each element in the DataFrame is contained in values.
- Return type:
See also
ak.Series.isin
Notes
Pandas supports values being any iterable type. In Arkouda, this is replaced with pdarray.
Pandas supports ~ operations. Currently, ak.DataFrame does not.
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"col_A": ak.array([7, 3]), "col_B": ak.array([1, 9])})
>>> df
   col_A  col_B
0      7      1
1      3      9
(2 rows x 2 columns)
When values is a pdarray, every value in the DataFrame is checked to determine whether it exists in values.
>>> df.isin(ak.array([0, 1]))
   col_A  col_B
0  False   True
1  False  False
(2 rows x 2 columns)
When values is a dict, the dictionary values are used to check the column indicated by each key.
>>> df.isin({"col_A": ak.array([0, 3])})
   col_A  col_B
0  False  False
1   True  False
(2 rows x 2 columns)
When values is a Series, each column is checked positionally. For True to be returned, the indexes must match.
>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)
   col_A  col_B
0  False  False
1  False   True
(2 rows x 2 columns)
When values is a DataFrame, the index and columns must match. Note that 9 is not found because the column name differs.
>>> other_df = ak.DataFrame({"col_A": ak.array([7, 3]), "col_C": ak.array([0, 9])})
>>> df.isin(other_df)
   col_A  col_B
0   True  False
1   True  False
(2 rows x 2 columns)
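The pdarray case above can be sketched in pure Python (an illustrative helper, not an arkouda API): each column is checked element-wise for membership in the value set.

```python
def isin(columns, values):
    """Element-wise membership test for each column against a value set."""
    vset = set(values)
    return {name: [v in vset for v in col] for name, col in columns.items()}

df = {"col_A": [7, 3], "col_B": [1, 9]}
print(isin(df, [0, 1]))
# {'col_A': [False, False], 'col_B': [True, False]}
```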
- isna() DataFrame[source]¶
Detect missing values.
Return a boolean object of the same size indicating whether each value is missing.
numpy.nan values are mapped to True. All other values are mapped to False.
- Returns:
Boolean mask for each element in the DataFrame indicating whether the value is NA.
- Return type:
Examples
>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame(
...     {
...         "A": [np.nan, 2, 2, 3],
...         "B": [3, np.nan, 5, 6],
...         "C": [1, np.nan, 2, np.nan],
...         "D": ["a", "b", "c", "d"],
...     }
... )
>>> df
     A    B    C  D
0  NaN  3.0  1.0  a
1  2.0  NaN  NaN  b
2  2.0  5.0  2.0  c
3  3.0  6.0  NaN  d
(4 rows x 4 columns)
>>> df.isna()
       A      B      C      D
0   True  False  False  False
1  False   True   True  False
2  False  False  False  False
3  False  False   True  False
(4 rows x 4 columns)
- classmethod load(prefix_path, file_format='INFER')[source]¶
Load a DataFrame from a file.
The file_format parameter is included for consistency with other Arkouda load functions.
- Parameters:
- Returns:
A DataFrame loaded from prefix_path.
- Return type:
Examples
>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
To store data in <my_dir>/my_data_LOCALE0000, use "<my_dir>/my_data" as the prefix.
>>> my_path = os.path.join(os.getcwd(), "hdf5_output", "my_data")
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.to_parquet(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   B  A
0  0  0
1 -1  1
2 -2  2
3 -3  3
4 -4  4
(5 rows x 2 columns)
- memory_usage(index=True, unit='B') arkouda.pandas.series.Series[source]¶
Return the memory usage of each column.
The memory usage can optionally include the contribution of the index.
- Parameters:
- Returns:
A Series whose index contains the original column names and whose values represent the memory usage of each column in the specified unit.
- Return type:
See also
arkouda.numpy.pdarrayclass.nbytes,arkouda.pandas.index.Index.memory_usage,arkouda.pandas.index.MultiIndex.memory_usage,arkouda.pandas.series.Series.memory_usageExamples
>>> import arkouda as ak
>>> dtypes = {"int64": ak.int64, "float64": ak.float64, "bool": ak.bool_}
>>> data = {t: ak.ones(5000, dtype=dtypes[t]) for t in dtypes}
>>> df = ak.DataFrame(data)
>>> df.head()
   int64  float64  bool
0      1      1.0  True
1      1      1.0  True
2      1      1.0  True
3      1      1.0  True
4      1      1.0  True
(5 rows x 3 columns)
>>> df.memory_usage()
Index      40000
int64      40000
float64    40000
bool        5000
dtype: int64
>>> df.memory_usage(index=False)
int64      40000
float64    40000
bool        5000
dtype: int64
>>> df.memory_usage(unit="KB")
Index      39.062500
int64      39.062500
float64    39.062500
bool        4.882812
dtype: float64
To get the approximate total memory usage:
>>> df.memory_usage(index=True).sum()
np.int64(125000)
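The figures above follow from per-element sizes, assuming the usual widths: int64 and float64 take 8 bytes per element, bool takes 1, and the int64 index adds another 8 bytes per row.

```python
n = 5000
int64_bytes = 8 * n      # 40000
float64_bytes = 8 * n    # 40000
bool_bytes = 1 * n       # 5000
index_bytes = 8 * n      # 40000

print(int64_bytes + float64_bytes + bool_bytes + index_bytes)  # 125000
print(int64_bytes / 1024)  # 39.0625, matching the unit="KB" output
```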
- memory_usage_info(unit='GB')[source]¶
Return a formatted string representation of the size of this DataFrame.
- Parameters:
unit (str, default = "GB") – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.
- Returns:
A string representation of the number of bytes used by this DataFrame in [unit]s.
- Return type:
Examples
>>> import arkouda as ak >>> df = ak.DataFrame({'col1': ak.arange(1000), 'col2': ak.arange(1000)}) >>> df.memory_usage_info() '0.00 GB'
>>> df.memory_usage_info(unit="KB") '23.44 KB'
- merge(right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]¶
Merge Arkouda DataFrames with a database-style join.
The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).
Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
- Parameters:
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.
how ({"inner", "left", "right"}, default = "inner") – The merge condition. Must be "inner", "left", or "right".
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.
- Returns:
Joined Arkouda DataFrame.
- Return type:
Note
Multiple column joins are only supported for integer columns.
Examples
>>> import arkouda as ak
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> left_df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
(5 rows x 2 columns)
>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> right_df
   col1  col2
0     0     0
1     2     2
2     4     4
3     6     6
4     8     8
(5 rows x 2 columns)
>>> left_df.merge(right_df, on = "col1")
   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4
(3 rows x 3 columns)
>>> left_df.merge(right_df, on = "col1", how = "left")
   col1  col2_x  col2_y
0     0       0     0.0
1     1      -1     NaN
2     2      -2     2.0
3     3      -3     NaN
4     4      -4     4.0
(5 rows x 3 columns)
>>> left_df.merge(right_df, on = "col1", how = "right")
   col1  col2_x  col2_y
0     0     0.0       0
1     2    -2.0       2
2     4    -4.0       4
3     6     NaN       6
4     8     NaN       8
(5 rows x 3 columns)
>>> left_df.merge(right_df, on = "col1", how = "outer")
   col1  col2_x  col2_y
0     0     0.0     0.0
1     1    -1.0     NaN
2     2    -2.0     2.0
3     3    -3.0     NaN
4     4    -4.0     4.0
5     6     NaN     6.0
6     8     NaN     8.0
(7 rows x 3 columns)
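The inner and left join conditions can be sketched in pure Python on a single integer key, mirroring the example above (an illustration of the semantics only; arkouda performs the join server-side, and real missing values become NaN rather than None):

```python
def merge(left, right, how="inner"):
    """Join (key, value) pairs from left against right on the key."""
    right_map = dict(right)              # key -> right value
    out = []
    for k, lv in left:
        if k in right_map:
            out.append((k, lv, right_map[k]))
        elif how == "left":
            out.append((k, lv, None))    # stands in for NaN
    return out

left = [(0, 0), (1, -1), (2, -2), (3, -3), (4, -4)]
right = [(0, 0), (2, 2), (4, 4), (6, 6), (8, 8)]
print(merge(left, right))          # [(0, 0, 0), (2, -2, 2), (4, -4, 4)]
print(merge(left, right, "left"))
# [(0, 0, 0), (1, -1, None), (2, -2, 2), (3, -3, None), (4, -4, 4)]
```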
- notna() DataFrame[source]¶
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. numpy.NaN values get mapped to False values.
- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type:
Examples
>>> import arkouda as ak
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...                    "C": [1, np.nan, 2, np.nan], "D": ["a", "b", "c", "d"]})
>>> df
     A    B    C  D
0  NaN  3.0  1.0  a
1  2.0  NaN  NaN  b
2  2.0  5.0  2.0  c
3  3.0  6.0  NaN  d
(4 rows x 4 columns)
>>> df.notna()
       A      B      C     D
0  False   True   True  True
1   True  False  False  True
2   True   True   True  True
3   True   True  False  True
(4 rows x 4 columns)
- objType = 'DataFrame'¶
- classmethod read_csv(filename: str, col_delim: str = ',')[source]¶
Read the columns of a CSV file into an Arkouda DataFrame.
If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as Strings objects.
- Parameters:
- Returns:
Arkouda DataFrame containing the columns from the CSV file.
- Return type:
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
See also
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing bytes as uint(8).
Examples
>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output', 'my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_csv(my_path)
>>> df2 = ak.DataFrame.read_csv(my_path + "_LOCALE0000")
>>> df2
   A  B
0  1  3
1  2  4
(2 rows x 2 columns)
- register(user_defined_name: str) DataFrame[source]¶
Register this DataFrame object and underlying components with the Arkouda server.
- Parameters:
user_defined_name (str) – User defined name the DataFrame is to be registered under. This will be the root name for underlying components.
- Returns:
The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str.
RegistrationError – If the server was unable to register the DataFrame with the user_defined_name.
See also
unregister,unregister_dataframe_by_name,is_registeredNotes
Objects registered with the server are immune to deletion until they are unregistered.
Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
   col1  col2
0     1     4
1     2     5
2     3     6
(3 rows x 2 columns)
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
- registered_name = None¶
- rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) DataFrame | None[source]¶
Rename indexes or columns according to a mapping.
- Parameters:
mapper (Optional[Union[Callable, Dict]]) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. The value of axis determines whether the mapping is applied to the index or columns.
index (Optional[Union[Callable, Dict]]) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.
column (Optional[Union[Callable, Dict]]) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.
axis (Union[str, int], default=0) –
Axis to perform the rename operation on.
0 or "index" — rename index values
1 or "column" — rename column names
inplace (bool, default=False) – If True, perform the operation on the calling object. If False, return a new object.
- Returns:
DataFrame when inplace=False; otherwise None.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> df
   A  B
0  1  4
1  2  5
2  3  6
(3 rows x 2 columns)
Rename columns using a mapping:
>>> df.rename(column={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
(3 rows x 2 columns)
Rename indexes using a mapping:
>>> df.rename(index={0: 99, 2: 11})
    A  B
99  1  4
1   2  5
11  3  6
(3 rows x 2 columns)
Rename using the axis-style parameter:
>>> df.rename(str.lower, axis="column")
   a  b
0  1  4
1  2  5
2  3  6
(3 rows x 2 columns)
- reset_index(size: int | None = None, inplace: bool = False) None | DataFrame[source]¶
Set the index to an integer range.
Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.
- Parameters:
size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.
inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.
- Returns:
DataFrame when inplace=False; None when inplace=True.
- Return type:
DataFrame or None
Note
Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> df
   A  B
0  1  4
1  2  5
2  3  6
(3 rows x 2 columns)
>>> perm_df = df[ak.array([0, 2, 1])]
>>> perm_df
   A  B
0  1  4
2  3  6
1  2  5
(3 rows x 2 columns)
>>> perm_df.reset_index()
   A  B
0  1  4
1  3  6
2  2  5
(3 rows x 2 columns)
- sample(n=5) DataFrame[source]¶
Return a random sample of n rows.
- Parameters:
n (int, default=5) – Number of rows to return.
- Returns:
A DataFrame containing n randomly sampled rows.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df
   A  B
0  0  0
1  1 -1
2  2 -2
3  3 -3
4  4 -4
(5 rows x 2 columns)
Random output of size 3:
>>> df.sample(n=3)
   A  B
4  4 -4
3  3 -3
1  1 -1
(3 rows x 2 columns)
- property shape¶
The shape of the dataframe.
- Returns:
Tuple of array dimensions.
- Return type:
tuple of int
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6
(3 rows x 2 columns)
>>> df.shape
(3, 2)
- property size¶
Return the number of bytes on the arkouda server.
- Returns:
The number of bytes on the arkouda server.
- Return type:
int
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6
(3 rows x 2 columns)
>>> df.size
6
- sort_index(ascending=True)[source]¶
Sort the DataFrame by indexed columns.
- Note: Fails on sort order of arkouda.numpy.strings.Strings columns when multiple columns are being sorted.
- Parameters:
ascending (bool, default = True) – Sort values in ascending (default) or descending order.
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]},
...                   index=ak.Index(ak.array([2, 0, 1]), name="idx"))
>>> df
     col1  col2
idx
2     1.1     6
0     3.1     5
1     2.1     4
(3 rows x 2 columns)
>>> df.sort_index()
     col1  col2
idx
0     3.1     5
1     2.1     4
2     1.1     6
(3 rows x 2 columns)
- sort_values(by=None, ascending=True)[source]¶
Sort the DataFrame by one or more columns.
If no column is specified, all columns are used.
Note: Fails on the order of arkouda.numpy.strings.Strings columns when multiple columns are being sorted.
- Parameters:
by (str or list of str, default=None) – The name(s) of the column(s) to sort by. If None, all columns are used.
ascending (bool, default=True) – Sort values in ascending (default) or descending order.
See also
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3': [5, 6, 7]})
>>> df
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7
(3 rows x 3 columns)

>>> df.sort_values()
   col1  col2  col3
2     1     3     7
0     2     3     5
1     2     4     6
(3 rows x 3 columns)

>>> df.sort_values("col3")
   col1  col2  col3
0     2     3     5
1     2     4     6
2     1     3     7
(3 rows x 3 columns)
- tail(n=5)[source]¶
Return the last n rows.
This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.
- Parameters:
n (int, default=5) – Number of rows to select.
- Returns:
The last n rows of the DataFrame.
- Return type:
See also
arkouda.pandas.dataframe.head
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9
(10 rows x 2 columns)

>>> df.tail()
   col1  col2
5     5    -5
6     6    -6
7     7    -7
8     8    -8
9     9    -9
(5 rows x 2 columns)

>>> df.tail(n=2)
   col1  col2
8     8    -8
9     9    -9
(2 rows x 2 columns)
- to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]¶
Write DataFrame to CSV file(s).
File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
index (bool, default=False) – If True, the index of the DataFrame will be written to the file as a column.
columns (list of str (Optional)) – Column names to assign when writing data.
col_delim (str, default=",") – Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Raises:
ValueError – Raised if all datasets are not present in all CSV files or if one or more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
Examples
>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)

>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_csv(my_path)
>>> df2 = ak.DataFrame.read_csv(my_path + "_LOCALE0000")
>>> df2
   A  B
0  1  3
1  2  4
(2 rows x 2 columns)
- to_hdf(path, index=False, columns=None, file_type: Literal['single', 'distribute'] = 'distribute')[source]¶
Save DataFrame to disk as hdf5, preserving column names.
- Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (List, default = None) – List of columns to include in the file. If None, writes out all columns.
file_type ({"single", "distribute"}) – Whether to save to a single file or distribute across Locales. Default is “distribute”.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
See also
Examples
>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)

>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   A  B
0  1  3
1  2  4
(2 rows x 2 columns)
- to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]¶
Print the DataFrame in a Markdown-friendly format.
- Parameters:
mode (str, optional) – Mode in which the file is opened, by default "wt".
index (bool, optional, default=True) – Whether to include index (row) labels.
tablefmt (str, default="grid") – Table format passed to tabulate. See https://pypi.org/project/tabulate/ for available formats.
storage_options (dict, optional) – Extra options for a particular storage connection (for example host, port, username, password) when using a URL handled by fsspec, such as "s3://" or "gcs://". An error will be raised if this argument is provided with a non-fsspec URL. See the fsspec and backend storage implementation documentation for the set of allowed keys and values.
**kwargs – Additional keyword arguments passed to tabulate.
Notes
This function should only be used with small DataFrames because it calls pandas.DataFrame.to_markdown internally: https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.to_markdown.html
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
>>> print(df.to_markdown())
+----+------------+------------+
|    | animal_1   | animal_2   |
+====+============+============+
|  0 | elk        | dog        |
+----+------------+------------+
|  1 | pig        | quetzal    |
+----+------------+------------+
Suppress the index:
>>> print(df.to_markdown(index=False))
+------------+------------+
| animal_1   | animal_2   |
+============+============+
| elk        | dog        |
+------------+------------+
| pig        | quetzal    |
+------------+------------+
- to_pandas(datalimit=maxTransferBytes, retain_index=False)[source]¶
Send this DataFrame to a pandas DataFrame.
- Parameters:
datalimit (int, default=arkouda.core.client.maxTransferBytes) – The maximum size, in megabytes, of data to transfer. The requested DataFrame will be converted to a pandas DataFrame only if its estimated size does not exceed this value.
retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.
- Returns:
The result of converting this DataFrame to a pandas DataFrame.
- Return type:
pandas.DataFrame
Examples
>>> import arkouda as ak
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
<class 'arkouda...DataFrame'>
>>> ak_df
   A  B
0  0  0
1  1 -1
(2 rows x 2 columns)

>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
<class 'pandas...DataFrame'>
>>> pd_df
   A  B
0  0  0
1  1 -1
- to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)[source]¶
Save DataFrame to disk as parquet, preserving column names.
- Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (list) – List of columns to include in the file. If None, writes out all columns.
compression (str (Optional), default=None) – Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.
Examples
>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'parquet_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)

>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_parquet(my_path + "/my_data")

>>> df.load(my_path + "/my_data")
   B  A
0  3  1
1  4  2
(2 rows x 2 columns)
- transfer(hostname, port)[source]¶
Send a DataFrame to a different Arkouda server.
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.
port (int_scalars) – The port over which to send the array. This needs to be an open port (i.e., not one that the Arkouda server is running on). numLocales ports will be opened in succession, so ports in the range {port..(port+numLocales)} will be used (e.g., when running an Arkouda server on 4 nodes with port=1234, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Returns:
A message indicating a complete transfer.
- Return type:
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
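As a plain-Python illustration of the port usage described above (a hypothetical helper, not part of the Arkouda API; no server is involved), the set of ports opened for a transfer can be computed from the starting port and the number of locales:

```python
def transfer_ports(port, num_locales):
    """Ports used to send array data: one per locale, starting at `port`."""
    return [port + i for i in range(num_locales)]

# A 4-node Arkouda server with port=1234 sends over ports 1234-1237.
print(transfer_ports(1234, 4))  # [1234, 1235, 1236, 1237]
```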
- unregister()[source]¶
Unregister this DataFrame object in the arkouda server.
Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach().
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister.
See also
register, unregister_dataframe_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
Example
>>> import arkouda as ak
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
   col1  col2
0     1     4
1     2     5
2     3     6
(3 rows x 2 columns)

>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
- update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)[source]¶
Overwrite the dataset stored at the provided prefix path with this DataFrame.
If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.
repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Returns:
Success message if successful.
- Return type:
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
Examples
>>> import arkouda as ak
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)

>>> df = ak.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   A  B
0  1  3
1  2  4
(2 rows x 2 columns)

>>> df2 = ak.DataFrame({"A": [5, 6], "B": [7, 8]})
>>> df2.update_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")
   A  B
0  5  7
1  6  8
(2 rows x 2 columns)
- class arkouda.DataFrameGroupBy(gb, df, gb_key_names=None, as_index=True)[source]¶
A DataFrame that has been grouped by a subset of columns.
- Parameters:
- gb¶
GroupBy object, where the aggregation keys are values of column(s) of a dataframe, usually in preparation for aggregating with respect to the other columns.
- Type:
- gb_key_names¶
The column name(s) associated with the aggregated columns.
- as_index¶
If True the grouped values of the aggregation keys will be treated as an index. Defaults to True.
- Type:
- all_non_nan = False¶
- broadcast(x, permute=True)[source]¶
Fill each group’s segment with a constant value.
- Parameters:
x (pdarray or Series) – The values to broadcast, one value per group.
permute (bool, default=True) – If True, permute the broadcast values back into the row order of the original DataFrame.
- Returns:
A Series with the Index of the original frame and the values of the broadcast.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda.pandas.dataframe import DataFrameGroupBy
>>> df = ak.DataFrame({"A": [1, 2, 2, 3], "B": [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  2  5
3  3  6
(4 rows x 2 columns)

>>> gb = df.groupby("A")
>>> x = ak.array([10, 11, 12])
>>> s = DataFrameGroupBy.broadcast(gb, x)
>>> df["C"] = s.values
>>> df
   A  B   C
0  1  3  10
1  2  4  11
2  2  5  11
3  3  6  12
(4 rows x 3 columns)
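The broadcast in the example above can be mirrored locally in plain Python (a hypothetical sketch, not the Arkouda implementation): each row receives the constant assigned to its group, with one value per group key in sorted order.

```python
def broadcast_sketch(keys, group_values):
    """Fill each row with the constant value assigned to its group."""
    unique_keys = sorted(set(keys))  # group keys in sorted order
    value_for = dict(zip(unique_keys, group_values))
    return [value_for[k] for k in keys]

# Mirrors the example: groups 1, 2, 3 receive 10, 11, 12 respectively.
print(broadcast_sketch([1, 2, 2, 3], [10, 11, 12]))  # [10, 11, 11, 12]
```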
- diff(colname)[source]¶
Create a difference aggregate for the given column.
For each group, the difference between successive values is calculated. Aggregate operations (mean, min, max, std, var) can be done on the results.
- Parameters:
colname (str) – Name of the column to compute the difference on.
- Returns:
Object containing the differences, which can be aggregated.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"A": [1, 2, 2, 2, 3, 3], "B": [3, 9, 11, 27, 86, 100]})
>>> df
   A    B
0  1    3
1  2    9
2  2   11
3  2   27
4  3   86
5  3  100
(6 rows x 2 columns)

>>> gb = df.groupby("A")
>>> gb.diff("B").values
array([nan nan 2.00000000000000000 16.00000000000000000 nan 14.00000000000000000])
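The group-wise differencing above can be sketched in plain Python (a hypothetical stand-in for the server-side operation): the first value in each group has no predecessor and becomes NaN, and each later value is differenced against the previous value seen in the same group.

```python
import math

def groupwise_diff(keys, values):
    """Successive differences within each group; group-leading values become NaN."""
    last = {}   # most recent value seen for each group key
    out = []
    for k, v in zip(keys, values):
        out.append(v - last[k] if k in last else math.nan)
        last[k] = v
    return out

# Mirrors the example above: [nan, nan, 2, 16, nan, 14]
diffs = groupwise_diff([1, 2, 2, 2, 3, 3], [3, 9, 11, 27, 86, 100])
```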
- dropna¶
- head(n: int = 5, sort_index: bool = True) DataFrame[source]¶
Return the first n rows from each group.
- Parameters:
n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the values from that group will be returned.
sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"a": ak.arange(10) % 3, "b": ak.arange(10)})
>>> df
   a  b
0  0  0
1  1  1
2  2  2
3  0  3
4  1  4
5  2  5
6  0  6
7  1  7
8  2  8
9  0  9
(10 rows x 2 columns)

>>> df.groupby("a").head(2)
   a  b
0  0  0
1  1  1
2  2  2
3  0  3
4  1  4
5  2  5
(6 rows x 2 columns)
- sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]¶
Return a random sample from each group.
You can specify either the number of elements to sample or the fraction of elements to sample.
random_state can be used for reproducibility.
- Parameters:
n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Defaults to 1 if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default=False) – Whether to allow sampling of the same row more than once.
weights (pdarray, optional) – If None, all rows are given equal probability. If a pdarray is provided, it must have the same length as the underlying DataFrame and will be used as sampling probabilities after normalization within each group. Weights must be non-negative, with at least one positive element in each group.
random_state (int or ak.random.Generator, optional) – If an int, it is used as the seed for the random number generator. If an ak.random.Generator, it is used directly.
- Returns:
A new DataFrame containing items randomly sampled from each group, sorted according to the grouped columns.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"A": [3, 1, 2, 1, 2, 3], "B": [3, 4, 5, 6, 7, 8]})
>>> df
   A  B
0  3  3
1  1  4
2  2  5
3  1  6
4  2  7
5  3  8
(6 rows x 2 columns)
Sample one row per group:
>>> df.groupby("A").sample(random_state=6)
   A  B
3  1  6
4  2  7
5  3  8
(3 rows x 2 columns)
Sample a fraction of each group with weights:
>>> df.groupby("A").sample(
...     frac=0.5,
...     random_state=3,
...     weights=ak.array([1, 1, 1, 0, 0, 0]),
... )
   A  B
1  1  4
2  2  5
0  3  3
(3 rows x 2 columns)
Sample with replacement:
>>> df.groupby("A").sample(
...     n=3,
...     replace=True,
...     random_state=ak.random.default_rng(7),
... )
   A  B
1  1  4
3  1  6
1  1  4
4  2  7
4  2  7
4  2  7
0  3  3
5  3  8
5  3  8
(9 rows x 2 columns)
- size(as_series=None, sort_index=True)[source]¶
Compute the size of each value as the total number of rows, including NaN values.
- Parameters:
as_series (bool, optional) – If True, return the result as a Series; if False, return it as a DataFrame.
sort_index (bool, default=True) – If True, return the result with indices sorted.
- Return type:
arkouda.pandas.dataframe.DataFrame or arkouda.pandas.series.Series
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"A": [1, 2, 2, 3], "B": [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  2  5
3  3  6
(4 rows x 2 columns)

>>> df.groupby("A").size(as_series=False)
   size
A
1     1
2     2
3     1
(3 rows x 1 columns)
- tail(n: int = 5, sort_index: bool = True) DataFrame[source]¶
Return the last n rows from each group.
- Parameters:
n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the rows from that group will be returned.
sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.
- Return type:
Examples
>>> import arkouda as ak
>>> df = ak.DataFrame({"a": ak.arange(10) % 3, "b": ak.arange(10)})
>>> df
   a  b
0  0  0
1  1  1
2  2  2
3  0  3
4  1  4
5  2  5
6  0  6
7  1  7
8  2  8
9  0  9
(10 rows x 2 columns)

>>> df.groupby("a").tail(2)
   a  b
4  1  4
5  2  5
6  0  6
7  1  7
8  2  8
9  0  9
(6 rows x 2 columns)
- where_not_nan = None¶
- class arkouda.DatetimeAccessor(series)[source]¶
Bases:
Properties
Accessor for datetime-like operations on Arkouda Series.
Provides datetime methods such as .floor(), .ceil(), and .round(), mirroring the .dt accessor in pandas.
This accessor is automatically attached to Series objects that wrap arkouda.Datetime values. It should not be instantiated directly.
- Parameters:
series (arkouda.pandas.Series) – The Series object containing Datetime values.
- Raises:
AttributeError – If the underlying Series values are not of type arkouda.Datetime.
Examples
>>> import arkouda as ak
>>> from arkouda import Datetime, Series
>>> s = Series(Datetime(ak.array([1_000_000_000_000])))
>>> s.dt.floor("D")
0   1970-01-01
dtype: datetime64[ns]
- series¶
- class arkouda.DiffAggregate(gb, series)[source]¶
A column in a GroupBy that has been differenced.
Aggregation operations can be done on the result.
- gb¶
GroupBy object, where the aggregation keys are values of column(s) of a dataframe.
- Type:
- class arkouda.Index(values: List | arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical | pandas.Index | Index | pandas.Categorical, name: str | None = None, allow_list=False, max_list_size=1000)[source]¶
Sequence used for indexing and alignment.
The basic object storing axis labels for all DataFrame objects.
- Parameters:
values (List, pdarray, Strings, Categorical, pandas.Categorical, pandas.Index, or Index)
name (str, default=None) – Name to be stored in the index.
allow_list (bool, default=False) – If False, list values will be converted to a pdarray. If True, list values will remain as a list, provided the data length is less than max_list_size.
max_list_size (int, default=1000) – The maximum allowed data length for the values to be stored as a list object.
- Raises:
ValueError – Raised if allow_list=True and the size of values is > max_list_size.
See also
Examples
>>> import arkouda as ak
>>> ak.Index([1, 2, 3])
Index(array([1 2 3]), dtype='int64')

>>> ak.Index(list('abc'))
Index(array(['a', 'b', 'c']), dtype='<U0')

>>> ak.Index([1, 2, 3], allow_list=True)
Index([1, 2, 3], dtype='int64')
- argsort(ascending: bool = True) list | arkouda.numpy.pdarrayclass.pdarray[source]¶
Return the permutation that sorts the Index.
- Parameters:
ascending (bool, optional) – If True (default), sort in ascending order. If False, sort in descending order.
- Returns:
Indices that would sort the Index.
- Return type:
list or pdarray
Examples
>>> import arkouda as ak
>>> idx = ak.Index([10, 3, 5])
>>> idx.argsort()
array([1 2 0])
- equals(other: Index) arkouda.numpy.dtypes.bool_scalars[source]¶
Whether Indexes are the same size, and all entries are equal.
- Parameters:
other (Index) – object to compare.
- Returns:
True if the Indexes are the same, otherwise False.
- Return type:
Examples
>>> import arkouda as ak
>>> i = ak.Index([1, 2, 3])
>>> i_cpy = ak.Index([1, 2, 3])
>>> i.equals(i_cpy)
np.True_
>>> i2 = ak.Index([1, 2, 4])
>>> i.equals(i2)
np.False_
MultiIndex case:
>>> arrays = [ak.array([1, 1, 2, 2]), ak.array(["red", "blue", "red", "blue"])]
>>> m = ak.MultiIndex(arrays, names=["numbers2", "colors2"])
>>> m.equals(m)
True
>>> arrays2 = [ak.array([1, 1, 2, 2]), ak.array(["red", "blue", "red", "green"])]
>>> m2 = ak.MultiIndex(arrays2, names=["numbers2", "colors2"])
>>> m.equals(m2)
False
- static factory(index)[source]¶
Construct an Index or MultiIndex based on the input.
- Parameters:
index (array-like or tuple of array-like) – If a single array-like, returns an Index. If a tuple of array-like objects, returns a MultiIndex.
- Returns:
An Index if input is a single array-like, or a MultiIndex otherwise.
- Return type:
Index or MultiIndex
- classmethod from_return_msg(rep_msg)[source]¶
Reconstruct an Index or MultiIndex from a return message.
- Parameters:
rep_msg (str) – A string return message containing encoded index information.
- Returns:
The reconstructed Index or MultiIndex instance.
- Return type:
Index or MultiIndex
- is_registered()[source]¶
Return whether the object is registered.
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
register, attach, unregister
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- property is_unique¶
Property indicating if all values in the index are unique.
- Return type:
bool - True if all values are unique, False otherwise.
- lookup(key)[source]¶
Check for presence of key(s) in the Index.
- Parameters:
key (pdarray or scalar) – The value(s) to look up in the Index. If a scalar is provided, it will be converted to a one-element array.
- Returns:
A boolean array of length len(self), indicating which entries of the Index are present in key.
- Return type:
- Raises:
TypeError – If key cannot be converted to an arkouda array.
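The semantics can be sketched locally in plain Python (a hypothetical stand-in; no server involved): each entry of the Index is tested for membership in the key set, producing a boolean mask of length len(self).

```python
def lookup_sketch(index_values, keys):
    """Boolean mask: True where an Index entry appears among the keys."""
    key_set = set(keys)
    return [v in key_set for v in index_values]

print(lookup_sketch([10, 3, 5], [5, 10]))  # [True, False, True]
```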
- map(arg: dict | arkouda.pandas.series.Series) Index[source]¶
Map values of Index according to an input mapping.
- Parameters:
arg (dict or Series) – The mapping correspondence.
- Returns:
A new index with the values transformed by the mapping correspondence.
- Return type:
- Raises:
TypeError – Raised if arg is not of type dict or arkouda.pandas.Series. Raised if index values not of type pdarray, Categorical, or Strings.
Examples
>>> import arkouda as ak
>>> idx = ak.Index(ak.array([2, 3, 2, 3, 4]))
>>> idx
Index(array([2 3 2 3 4]), dtype='int64')
>>> idx.map({4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})
Index(array([30.00000000000000000 5.00000000000000000 30.00000000000000000 5.00000000000000000 25.00000000000000000]), dtype='float64')
>>> s2 = ak.Series(ak.array(["a", "b", "c", "d"]), index=ak.array([4, 2, 1, 3]))
>>> idx.map(s2)
Index(array(['b', 'd', 'b', 'd', 'a']), dtype='<U0')
- max_list_size = 1000¶
- memory_usage(unit='B')[source]¶
Return the memory usage of the Index values.
- Parameters:
unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.
- Returns:
Bytes of memory consumed.
- Return type:
int
See also
arkouda.numpy.pdarrayclass.nbytes, arkouda.pandas.index.MultiIndex.memory_usage, arkouda.pandas.series.Series.memory_usage, arkouda.pandas.dataframe.DataFrame.memory_usage
Examples
>>> import arkouda as ak
>>> idx = ak.Index(ak.array([1, 2, 3]))
>>> idx.memory_usage()
24
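The example above (three int64 values occupying 24 bytes) implies the following scaling; this local sketch assumes 1024-based conversion factors for the KB/MB/GB units, which is an assumption rather than something confirmed by the Arkouda source:

```python
def memory_usage_sketch(num_elements, itemsize=8, unit="B"):
    """Bytes used by a fixed-width array, optionally scaled to KB/MB/GB."""
    factors = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
    return num_elements * itemsize / factors[unit]

print(memory_usage_sketch(3))  # 24.0, matching the Index example above
```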
- property names¶
Return Index or MultiIndex names.
- property ndim¶
Number of dimensions of the underlying data, by definition 1.
See also
- property nlevels¶
Integer number of levels in this Index.
An Index will always have 1 level.
See also
- objType = 'Index'¶
- register(user_defined_name)[source]¶
Register this Index object and underlying components with the Arkouda server.
- Parameters:
user_defined_name (str) – User-defined name the Index is to be registered under; this will be the root name for underlying components.
- Returns:
The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Index with the user_defined_name
See also
unregister, attach, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- set_dtype(dtype)[source]¶
Change the data type of the index.
Currently only aku.ip_address and ak.array are supported.
- property shape¶
Return the shape of the Index.
- Returns:
A tuple representing the shape of the Index (size,).
- Return type:
tuple
- sort_values(return_indexer: bool = False, ascending: bool = True, na_position: str = 'last') Index | Tuple[Index, arkouda.numpy.pdarrayclass.pdarray | list][source]¶
Return a sorted copy of the index.
- Parameters:
return_indexer (bool, default False) – If True, also return the integer positions that sort the index.
ascending (bool, default True) – Sort in ascending order. Use False for descending.
na_position ({'first', 'last'}, default 'last') – Where to position NaNs. ‘first’ puts NaNs at the beginning, ‘last’ at the end.
- Returns:
- sorted_index : arkouda.Index
A new Index whose values are sorted.
- indexer : Union[arkouda.pdarray, list], optional
The indices that would sort the original index. Only returned when
return_indexer=True.
- Return type:
Examples
>>> import arkouda as ak
>>> idx = ak.Index([10, 100, 1, 1000])
>>> idx
Index(array([10 100 1 1000]), dtype='int64')

Sort in ascending order (default):

>>> idx.sort_values()
Index(array([1 10 100 1000]), dtype='int64')

Sort in descending order and get the sort positions:

>>> idx.sort_values(ascending=False, return_indexer=True)
(Index(array([1000 100 10 1]), dtype='int64'), array([3 1 0 2]))
- to_csv(prefix_path: str, dataset: str = 'index', col_delim: str = ',', overwrite: bool = False)[source]¶
Write Index to CSV file(s).
File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the pdarray under. Defaults to "index".
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all CSV files or if one or more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server. Raised if the Index values are a list.
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
- to_hdf(prefix_path: str, dataset: str = 'index', mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute') str[source]¶
Save the Index to HDF5.
The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
TypeError – Raised if the Index values are a list.
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
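The per-locale file naming described in these notes can be sketched with a small hypothetical helper (not part of the Arkouda API); the four-digit zero padding follows the _LOCALE#### pattern shown elsewhere on this page:

```python
def locale_file_names(prefix_path, num_locales):
    """File names produced with file_type='distribute': one file per locale."""
    return [f"{prefix_path}_LOCALE{i:04d}" for i in range(num_locales)]

print(locale_file_names("my_data", 2))  # ['my_data_LOCALE0000', 'my_data_LOCALE0001']
```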
- to_ndarray()[source]¶
Convert the Index values to a NumPy ndarray.
- Returns:
A NumPy array representation of the Index values.
- Return type:
numpy.ndarray
- to_pandas()[source]¶
Convert this Arkouda-backed index wrapper to an equivalent pandas Index.
This method materializes the underlying values into a local NumPy array (or pandas Categorical, when applicable) and returns the corresponding pandas Index (or CategoricalIndex).
- Returns:
A pandas Index representing the same logical values. For categorical data, a pandas.CategoricalIndex is returned.
- Return type:
pandas.Index
Notes
If the underlying values are categorical, this returns a pandas.CategoricalIndex.
For unicode string-like data (or object arrays inferred as strings), this attempts to return a pandas "string" dtype Index to match pandas' missing-value behavior (e.g., NA handling).
Fixed-width bytes data is preserved as bytes (no implicit decoding).
Examples
>>> import arkouda as ak
>>> import pandas
>>> idx = ak.Index(ak.array([1, 2, 3]))
>>> pidx = idx.to_pandas()
>>> pidx.dtype
dtype('<i8')
- to_parquet(prefix_path: str, dataset: str = 'index', mode: Literal['truncate', 'append'] = 'truncate', compression: str | None = None)[source]¶
Save the Index to Parquet.
The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
TypeError – Raised if the Index values are a list.
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
- 'append' write mode is supported, but is not efficient.
- If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
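The per-locale naming scheme above can be sketched as follows; `locale_file_names` is a hypothetical helper for illustration, not part of the Arkouda API:

```python
# Hypothetical helper illustrating the per-locale output naming scheme
# used for distributed saves; it is NOT part of the Arkouda API.
def locale_file_names(prefix_path: str, num_locales: int) -> list[str]:
    """Return the file names a file_type='distribute' save would produce."""
    return [f"{prefix_path}_LOCALE{i}" for i in range(num_locales)]

print(locale_file_names("/data/myindex", 3))
# ['/data/myindex_LOCALE0', '/data/myindex_LOCALE1', '/data/myindex_LOCALE2']
```

Each locale writes only its own chunk, so the number of files equals the number of server locales.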
- tolist()[source]¶
Convert the Index values to a Python list.
- Returns:
A list containing the Index values.
- Return type:
list
- unregister()[source]¶
Unregister this Index object in the arkouda server.
Unregister this Index object in the arkouda server, which was previously registered using register() and/or attached to using attach().
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)[source]¶
Overwrite the dataset with the name provided with this Index object.
If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the index
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data.
- class arkouda.LogLevel(*args, **kwds)[source]¶
Bases:
enum.Enum
Enum for defining valid log levels used by ArkoudaLogger.
Members¶
- INFOstr
Confirmation that things are working as expected.
- DEBUGstr
Detailed information, typically of interest only when diagnosing problems.
- WARNstr
An indication that something unexpected happened, or indicative of some problem.
- ERRORstr
A more serious problem, the software has not been able to perform some function.
- CRITICALstr
An extremely serious error, indicating the program itself may be unable to continue.
Notes
This enum provides a controlled vocabulary for setting log levels on ArkoudaLogger instances. These are mapped internally to the standard Python logging levels.
- CRITICAL = 'CRITICAL'¶
- DEBUG = 'DEBUG'¶
- ERROR = 'ERROR'¶
- INFO = 'INFO'¶
- WARN = 'WARN'¶
- class arkouda.MultiIndex(data: list | tuple | pandas.MultiIndex | MultiIndex, name: str | None = None, names: Iterable[Hashable | None] | None = None)[source]¶
Bases:
Index
A multi-level, or hierarchical, index object for Arkouda DataFrames and Series.
A MultiIndex allows you to represent multiple dimensions of indexing using a single object, enabling advanced indexing and grouping operations.
This class mirrors the behavior of pandas’ MultiIndex while leveraging Arkouda’s distributed data structures. Internally, it stores a list of Index objects, each representing one level of the hierarchy.
Examples
>>> import arkouda as ak
>>> from arkouda.pandas.index import MultiIndex
>>> a = ak.array([1, 2, 3])
>>> b = ak.array(['a', 'b', 'c'])
>>> mi = MultiIndex([a, b])
>>> mi[1]
MultiIndex([np.int64(2), np.str_('b')])
- concat(other)[source]¶
Concatenate this MultiIndex with another, preserving duplicates and order.
- Parameters:
other (MultiIndex) – The other MultiIndex to concatenate with.
- Returns:
A new MultiIndex containing values from both inputs, preserving order.
- Return type:
- Raises:
TypeError – If the type of other does not match.
- property dtype: numpy.dtype¶
Return the dtype object of the underlying data.
- equal_levels(other: MultiIndex) bool[source]¶
Return True if the levels of both MultiIndex objects are the same.
- get_level_values(level: str | int)[source]¶
Return the values at a particular level of the MultiIndex.
- Parameters:
level (int or str) – The level number or name. If a string is provided, it must match an entry in self.names.
- Returns:
An Index object corresponding to the requested level.
- Return type:
- Raises:
RuntimeError – If self.names is None and a string level is provided.
ValueError – If the provided string is not in self.names, or if the level index is out of bounds.
- property index¶
Return the levels of the MultiIndex.
- Returns:
A list of Index objects representing the levels of the MultiIndex.
- Return type:
list
- property inferred_type: str¶
Return the inferred type of the MultiIndex.
- Returns:
The string “mixed”, indicating the MultiIndex may contain multiple types.
- Return type:
- is_registered()[source]¶
Check if the MultiIndex is registered with the Arkouda server.
- Returns:
True if the MultiIndex has a registered name and is recognized by the server, False otherwise.
- Return type:
- levels: list[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical]¶
- lookup(key: list[Any] | tuple[Any, Ellipsis]) arkouda.pandas.groupbyclass.groupable[source]¶
Perform element-wise lookup on the MultiIndex.
- Parameters:
key (list or tuple) –
A sequence of values, one for each level of the MultiIndex.
If the elements are scalars (e.g., (1, "red")), they are treated as a single row key: the result is a boolean mask over rows where all levels match the corresponding scalar.
If the elements are arkouda arrays (e.g., a list of pdarrays / Strings), they must align one-to-one with the levels, and the lookup is delegated to in1d(self.index, key) for multi-column membership.
- Returns:
A boolean array indicating which rows in the MultiIndex match the key.
- Return type:
- Raises:
TypeError – If key is not a list or tuple.
ValueError – If the length of key does not match the number of levels.
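The scalar-key behavior described above (a boolean mask over rows where every level matches) can be sketched in plain Python; `lookup_mask` is a hypothetical illustration, not the Arkouda implementation:

```python
# Pure-Python sketch of scalar-key MultiIndex lookup semantics: the result
# is True for rows where every level matches the corresponding key element.
def lookup_mask(levels: list, key: tuple) -> list:
    if len(key) != len(levels):
        raise ValueError("key length must match the number of levels")
    n_rows = len(levels[0])
    return [
        all(levels[lvl][row] == key[lvl] for lvl in range(len(levels)))
        for row in range(n_rows)
    ]

level0 = [1, 2, 1]
level1 = ["red", "red", "blue"]
print(lookup_mask([level0, level1], (1, "red")))  # [True, False, False]
```

The real method performs this comparison on distributed arrays server-side rather than row by row.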
- memory_usage(unit='B')[source]¶
Return the memory usage of the MultiIndex levels.
- Parameters:
unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.
- Returns:
Bytes of memory consumed.
- Return type:
int
See also
arkouda.numpy.pdarrayclass.nbytes, arkouda.pandas.index.Index.memory_usage, arkouda.pandas.series.Series.memory_usage, arkouda.pandas.dataframe.DataFrame.memory_usage
Examples
>>> import arkouda as ak
>>> m = ak.pandas.index.MultiIndex([ak.array([1,2,3]),ak.array([4,5,6])])
>>> m.memory_usage()
48
- property name¶
Return Index or MultiIndex name.
- property names¶
Return Index or MultiIndex names.
- property ndim¶
Number of dimensions of the underlying data, by definition 1.
See also
- property nlevels: int¶
Integer number of levels in this MultiIndex.
See also
- objType = 'MultiIndex'¶
- register(user_defined_name)[source]¶
Register this Index object and underlying components with the Arkouda server.
- Parameters:
user_defined_name (str) – user defined name the Index is to be registered under, this will be the root name for underlying components
- Returns:
The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Index with the user_defined_name
See also
unregister, attach, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- set_dtype(dtype)[source]¶
Change the data type of the index.
Currently only aku.ip_address and ak.array are supported.
- to_dict(labels=None)[source]¶
Convert the MultiIndex to a dictionary representation.
- Parameters:
labels (list of str, optional) – A list of column names for the index levels. If not provided, defaults to [‘idx_0’, ‘idx_1’, …, ‘idx_n’].
- Returns:
A dictionary mapping each label to the corresponding Index object.
- Return type:
dict
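The default-label behavior described above can be sketched in plain Python; `to_dict_sketch` is a hypothetical stand-in for the server-backed method:

```python
# Sketch of MultiIndex.to_dict's default labeling: when no labels are
# given, levels are keyed as 'idx_0', 'idx_1', ... in order.
def to_dict_sketch(levels: list, labels=None) -> dict:
    if labels is None:
        labels = [f"idx_{i}" for i in range(len(levels))]
    return dict(zip(labels, levels))

print(to_dict_sketch([[1, 2], ["a", "b"]]))
# {'idx_0': [1, 2], 'idx_1': ['a', 'b']}
```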
- to_hdf(prefix_path: str, dataset: str = 'index', mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute') str[source]¶
Save the Index to HDF5.
The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type ({"single" | "distribute"}) – Default: "distribute". When set to single, the dataset is written to a single file. When distribute, the dataset is written to one file per locale. This is only supported by HDF5 files and has no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'. Otherwise, the file name will be prefix_path.
- If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- to_ndarray()[source]¶
Convert the MultiIndex to a NumPy ndarray of arrays.
- Returns:
A NumPy array where each element is an array corresponding to one level of the MultiIndex. Categorical levels are converted to their underlying arrays.
- Return type:
numpy.ndarray
- to_pandas()[source]¶
Convert the MultiIndex to a pandas.MultiIndex object.
- Returns:
A pandas MultiIndex with the same levels and names.
- Return type:
pandas.MultiIndex
Notes
Categorical levels are converted to pandas categorical arrays, while others are converted to NumPy arrays.
- tolist()[source]¶
Convert the MultiIndex to a list of lists.
- Returns:
A list of Python lists, where each inner list corresponds to one level of the MultiIndex.
- Return type:
list
- unregister()[source]¶
Unregister this MultiIndex from the Arkouda server.
- Raises:
RegistrationError – If the MultiIndex is not currently registered.
- update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)[source]¶
Overwrite the dataset with the name provided with this Index object.
If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the index
TypeError – Raised if the Index levels are a list.
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data.
- class arkouda.Power_divergenceResult[source]¶
Bases:
Power_divergenceResult
The results of a power divergence statistical test.
- class arkouda.Properties[source]¶
Base class for accessor implementations in Arkouda.
Provides the _make_op class method to dynamically generate accessor methods that wrap underlying Strings or Datetime operations and return new Series.
Notes
This class is subclassed by StringAccessor and DatetimeAccessor, and is not intended to be used directly.
Examples
Subclasses should define _make_op(“operation_name”), which will generate a method that applies series.values.operation_name(…) and returns a new Series.
- arkouda.RegisteredSymbols = '__RegisteredSymbols__'¶
- class arkouda.Row(dict=None, /, **kwargs)[source]¶
Bases:
collections.UserDict
Dictionary-like representation of a single row in an Arkouda DataFrame.
Wraps the column-to-value mapping for one row and provides convenient ASCII and HTML formatting for display.
- Parameters:
data (dict) – Mapping of column names to their corresponding values for this row.
Examples
>>> import arkouda as ak
>>> from arkouda.pandas.row import Row
>>> df = ak.DataFrame({"x": ak.array([10, 20]), "y": ak.array(["a", "b"])})
Suppose df[0] returns {"x": 10, "y": "a"}:
>>> row = Row({"x": 10, "y": "a"})
>>> print(row)
keys    values
------  --------
x       10
y       a
- class arkouda.Series(data: Tuple | List | arkouda.pandas.groupbyclass.groupable_element_type | Series | arkouda.numpy.segarray.SegArray | pandas.Series | pandas.Categorical, name=None, index: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | Tuple | List | arkouda.pandas.index.Index | None = None)[source]¶
One-dimensional Arkouda array with axis labels.
- Parameters:
- Raises:
TypeError – Raised if index is not a pdarray or Strings object, or if data is not a supported type.
ValueError – Raised if the index size does not match the data size.
Notes
The Series class accepts either positional arguments or keyword arguments.
- Positional arguments
Series(data): data is provided and an index is generated automatically.
Series(data, index): both data and index are provided.
- Keyword arguments
Series(data=..., index=...): index is optional but must match the size of data when provided.
- property at: _LocIndexer¶
Accesses entries of a Series by label.
- Returns:
An indexer for label-based access to Series entries.
- Return type:
_LocIndexer
- static concat(arrays: List, axis: int = 0, index_labels: List[str] | None = None, value_labels: List[str] | None = None, ordered: bool = False) arkouda.pandas.dataframe.DataFrame | Series[source]¶
Concatenate a list of Arkouda Series or grouped arrays horizontally or vertically.
If a list of grouped Arkouda arrays is passed, they are converted to Series. Each grouping is a 2-tuple where the first item is the key(s) and the second is the value. If concatenating horizontally (axis=1), all series/groupings must have the same length and the same index. The index is converted to a column in the resulting DataFrame; if it’s a MultiIndex, each level is converted to a separate column.
- Parameters:
arrays (List) – A list of Series or groupings (tuples of index and values) to concatenate.
axis (int) – The axis to concatenate along: - 0 = vertical (stack series into one) - 1 = horizontal (align by index and produce a DataFrame) Defaults to 0.
index_labels (List[str] or None, optional) – Column name(s) to label the index when axis=1.
value_labels (List[str] or None, optional) – Column names to label the values of each Series.
ordered (bool) – Unused parameter. Reserved for future support of deterministic vs. performance-optimized concatenation. Defaults to False.
- Returns:
If axis=0: a new Series
If axis=1: a new DataFrame
- Return type:
- diff() Series[source]¶
Diffs consecutive values of the series.
Returns a new series with the same index and length. First value is set to NaN.
- dt¶
- property dtype: numpy.dtype¶
- fillna(value: supported_scalars | Series | arkouda.numpy.pdarrayclass.pdarray) Series[source]¶
Fill NA/NaN values using the specified method.
- Parameters:
value (supported_scalars, Series, or pdarray) – Value to use to fill holes (e.g. 0), alternately a Series of values specifying which value to use for each index. Values not in the Series will not be filled. This value cannot be a list.
- Returns:
Object with missing values filled.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> data = ak.Series([1, np.nan, 3, np.nan, 5])
>>> data
0    1.0
1    NaN
2    3.0
3    NaN
4    5.0
dtype: float64
>>> fill_values1 = ak.ones(5)
>>> data.fillna(fill_values1)
0    1.0
1    1.0
2    3.0
3    1.0
4    5.0
dtype: float64
>>> fill_values2 = Series(ak.ones(5))
>>> data.fillna(fill_values2)
0    1.0
1    1.0
2    3.0
3    1.0
4    5.0
dtype: float64
>>> fill_values3 = 100.0
>>> data.fillna(fill_values3)
0      1.0
1    100.0
2      3.0
3    100.0
4      5.0
dtype: float64
- classmethod from_return_msg(rep_msg: str) Series[source]¶
Return a Series instance pointing to components created by the arkouda server.
The user should not call this function directly.
- Parameters:
rep_msg (builtin_str) –
delimited string containing the values and indexes.
- Returns:
A Series representing a set of pdarray components on the server.
- Return type:
- Raises:
RuntimeError – Raised if a server-side error is thrown in the process of creating the Series instance.
- has_repeat_labels() bool[source]¶
Return whether the Series has any labels that appear more than once.
- hasnans() arkouda.numpy.dtypes.bool_scalars[source]¶
Return True if there are any NaNs.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = ak.Series(ak.array([1, 2, 3, np.nan]))
>>> s
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
>>> s.hasnans()
np.True_
- property iat: _iLocIndexer¶
Accesses entries of a Series by position.
- Returns:
An indexer for position-based access to a single element.
- Return type:
_iLocIndexer
- property iloc: _iLocIndexer¶
Accesses entries of a Series by position.
- Returns:
An indexer for position-based access to Series entries.
- Return type:
_iLocIndexer
- is_registered() bool[source]¶
- Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
register, attach, unregister
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- isin(lst: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | List) Series[source]¶
Find Series elements whose values are in the specified list.
- isna() Series[source]¶
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as numpy.NaN, get mapped to True. Everything else gets mapped to False. Characters such as empty strings ‘’ are not considered NA values.
- Returns:
Mask of bool values for each element in Series that indicates whether an element is an NA value.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.isna()
1    False
2    False
4     True
dtype: bool
- isnull() Series[source]¶
Series.isnull is an alias for Series.isna.
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as numpy.NaN, get mapped to True. Everything else gets mapped to False. Characters such as empty strings ‘’ are not considered NA values.
- Returns:
Mask of bool values for each element in Series that indicates whether an element is an NA value.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.isnull()
1    False
2    False
4     True
dtype: bool
- property loc: _LocIndexer¶
Accesses entries of a Series by label.
- Returns:
An indexer for label-based access to Series entries.
- Return type:
_LocIndexer
- locate(key: int | arkouda.numpy.pdarrayclass.pdarray | arkouda.pandas.index.Index | Series | List | Tuple) Series[source]¶
Lookup values by index label.
- Parameters:
key (int, pdarray, Index, Series, List, or Tuple) –
The key or keys to look up. This can be: - A scalar - A list of scalars - A list of lists (for MultiIndex) - A Series (in which case labels are preserved, and its values are used as keys)
Keys will be converted to Arkouda arrays as needed.
- Returns:
A Series containing the values corresponding to the key.
- Return type:
- map(arg: dict | arkouda.Series) arkouda.Series[source]¶
Map values of Series according to an input mapping.
- Parameters:
arg (dict or Series) – The mapping correspondence.
- Returns:
A new series with the same index as the caller. When the input Series has Categorical values, the return Series will have Strings values. Otherwise, the return type will match the input type.
- Return type:
- Raises:
TypeError – Raised if arg is not of type dict or arkouda.Series. Raised if the series values are not of type pdarray, Categorical, or Strings.
Examples
>>> import arkouda as ak
>>> s = ak.Series(ak.array([2, 3, 2, 3, 4]))
>>> s
0    2
1    3
2    2
3    3
4    4
dtype: int64
>>> s.map({4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})
0    30.0
1     5.0
2    30.0
3     5.0
4    25.0
dtype: float64
>>> s2 = ak.Series(ak.array(["a","b","c","d"]), index = ak.array([4,2,1,3]))
>>> s.map(s2)
0    b
1    d
2    b
3    d
4    a
dtype: ...
- memory_usage(index: bool = True, unit: Literal['B', 'KB', 'MB', 'GB'] = 'B') int[source]¶
Return the memory usage of the Series.
The memory usage can optionally include the contribution of the index.
- Parameters:
index (bool) – Specifies whether to include the memory usage of the Series index. Defaults to True.
unit ({"B", "KB", "MB", "GB"}) – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}. Defaults to “B”.
- Returns:
Bytes of memory consumed.
- Return type:
int
See also
arkouda.numpy.pdarrayclass.nbytes, arkouda.Index.memory_usage, arkouda.pandas.series.Series.memory_usage, arkouda.pandas.dataframe.DataFrame.memory_usage
Examples
>>> import arkouda as ak
>>> from arkouda.pandas.series import Series
>>> s = ak.Series(ak.arange(3))
>>> s.memory_usage()
48
Not including the index gives the size of the rest of the data, which is necessarily smaller:
>>> s.memory_usage(index=False)
24
Select the units:
>>> s = ak.Series(ak.arange(3000))
>>> s.memory_usage(unit="KB")
46.875
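The unit parameter scales the byte count by powers of 1024, as the 46.875 KB figure above (48000 bytes for 3000 int64 values plus their int64 index) illustrates. A sketch of that conversion; `convert_bytes` is a hypothetical helper, not part of the Arkouda API:

```python
# Sketch of the unit scaling implied by memory_usage: bytes are divided
# by 1024**k, where k depends on the requested unit.
_UNITS = {"B": 0, "KB": 1, "MB": 2, "GB": 3}

def convert_bytes(nbytes: int, unit: str = "B") -> float:
    """Scale a raw byte count into the requested unit."""
    return nbytes / (1024 ** _UNITS[unit])

print(convert_bytes(48000, "KB"))  # 46.875
```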
- property ndim: int¶
- notna() Series[source]¶
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings ‘’ are not considered NA values. NA values, such as numpy.NaN, get mapped to False values.
- Returns:
Mask of bool values for each element in Series that indicates whether an element is not an NA value.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.notna()
1     True
2     True
4    False
dtype: bool
- notnull() Series[source]¶
Series.notnull is an alias for Series.notna.
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings ‘’ are not considered NA values. NA values, such as numpy.NaN, get mapped to False values.
- Returns:
Mask of bool values for each element in Series that indicates whether an element is not an NA value.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.notnull()
1     True
2     True
4    False
dtype: bool
- objType = 'Series'¶
- static pdconcat(arrays: List, axis: int = 0, labels: arkouda.numpy.strings.Strings | None = None) pandas.Series | pandas.DataFrame[source]¶
Concatenate a list of Arkouda Series or grouped arrays, returning a local pandas object.
If a list of grouped Arkouda arrays is passed, they are converted to Series. Each grouping is a 2-tuple with the first item being the key(s) and the second the value.
If axis=1 (horizontal), each Series or grouping must have the same length and the same index. The index is converted to a column in the resulting DataFrame. If it is a MultiIndex, each level is converted to a separate column.
- Parameters:
arrays (List) – A list of Series or groupings (tuples of index and values) to concatenate.
axis (int) – The axis along which to concatenate: - 0 = vertical (stack into a Series) - 1 = horizontal (align by index into a DataFrame) Defaults to 0.
labels (Strings or None, optional) – Names to assign to the resulting columns in the DataFrame.
- Returns:
If axis=0: a local pandas Series
If axis=1: a local pandas DataFrame
- Return type:
- register(user_defined_name: str)[source]¶
Register this Series object and underlying components with the Arkouda server.
- Parameters:
user_defined_name (builtin_str) – User-defined name the Series is to be registered under. This will be the root name for the underlying components.
- Returns:
The same Series which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Series with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Series with the user_defined_name
See also
unregister, attach, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- property shape: Tuple[int]¶
- size¶
- str¶
- to_dataframe(index_labels: List[str] | None = None, value_label: str | None = None) arkouda.pandas.dataframe.DataFrame[source]¶
Convert the Series to an Arkouda DataFrame.
- to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]¶
Print Series in Markdown-friendly format.
- Parameters:
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) – Add index (row) labels.
tablefmt (str = "grid") – Table format to call from tabulate: https://pypi.org/project/tabulate/
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values.
**kwargs – These parameters will be passed to tabulate.
Note
This function should only be called on small Series as it calls pandas.Series.to_markdown: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_markdown.html
Examples
>>> import arkouda as ak
>>> s = ak.Series(["elk", "pig", "dog", "quetzal"], name="animal")
>>> print(s.to_markdown())
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+
Output markdown with a tabulate option.
>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+
- to_pandas() pandas.Series[source]¶
Convert the Series to a local pandas Series.
- topn(n: int = 10) Series[source]¶
Return the top values of the Series.
- Parameters:
n (int) – Number of values to return. Defaults to 10.
- Returns:
A new Series containing the top n values.
- Return type:
- unregister()[source]¶
Unregister this Series object in the arkouda server, which was previously registered using register() and/or attached to using attach().
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- validate_key(key: Series | arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical | List | supported_scalars | arkouda.numpy.segarray.SegArray) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.pandas.categorical.Categorical | supported_scalars | arkouda.numpy.segarray.SegArray[source]¶
Validate type requirements for keys when reading or writing the Series.
Also converts list and tuple arguments into pdarrays.
- Parameters:
key (Series, pdarray, Strings, Categorical, List, supported_scalars, or SegArray) – The key or container of keys that might be used to index into the Series.
- Return type:
The validated key(s), with lists and tuples converted to pdarrays
- Raises:
TypeError – Raised if keys are not boolean values and do not match the type of the labels. Raised if key is not one of the supported types.
KeyError – Raised if container of keys has keys not present in the Series
IndexError – Raised if the length of a boolean key array is different from the Series
- validate_val(val: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | supported_scalars | List) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | supported_scalars[source]¶
Validate type requirements for values being written into the Series.
Also converts list and tuple arguments into pdarrays.
- Parameters:
val (pdarray, Strings, supported_scalars, or List) – The value or container of values that might be assigned into the Series.
- Return type:
The validated value, with lists converted to pdarrays
- Raises:
TypeError –
Raised if val is not the same type as the Series, or a container with elements of the same type as the Series.
Raised if val is a string or Strings type.
Raised if val is not one of the supported types.
- class arkouda.StringAccessor(series)[source]¶
Bases:
Properties
Accessor for string operations on Arkouda Series.
Provides string-like methods such as .contains(), .startswith(), and .endswith() via the .str accessor, similar to pandas.
This accessor is automatically attached to Series objects that wrap arkouda.Strings or arkouda.Categorical values. It should not be instantiated directly.
- Parameters:
series (arkouda.pandas.Series) – The Series object containing Strings or Categorical values.
- Raises:
AttributeError – If the underlying Series values are not Strings or Categorical.
Examples
>>> import arkouda as ak
>>> from arkouda import Series
>>> s = Series(["apple", "banana", "apricot"])
>>> s.str.startswith("a")
0     True
1    False
2     True
dtype: bool
- series¶
- arkouda.apply(arr: arkouda.numpy.pdarrayclass.pdarray, func: Callable | str, result_dtype: numpy.dtype | str | None = None) arkouda.numpy.pdarrayclass.pdarray[source]¶
Apply a python function to a pdarray.
The function should take one argument and return a new value. The function will then be called on each element in the pdarray.
Warning
This function is experimental and may not work as expected.
Known limitations:
Any python modules used inside of the function must be installed on the server.
- Parameters:
arr (pdarray) – The pdarray to which the function is applied.
func (Union[Callable, str]) – The function to apply to the array. This can be a callable function or a string; either way it should take a single argument and return a single value. If a string, it should be a lambda function that takes a single argument, e.g. "lambda x,: x+1". Note the dangling comma after the argument; this is required for string functions.
result_dtype (Optional[Union[np.dtype, str]]) – The dtype of the resulting pdarray. If None, the dtype of the resulting pdarray will be the same as the input pdarray. If a string, it should be a valid NumPy dtype string, e.g. "float64". If a NumPy dtype, it should be a valid NumPy dtype object, e.g. np.float64. This is not supported for functions passed as strings.
- Returns:
The pdarray resulting from applying the function to the input array.
- Return type:
pdarray
Examples
>>> import arkouda as ak
>>> arr = ak.apply(ak.array([1, 2, 3]), lambda x: x + 1)
>>> arr
array([2 3 4])
>>> import arkouda as ak
>>> import math
>>> arr = ak.randint(0, 10, 4, seed=1)
>>> def times_pi(x):
...     return x * math.pi
>>> arr = ak.apply(arr, times_pi, "float64")
>>> arr
array([21.991148575128552 28.274333882308138 15.707963267948966 3.1415926535897931])
- arkouda.assert_almost_equal(left, right, rtol: float = 1e-05, atol: float = 1e-08, **kwargs) None[source]¶
Check that the left and right objects are approximately equal.
By approximately equal, we refer to objects that are numbers or that contain numbers which may be equivalent to specific levels of precision.
- Parameters:
left (object)
right (object)
rtol (float, default 1e-5) – Relative tolerance.
atol (float, default 1e-8) – Absolute tolerance.
Warning
This function cannot be used on pdarray of size > ak.core.client.maxTransferBytes because it converts pdarrays to numpy arrays and calls np.allclose.
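Since the docstring notes that the comparison ultimately calls np.allclose, its tolerance semantics can be sketched locally with NumPy (a client-side analogue, not the Arkouda implementation):

```python
import numpy as np

# Sketch of the tolerance rule assert_almost_equal inherits from np.allclose:
# values pass when |left - right| <= atol + rtol * |right|.
def almost_equal(left, right, rtol=1e-5, atol=1e-8):
    return np.allclose(left, right, rtol=rtol, atol=atol)

print(almost_equal(np.array([1.0, 2.0]), np.array([1.0, 2.0 + 1e-9])))  # True
print(almost_equal(np.array([1.0]), np.array([1.1])))                    # False
```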
- arkouda.assert_almost_equivalent(left, right, rtol: float = 1e-05, atol: float = 1e-08) None[source]¶
Check that two objects are approximately equal.
By approximately equal, we refer to objects that are numbers or that contain numbers which may be equivalent to specific levels of precision.
If the objects are pandas or numpy objects, they are converted to Arkouda objects. Then assert_almost_equal is applied to the result.
- Parameters:
left (object) – First object to compare.
right (object) – Second object to compare.
rtol (float) – Relative tolerance. Default is 1e-5.
atol (float) – Absolute tolerance. Default is 1e-8.
- Raises:
TypeError – If either input is not a supported numeric-like type.
Warning
This function cannot be used on pdarrays of size > ak.core.client.maxTransferBytes because it converts pdarrays to numpy arrays and calls np.allclose.
See also
Examples
>>> import arkouda as ak
>>> from arkouda.testing import assert_almost_equivalent
>>> assert_almost_equivalent(0.123456, 0.123457, rtol=1e-4)
- arkouda.assert_arkouda_array_equal(left: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray, right: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]¶
Check that two Arkouda arrays are equivalent. Supports pdarray, Strings, Categorical, and SegArray.
- Parameters:
left (pdarray or Strings or Categorical or SegArray) – The first array to compare.
right (pdarray or Strings or Categorical or SegArray) – The second array to compare.
check_dtype (bool) – Whether to check dtype if both left and right are ak.pdarray. Defaults to True.
err_msg (str or None) – Custom assertion message, if provided. Defaults to None.
check_same ({'copy', 'same'} or None) – If not None, assert whether left and right share the same memory. - ‘copy’: assert that they do not share memory. - ‘same’: assert that they do share memory. Defaults to None.
obj (str) – Object name used in assertion error messages. Defaults to ‘pdarray’.
index_values (Index or pdarray or None) – Optional index shared by both left and right, used to enhance output in error messages. Defaults to None.
- arkouda.assert_arkouda_array_equivalent(left: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray | numpy.ndarray | pandas.Categorical, right: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray | numpy.ndarray | pandas.Categorical, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]¶
Check that two Arkouda-compatible arrays are equal.
Supported types include numpy arrays, pandas Categorical, and Arkouda arrays.
- Parameters:
left (pdarray, Strings, Categorical, SegArray, np.ndarray, or pd.Categorical) – First array to compare.
right (pdarray, Strings, Categorical, SegArray, np.ndarray, or pd.Categorical) – Second array to compare.
check_dtype (bool) – Whether to verify that dtypes match. Default is True.
err_msg (str or None) – Optional message to display on failure.
check_same (None or {"copy", "same"}) – Whether to ensure identity or separation in memory. Default is None.
obj (str) – Object label for error messages. Default is “pdarray”.
index_values (Index or pdarray, optional) – Shared index used in error output. Default is None.
- Raises:
TypeError – If either input is not a supported array type.
See also
Examples
>>> import arkouda as ak
>>> from arkouda import Strings
>>> from arkouda.testing import assert_arkouda_array_equivalent
>>> a = ak.array([1, 2, 3])
>>> b = ak.array([1, 2, 3])
>>> assert_arkouda_array_equivalent(a, b)
>>> s1 = ak.array(['x', 'y'])
>>> s2 = ak.array(['x', 'y'])
>>> assert_arkouda_array_equivalent(s1, s2)
- arkouda.assert_arkouda_pdarray_equal(left: arkouda.pdarray, right: arkouda.pdarray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]¶
Check that two Arkouda pdarray objects are equivalent.
- Parameters:
left (pdarray) – The first array to compare.
right (pdarray) – The second array to compare.
check_dtype (bool) – Whether to check dtype if both arrays are pdarrays. Defaults to True.
err_msg (str or None) – Custom assertion message to display on failure. Defaults to None.
check_same ({'copy', 'same'} or None) – If not None, asserts whether left and right share the same memory: - ‘copy’: assert they do not share memory - ‘same’: assert they do share memory Defaults to None.
obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘pdarray’.
index_values (Index or pdarray or None) – Optional index shared by both arrays, used to enhance output on failure. Defaults to None.
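The check_same option's memory semantics can be illustrated with a NumPy analogue (illustrative only; the real function compares Arkouda arrays held on the server):

```python
import numpy as np

a = np.arange(5)
view = a[:]          # a slice view shares its buffer with a
dup = a.copy()       # a copy has an independent buffer

# check_same='same' would require the first result; check_same='copy' the second.
print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, dup))   # False
```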
- arkouda.assert_arkouda_segarray_equal(left: arkouda.SegArray, right: arkouda.SegArray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'segarray') None[source]¶
Check that two Arkouda SegArray objects are equivalent.
- Parameters:
left (SegArray) – The first SegArray to compare.
right (SegArray) – The second SegArray to compare.
check_dtype (bool) – Whether to check dtype if both arrays contain pdarrays. Defaults to True.
err_msg (str or None) – Custom assertion message. Defaults to None.
check_same ({'copy', 'same'} or None) – If not None, asserts whether left and right share the same memory. - ‘copy’: assert that they do not share memory. - ‘same’: assert that they do share memory. Defaults to None.
obj (str) – Name of the object being compared (used in assertion messages). Defaults to ‘segarray’.
- arkouda.assert_arkouda_strings_equal(left, right, err_msg=None, check_same=None, obj: str = 'Strings', index_values=None) None[source]¶
Check that two ak.Strings arrays are equivalent.
- Parameters:
left (Strings) – The first Strings object to compare.
right (Strings) – The second Strings object to compare.
err_msg (str or None) – Custom assertion message. Defaults to None.
check_same ({'copy', 'same'} or None) – If not None, assert whether left and right share the same memory. - ‘copy’: assert that they do not share memory - ‘same’: assert that they do share memory Defaults to None.
obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘Strings’.
index_values (Index or pdarray or None) – Optional index shared by both arrays, used in output. Defaults to None.
- arkouda.assert_attr_equal(attr: str, left, right, obj: str = 'Attributes') None[source]¶
Check that attributes are equal. Both objects must have the given attribute.
- arkouda.assert_categorical_equal(left, right, check_dtype: bool = True, check_category_order: bool = True, obj: str = 'Categorical') None[source]¶
Test that Categoricals are equivalent.
- Parameters:
left (Categorical) – The first Categorical to compare.
right (Categorical) – The second Categorical to compare.
check_dtype (bool) – Whether to check that the integer dtype of the codes is the same. Defaults to True.
check_category_order (bool) – Whether to compare the order of the categories (which implies identical integer codes). If False, only the resulting values are compared. The ordered attribute is always checked. Defaults to True.
obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘Categorical’.
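The check_category_order distinction can be sketched with a pandas analogue: two categoricals holding the same decoded values but differently ordered categories have different integer codes.

```python
import numpy as np
import pandas as pd

# Same values, different category order -> codes differ, decoded values agree.
left = pd.Categorical(["a", "b", "a"], categories=["a", "b"])
right = pd.Categorical(["a", "b", "a"], categories=["b", "a"])

codes_match = np.array_equal(left.codes, right.codes)               # order-sensitive
values_match = np.array_equal(np.asarray(left), np.asarray(right))  # order-insensitive
print(codes_match, values_match)  # False True
```

This mirrors why check_category_order=True implies comparing identical integer codes, while check_category_order=False compares only the resulting values.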
- arkouda.assert_class_equal(left, right, exact: bool = True, obj: str = 'Input') None[source]¶
Check classes are equal.
- arkouda.assert_contains_all(iterable, dic) None[source]¶
Assert that a dictionary contains all the elements of an iterable.
- Parameters:
iterable (iterable)
dic (dict)
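A minimal pure-Python sketch of the documented behavior (the helper name is illustrative, not Arkouda's implementation): every element of the iterable must appear as a key of the dict.

```python
def contains_all(iterable, dic):
    # True when every element of the iterable appears as a key in the dict
    return all(item in dic for item in iterable)

print(contains_all(["a", "b"], {"a": 1, "b": 2, "c": 3}))  # True
print(contains_all(["a", "z"], {"a": 1}))                  # False
```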
- arkouda.assert_copy(iter1, iter2, **eql_kwargs) None[source]¶
Check that the elements are equal, but not the same object.
Does not check that items in sequences are also not the same object.
- Parameters:
iter1 (iterable) – Iterables that produce elements comparable with assert_almost_equal.
iter2 (iterable) – Iterables that produce elements comparable with assert_almost_equal.
- arkouda.assert_dict_equal(left, right, compare_keys: bool = True) None[source]¶
Assert that two dictionaries are equal.
Values must be arkouda objects.
- Parameters:
left (dict) – The dictionaries to be compared.
right (dict) – The dictionaries to be compared.
compare_keys (bool) – Whether to compare the keys. Defaults to True. If False, only the values are compared.
- arkouda.assert_equal(left, right, **kwargs) None[source]¶
Wrap tm.assert_*_equal to dispatch to the appropriate test function.
- arkouda.assert_equivalent(left, right, **kwargs) None[source]¶
Dispatch to the appropriate assertion function depending on object types.
- Parameters:
left (Any) – First object to compare. Type determines which assertion function is used.
right (Any) – Second object to compare.
**kwargs (dict) – Keyword arguments passed to the specific assertion function.
- Raises:
AssertionError – If values are not equivalent.
Examples
>>> import arkouda as ak
>>> import pandas as pd
>>> from arkouda.testing import assert_equivalent
>>> ak_series = ak.Series([1, 2, 3])
>>> pd_series = pd.Series([1, 2, 3])
>>> assert_equivalent(ak_series, pd_series)
- arkouda.assert_frame_equal(left: arkouda.DataFrame, right: arkouda.DataFrame, check_dtype: bool = True, check_index_type: bool = True, check_column_type: bool | Literal['equiv'] = 'equiv', check_frame_type: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_like: bool = False, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'DataFrame') None[source]¶
Check that left and right DataFrame are equal.
This function is intended to compare two DataFrames and output any differences. It is mostly intended for use in unit tests. Additional parameters allow varying the strictness of the equality checks performed.
- Parameters:
left (DataFrame) – First DataFrame to compare.
right (DataFrame) – Second DataFrame to compare.
check_dtype (bool) – Whether to check the DataFrame dtype is identical. Defaults to True.
check_index_type (bool) – Whether to check the Index class, dtype, and inferred_type are identical. Defaults to True.
check_column_type (bool or {'equiv'}) – Whether to check the column class, dtype, and inferred_type are identical. Passed as the exact argument of assert_index_equal(). Defaults to 'equiv'.
check_frame_type (bool) – Whether to check the DataFrame class is identical. Defaults to True.
check_names (bool) – Whether to check that the names attribute for both the index and column attributes of the DataFrame is identical. Defaults to True.
check_exact (bool) – Whether to compare numbers exactly. Defaults to True.
check_categorical (bool) – Whether to compare internal Categoricals exactly. Defaults to True.
check_like (bool) – If True, ignore the order of index and columns. Note: index labels must match their respective rows (as in columns); same labels must be with the same data. Defaults to False.
rtol (float) – Relative tolerance. Only used when check_exact is False. Defaults to 1e-5.
atol (float) – Absolute tolerance. Only used when check_exact is False. Defaults to 1e-8.
obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘DataFrame’.
See also
assert_series_equalEquivalent method for asserting Series equality.
Examples
>>> import arkouda as ak
This example shows comparing two DataFrames that are equal but with columns of differing dtypes.
>>> from arkouda.testing import assert_frame_equal
>>> df1 = ak.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df2 = ak.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

df1 equals itself.

>>> assert_frame_equal(df1, df1)

df1 differs from df2 as column 'b' is of a different type.

>>> assert_frame_equal(df1, df2)  # doctest: +SKIP
Traceback (most recent call last):
...
AssertionError: Attributes of DataFrame.iloc[:, 1] (column name="b") are different

Attribute "dtype" are different
[left]:  int64
[right]: float64
Ignore differing dtypes in columns with check_dtype.
>>> assert_frame_equal(df1, df2, check_dtype=False)
- arkouda.assert_frame_equivalent(left: arkouda.DataFrame | pandas.DataFrame, right: arkouda.DataFrame | pandas.DataFrame, check_dtype: bool = True, check_index_type: bool = True, check_column_type: bool = True, check_frame_type: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_like: bool = False, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'DataFrame') None[source]¶
Check that two DataFrames are equal.
This function compares two DataFrames and raises an assertion if they differ. It is intended primarily for use in unit tests. pandas DataFrames are converted to Arkouda equivalents before comparison.
- Parameters:
left (DataFrame or pd.DataFrame) – First DataFrame to compare.
right (DataFrame or pd.DataFrame) – Second DataFrame to compare.
check_dtype (bool) – Whether to check that dtypes are identical. Default is True.
check_index_type (bool) – Whether to check that index class, dtype, and inferred type are identical. Default is True.
check_column_type (bool) – Whether to check that column class, dtype, and inferred type are identical. Default is True.
check_frame_type (bool) – Whether to check that the DataFrame class is identical. Default is True.
check_names (bool) – Whether to check that the index and column names are identical. Default is True.
check_exact (bool) – Whether to compare values exactly. Default is True.
check_categorical (bool) – Whether to compare internal categoricals exactly. Default is True.
check_like (bool) – Whether to ignore the order of index and columns. Labels must still match their data. Default is False.
rtol (float) – Relative tolerance used when check_exact is False. Default is 1e-5.
atol (float) – Absolute tolerance used when check_exact is False. Default is 1e-8.
obj (str) – Object name used in error messages. Default is “DataFrame”.
- Raises:
TypeError – If either input is not a DataFrame or pd.DataFrame.
See also
Examples
>>> import arkouda as ak
>>> import pandas as pd
>>> from arkouda.testing import assert_frame_equivalent
>>> df1 = ak.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

Fails because dtypes are different:

>>> assert_frame_equivalent(df1, df2)  # doctest: +SKIP
- arkouda.assert_index_equal(left: arkouda.Index, right: arkouda.Index, exact: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Index') None[source]¶
Check that left and right Index are equal.
- Parameters:
left (Index) – The first Index to compare.
right (Index) – The second Index to compare.
exact (bool) – Whether to check that the Index class, dtype, and inferred_type are identical. Defaults to True.
check_names (bool) – Whether to check the name attribute. Defaults to True.
check_exact (bool) – Whether to compare numbers exactly. Defaults to True.
check_categorical (bool) – Whether to compare internal Categorical values exactly. Defaults to True.
check_order (bool) – Whether to compare the order of index entries as well as their values. If True, both indexes must contain the same elements, in the same order. If False, both indexes must contain the same elements, but in any order. Defaults to True.
rtol (float) – Relative tolerance. Only used when check_exact is False. Defaults to 1e-5.
atol (float) – Absolute tolerance. Only used when check_exact is False. Defaults to 1e-8.
obj (str) – A name for the object being compared, used in assertion messages. Defaults to ‘Index’.
Examples
>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> a = ak.Index([1, 2, 3])
>>> b = ak.Index([1, 2, 3])
>>> tm.assert_index_equal(a, b)
- arkouda.assert_index_equivalent(left: arkouda.Index | pandas.Index, right: arkouda.Index | pandas.Index, exact: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Index') None[source]¶
Check that two Index objects are equal.
If the objects are pandas Index, they are converted to Arkouda Index. Then assert_index_equal is applied to the result.
- Parameters:
left (Index or pd.Index) – First Index to compare.
right (Index or pd.Index) – Second Index to compare.
exact (bool) – Whether to check that class, dtype, and inferred type are identical. Default is True.
check_names (bool) – Whether to check the names attribute. Default is True.
check_exact (bool) – Whether to compare values exactly. Default is True.
check_categorical (bool) – Whether to compare internal Categoricals exactly. Default is True.
check_order (bool) – Whether to require identical order in index values. Default is True.
rtol (float) – Relative tolerance used when check_exact is False. Default is 1e-5.
atol (float) – Absolute tolerance used when check_exact is False. Default is 1e-8.
obj (str) – Object name used in error messages. Default is “Index”.
- Raises:
TypeError – If either input is not an Index or pd.Index.
See also
Examples
>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> import pandas as pd
>>> a = ak.Index([1, 2, 3])
>>> b = pd.Index([1, 2, 3])
>>> tm.assert_index_equivalent(a, b)
- arkouda.assert_series_equal(left, right, check_dtype: bool = True, check_index_type: bool = True, check_series_type: bool = True, check_names: bool = True, check_exact: bool = False, check_categorical: bool = True, check_category_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Series', *, check_index: bool = True, check_like: bool = False) None[source]¶
Check that left and right Series are equal.
- Parameters:
left (Series) – First Series to compare.
right (Series) – Second Series to compare.
check_dtype (bool) – Whether to check the Series dtype is identical. Defaults to True.
check_index_type (bool) – Whether to check the Index class, dtype, and inferred_type are identical. Defaults to True.
check_series_type (bool) – Whether to check that the Series class is identical. Defaults to True.
check_names (bool) – Whether to check the Series and Index name attribute. Defaults to True.
check_exact (bool) – Whether to compare numbers exactly. Defaults to False.
check_categorical (bool) – Whether to compare internal Categoricals exactly. Defaults to True.
check_category_order (bool) – Whether to compare the category order of internal Categoricals. Defaults to True.
rtol (float) – Relative tolerance. Only used when check_exact is False. Defaults to 1e-5.
atol (float) – Absolute tolerance. Only used when check_exact is False. Defaults to 1e-8.
obj (str) – Name of the object being compared, used in assertion messages. Defaults to ‘Series’.
check_index (bool) – Whether to check index equivalence. If False, only the values are compared. Defaults to True.
check_like (bool) – If True, ignore the order of the index. Must be False if check_index is False. Note: same labels must be with the same data. Defaults to False.
Examples
>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> a = ak.Series([1, 2, 3, 4])
>>> b = ak.Series([1, 2, 3, 4])
>>> tm.assert_series_equal(a, b)
- arkouda.assert_series_equivalent(left: arkouda.Series | pandas.Series, right: arkouda.Series | pandas.Series, check_dtype: bool = True, check_index_type: bool = True, check_series_type: bool = True, check_names: bool = True, check_exact: bool = False, check_categorical: bool = True, check_category_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Series', *, check_index: bool = True, check_like: bool = False) None[source]¶
Check that two Series are equal.
This function compares two Series and raises an assertion if they differ. pandas Series are converted to Arkouda equivalents before comparison. The comparison can be customized using the provided keyword arguments.
- Parameters:
left (Series or pd.Series) – First Series to compare.
right (Series or pd.Series) – Second Series to compare.
check_dtype (bool) – Whether to check that dtypes are identical. Default is True.
check_index_type (bool) – Whether to check that index class, dtype, and inferred type are identical. Default is True.
check_series_type (bool) – Whether to check that the Series class is identical. Default is True.
check_names (bool) – Whether to check that the Series and Index name attributes are identical. Default is True.
check_exact (bool) – Whether to compare numbers exactly. Default is False.
check_categorical (bool) – Whether to compare internal Categoricals exactly. Default is True.
check_category_order (bool) – Whether to compare category order in internal Categoricals. Default is True.
rtol (float) – Relative tolerance used when check_exact is False. Default is 1e-5.
atol (float) – Absolute tolerance used when check_exact is False. Default is 1e-8.
obj (str) – Object name used in error messages. Default is “Series”.
check_index (bool) – Whether to check index equivalence. If False, only values are compared. Default is True.
check_like (bool) – If True, ignore the order of the index. Must be False if check_index is False. Note: identical labels must still correspond to the same data. Default is False.
- Raises:
TypeError – If either input is not a Series or pd.Series.
See also
Examples
>>> import arkouda as ak
>>> from arkouda import testing as tm
>>> import pandas as pd
>>> a = ak.Series([1, 2, 3, 4])
>>> b = pd.Series([1, 2, 3, 4])
>>> tm.assert_series_equivalent(a, b)
- arkouda.chisquare(f_obs, f_exp=None, ddof=0)[source]¶
Compute the chi-square statistic and p-value.
- Parameters:
- Return type:
arkouda.akstats.Power_divergenceResult
Examples
>>> import arkouda as ak
>>> from arkouda.scipy import chisquare
>>> chisquare(ak.array([10, 20, 30, 10]), ak.array([10, 30, 20, 10]))
Power_divergenceResult(statistic=np.float64(8.333333333333334), pvalue=np.float64(0.03960235520...))
See also
scipy.stats.chisquare, arkouda.akstats.power_divergence
References
[1] “Chi-squared test”, https://en.wikipedia.org/wiki/Chi-squared_test
[2] “scipy.stats.chisquare”, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
- arkouda.compute_join_size(a: arkouda.numpy.pdarrayclass.pdarray, b: arkouda.numpy.pdarrayclass.pdarray) Tuple[int, int][source]¶
Compute the internal size of a hypothetical join between a and b. Returns both the number of elements and number of bytes required for the join.
- arkouda.date_operators(cls)[source]¶
Add common datetime operation methods to a DatetimeAccessor class.
This class decorator dynamically attaches datetime operations (floor, ceil, round) to the given class using the _make_op helper.
- Parameters:
cls (type) – The accessor class to decorate.
- Returns:
The accessor class with datetime methods added.
- Return type:
Notes
Used internally to implement the .dt accessor API.
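The class-decorator pattern described above can be sketched as follows; DemoAccessor and the string-returning op body are hypothetical stand-ins, not Arkouda's internals (the real accessor dispatches datetime operations to the server):

```python
def _make_op(name):
    # Build one named operation; here it just reports what it would do.
    def op(self, freq):
        return f"{name}({freq})"
    op.__name__ = name
    return op

def date_operators(cls):
    # Attach floor, ceil, and round methods to the accessor class.
    for name in ("floor", "ceil", "round"):
        setattr(cls, name, _make_op(name))
    return cls

@date_operators
class DemoAccessor:
    pass

print(DemoAccessor().floor("D"))   # floor(D)
print(DemoAccessor().round("h"))   # round(h)
```

Decorating the class once keeps the three near-identical methods out of the class body and guarantees they stay in sync.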
- arkouda.disable_verbose(logLevel: LogLevel = LogLevel.INFO) None[source]¶
Disables verbose logging.
Disables verbose logging (DEBUG log level) for all ArkoudaLoggers, setting the log level for each to the logLevel parameter.
- Parameters:
logLevel (LogLevel) – The new log level, defaults to LogLevel.INFO
- Raises:
TypeError – Raised if logLevel is not a LogLevel enum
- arkouda.enable_verbose() None[source]¶
Enable verbose logging (DEBUG log level) for all ArkoudaLoggers.
- arkouda.export(read_path: str, dataset_name: str = 'ak_data', write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]¶
Export data from arkouda to pandas.
Export data from an Arkouda file (Parquet/HDF5) to a pandas object or a file formatted to be readable by pandas.
- Parameters:
read_path (str) – path to file where arkouda data is stored.
dataset_name (str) – name to store dataset under
write_file (str) – path to file to write pandas formatted data to. Only write the file if this is set. Default is None.
return_obj (bool) – When True (default) return the Pandas DataFrame object, otherwise return None.
index (bool) – Default False. When True, maintain the indexes loaded from the pandas file
- Raises:
RuntimeError –
Unsupported file type
- Returns:
When return_obj=True
- Return type:
pd.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.import_data
Notes
If an Arkouda file is exported for pandas, the format will not change. This means Parquet files will remain Parquet and HDF5 will remain HDF5.
Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.
- arkouda.from_series(series: pandas.Series, dtype: type | str | None = None) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings[source]¶
Convert a pandas Series to an Arkouda pdarray or Strings.
If dtype is not provided, the dtype is inferred from the pandas Series (using pandas dtype metadata). If dtype is provided, it is used as an override and normalized via Arkouda's dtype resolution rules.
In addition to the core numeric and boolean types, this function supports datetime and timedelta Series of any resolution (ns, us, ms, etc.) by converting them to an int64 pdarray of nanoseconds.
- Parameters:
series (pd.Series) – The pandas Series to convert.
dtype (Optional[Union[type, str]], optional) – Optional dtype override. This may be a Python type (e.g. bool), a NumPy scalar type (e.g. np.int64), or a dtype string. String-like spellings are normalized to the Arkouda string dtype, including "object", "str", "string", "string[python]", and "string[pyarrow]".
- Returns:
An Arkouda pdarray for numeric, boolean, datetime, or timedelta inputs, or an Arkouda Strings for string inputs.
- Return type:
- Raises:
ValueError – Raised if the dtype cannot be interpreted or is unsupported for conversion.
Examples
>>> import arkouda as ak
>>> import numpy as np
>>> import pandas as pd
Integers:
>>> np.random.seed(1701)
>>> ak.from_series(pd.Series(np.random.randint(0, 10, 5)))
array([4 3 3 5 0])
>>> ak.from_series(pd.Series(['1', '2', '3', '4', '5']), dtype=np.int64)
array([1 2 3 4 5])
Floats:
>>> np.random.seed(1701)
>>> ak.from_series(pd.Series(np.random.uniform(low=0.0, high=1.0, size=3)))
array([0.089433234324597599 0.1153776854774361 0.51874393620990389])
Booleans:
>>> np.random.seed(1864)
>>> ak.from_series(pd.Series(np.random.choice([True, False], size=5)))
array([True True True False False])
Strings (pandas dtype spellings normalized to Arkouda Strings):

>>> ak.from_series(pd.Series(['a', 'b', 'c', 'd', 'e'], dtype="string"))
array(['a', 'b', 'c', 'd', 'e'])
>>> ak.from_series(pd.Series(['a', 'b', 'c'], dtype="string[pyarrow]"))
array(['a', 'b', 'c'])
Datetime (any resolution is accepted and returned as int64 nanoseconds):

>>> ak.from_series(pd.Series(pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01')])))
array([1514764800000000000 1514764800000000000])
Notes
Datetime and timedelta Series are converted to int64 nanoseconds.
String-like pandas dtypes (including object) are treated as string and converted to Arkouda Strings.
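The nanosecond normalization noted above can be sketched on the client side with pandas alone (a sketch of the conversion rule, not Arkouda's implementation):

```python
import pandas as pd

# Datetime values of any resolution are normalized to int64 nanoseconds
# since the epoch before being sent to the server.
s = pd.Series(pd.to_datetime(["2018-01-01"]))
ns = s.astype("datetime64[ns]").astype("int64").to_numpy()
print(ns)  # [1514764800000000000]
```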
- arkouda.gen_ranges(starts, ends, stride=1, return_lengths=False)[source]¶
Generate a segmented array of variable-length, contiguous ranges between pairs of start- and end-points.
- Parameters:
- Returns:
- segmentspdarray, int64
The starting index of each range in the resulting array
- rangespdarray, int64
The actual ranges, flattened into a single array
- lengthspdarray, int64
The lengths of each segment. Only returned if return_lengths=True.
- Return type:
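The segmented construction can be sketched in NumPy (a client-side analogue, assuming exclusive end-points in the ak.arange style; gen_ranges itself builds these server-side):

```python
import numpy as np

def gen_ranges_np(starts, ends, stride=1):
    # Length of each range, via ceiling division.
    lengths = (ends - starts + stride - 1) // stride
    # Starting index of each range within the flattened output.
    segments = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    # The flattened, contiguous ranges themselves.
    ranges = np.concatenate([np.arange(s, e, stride) for s, e in zip(starts, ends)])
    return segments, ranges

segs, rng = gen_ranges_np(np.array([0, 10]), np.array([3, 12]))
print(segs)  # [0 3]
print(rng)   # [ 0  1  2 10 11]
```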
- arkouda.get_columns(filenames: str | List[str], col_delim: str = ',', allow_errors: bool = False) List[str][source]¶
Get a list of column names from CSV file(s).
- arkouda.get_datasets(filenames: str | List[str], allow_errors: bool = False, column_delim: str = ',', read_nested: bool = True) List[str][source]¶
Get the names of the datasets in the provided files.
- Parameters:
filenames (str or List[str]) – Name of the file/s from which to return datasets
allow_errors (bool) – Default: False Whether or not to allow errors while accessing datasets
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.
- Return type:
- Raises:
RuntimeError –
If no datasets are returned
Notes
This function currently supports HDF5 and Parquet formats. Future updates to Parquet will deprecate this functionality on that format, but similar support will be added for Parquet at that time.
If a list of files is provided, only the datasets in the first file will be returned.
See also
- arkouda.get_filetype(filenames: str | List[str]) str[source]¶
Get the type of a file accessible to the server.
Supported file types and possible return strings are ‘HDF5’ and ‘Parquet’.
- Parameters:
filenames (Union[str, List[str]]) – A file or list of files visible to the arkouda server
- Returns:
Type of the file returned as a string: 'HDF5', 'Parquet', or 'CSV'
- Return type:
- Raises:
ValueError – Raised if filename is empty or contains only whitespace
Notes
When list provided, it is assumed that all files are the same type
CSV Files without the Arkouda Header are not supported
See also
- arkouda.get_null_indices(filenames: str | List[str], datasets: str | List[str] | None = None) arkouda.numpy.pdarrayclass.pdarray | Mapping[str, arkouda.numpy.pdarrayclass.pdarray][source]¶
Get null indices of a string column in a Parquet file.
- Parameters:
- Returns:
Dictionary of {datasetName: pdarray}
- Return type:
pdarray or Mapping[str, pdarray]
- Raises:
RuntimeError – Raised if one or more of the specified files cannot be opened.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
See also
- arkouda.import_data(read_path: str, write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]¶
Import data from a file saved by Pandas (HDF5/Parquet).
Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.
- Parameters:
read_path (str) – path to file where pandas data is stored. This can be glob expression for parquet formats.
write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided
return_obj (bool) – If True (default), return the Arkouda DataFrame object. If False, return None.
index (bool) – If True, maintain the indexes loaded from the pandas file. Default is False.
- Raises:
RuntimeWarning –
Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.
RuntimeError –
Unsupported file type
- Returns:
The DataFrame, when return_obj=True; otherwise None.
- Return type:
pd.DataFrame
See also
pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.read_parquet, pandas.read_hdf, ak.export
Notes
Import can only be performed from hdf5 or parquet files written by pandas.
- arkouda.information(names: List[str] | str = RegisteredSymbols) str[source]¶
Return a JSON formatted string containing information about the objects in names.
- Parameters:
names (Union[List[str], str]) – The name of an object or a list of object names to retrieve info for. If names is ak.AllSymbols, retrieves info for all symbols in the symbol table; if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry.
- Returns:
JSON formatted string containing a list of information for each object in names
- Return type:
- Raises:
RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names
- arkouda.intersect(a, b, positions=True, unique=False)[source]¶
Find the intersection of two arkouda arrays.
This function can be especially useful when positions=True so that the caller gets the indices of values present in both arrays.
- Parameters:
positions (bool, default=True) – Return tuple of boolean pdarrays that indicate positions in a and b of the intersection values.
unique (bool, default=False) – If the number of distinct values in a (and b) is equal to the size of a (and b), there is a more efficient method to compute the intersection.
- Returns:
(arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray) or
arkouda.numpy.pdarrayclass.pdarray – The indices of a and b where any element occurs at least once in both arrays.
Examples
>>> import arkouda as ak
>>> a = ak.arange(10)
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
>>> b = 2 * ak.arange(10)
>>> print(b)
[0 2 4 6 8 10 12 14 16 18]
>>> intersect(a,b, positions=True)
(array([True False True False True False True False True False]), array([True True True True True False False False False False]))
>>> intersect(a,b, positions=False)
array([0 2 4 6 8])
- arkouda.intx(a, b)[source]¶
Find all the rows that are in both dataframes.
Columns should be in identical order.
Note: this does not work for columns of floating-point values, but it does work for Strings, pdarrays of int64 type, and Categorical.
Examples
>>> import arkouda as ak
>>> a = ak.DataFrame({'a':ak.arange(5),'b': 2* ak.arange(5)})
>>> a
   a  b
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8
 (5 rows x 2 columns)
>>> b = ak.DataFrame({'a':ak.arange(5),'b':ak.array([0,3,4,7,8])})
>>> b
   a  b
0  0  0
1  1  3
2  2  4
3  3  7
4  4  8
 (5 rows x 2 columns)
>>> intx(a,b)
array([True False True False True])
>>> intersect_df = a[intx(a,b)]
>>> intersect_df
   a  b
0  0  0
2  2  4
4  4  8
 (3 rows x 2 columns)
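The row-membership test above can be sketched in plain NumPy (a brute-force analogue for illustration, not the server's implementation):

```python
import numpy as np

def intx_mask(a_cols, b_cols):
    """NumPy sketch of ak.intx: for each row of a (read as a tuple of
    column values, with columns in identical order for both frames),
    True when that row appears anywhere in b."""
    b_rows = set(zip(*b_cols))
    return np.array([row in b_rows for row in zip(*a_cols)])

a_cols = [np.arange(5), 2 * np.arange(5)]
b_cols = [np.arange(5), np.array([0, 3, 4, 7, 8])]
mask = intx_mask(a_cols, b_cols)
# rows (0,0), (2,4), (4,8) are shared, matching the example above
```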
- arkouda.invert_permutation(perm)[source]¶
Find the inverse of a permutation array.
- Parameters:
perm (pdarray) – The permutation array.
- Returns:
The inverse of the permutation array.
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda.pandas.index import Index
>>> i = Index(ak.array([1,2,0,5,4]))
>>> perm = i.argsort()
>>> print(perm)
[2 0 1 4 3]
>>> invert_permutation(perm)
array([1 2 0 4 3])
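The underlying identity is that the inverse permutation satisfies inv[perm[i]] = i, so a single scatter builds it; a NumPy sketch:

```python
import numpy as np

def invert_perm(perm):
    """Build the inverse permutation by scattering positions:
    inv[perm[i]] = i, so perm[inv] (and inv[perm]) is the identity."""
    inv = np.empty_like(perm)
    inv[perm] = np.arange(len(perm))
    return inv

perm = np.array([2, 0, 1, 4, 3])
inv = invert_perm(perm)            # [1 2 0 4 3], as in the example above
```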
- arkouda.join_on_eq_with_dt(a1: arkouda.numpy.pdarrayclass.pdarray, a2: arkouda.numpy.pdarrayclass.pdarray, t1: arkouda.numpy.pdarrayclass.pdarray, t2: arkouda.numpy.pdarrayclass.pdarray, dt: int | numpy.int64, pred: str, result_limit: int | numpy.int64 = 1000) Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]¶
Inner-join on equality between two integer arrays where the time-window predicate is also true.
- Parameters:
a1 (pdarray) – Values to join (must be int64 dtype).
a2 (pdarray) – Values to join (must be int64 dtype).
t1 (pdarray) – timestamps in millis corresponding to the a1 pdarray
t2 (pdarray) – timestamps in millis corresponding to the a2 pdarray
dt (Union[int,np.int64]) – time delta
pred (str) – time window predicate
result_limit (Union[int,np.int64]) – size limit for returned result
- Returns:
- result_array_one : pdarray, int64
a1 indices where a1 == a2
- result_array_two : pdarray, int64
a2 indices where a2 == a1
- Return type:
- Raises:
TypeError – Raised if a1, a2, t1, or t2 is not a pdarray, or if dt or result_limit is not an int
ValueError – Raised if the dtype of a1, a2, t1, or t2 is not int64, if pred is not ‘true_dt’, ‘abs_dt’, or ‘pos_dt’, or if result_limit is < 0
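A brute-force sketch of the join semantics (the server uses a scalable distributed algorithm; the exact predicate meanings below are assumptions for illustration):

```python
import numpy as np

def join_eq_dt(a1, a2, t1, t2, dt, pred="abs_dt", result_limit=1000):
    """Naive equality join with a time-window predicate. Assumed predicate
    semantics: 'abs_dt' keeps equal pairs with |t2 - t1| <= dt, 'pos_dt'
    keeps pairs with 0 <= t2 - t1 <= dt, 'true_dt' keeps every equal pair."""
    left, right = [], []
    for i in range(len(a1)):
        for j in range(len(a2)):
            if a1[i] != a2[j] or len(left) >= result_limit:
                continue
            delta = t2[j] - t1[i]
            if (pred == "true_dt"
                    or (pred == "abs_dt" and abs(delta) <= dt)
                    or (pred == "pos_dt" and 0 <= delta <= dt)):
                left.append(i)
                right.append(j)
    return np.array(left), np.array(right)

a1 = np.array([1, 2, 3]); t1 = np.array([0, 10, 20])
a2 = np.array([2, 3, 4]); t2 = np.array([12, 100, 30])
# only a1[1] == a2[0] falls inside the +/-5 ms window
li, ri = join_eq_dt(a1, a2, t1, t2, dt=5)
```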
- arkouda.list_registry(detailed: bool = False)[source]¶
Return a list containing the names of all registered objects.
- Parameters:
detailed (bool, default=False) – If True, return details of registry objects. Currently includes the object type for any objects.
- Returns:
Dict containing keys “Components” and “Objects”.
- Return type:
dict
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.list_symbol_table() List[str][source]¶
Return a list containing the names of all objects in the symbol table.
- Returns:
List of all object names in the symbol table
- Return type:
list
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.load(path_prefix: str, file_format: str = 'INFER', dataset: str = 'array', calc_string_offsets: bool = False, column_delim: str = ',') Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]¶
Load objects previously saved with pdarray.save().
- Parameters:
path_prefix (str) – Filename prefix used when saving the original object.
file_format (str, default="INFER") – File format to load. One of "INFER", "HDF5", or "Parquet". If "INFER", the format will be detected automatically.
dataset (str, default="array") – Dataset name where the object was saved.
calc_string_offsets (bool, default=False) – If True, the server ignores the segmented Strings offsets array and derives offsets from null-byte terminators.
column_delim (str, default=",") – Column delimiter used if the dataset is CSV. Otherwise unused.
- Returns:
Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]
- Raises:
TypeError – Raised if either path_prefix or dataset is not a str.
ValueError – Raised if an invalid file_format is given, if the dataset is not present in all HDF5 files, or if path_prefix does not correspond to files accessible to Arkouda.
RuntimeError – Raised if the HDF5 files are present but an error occurs while opening one or more of them.
See also
Notes
If a previously saved Parquet file raises a FileNotFoundError, try loading it with .parquet appended to path_prefix. Older versions of Arkouda always stored Parquet files with a .parquet extension.
ak.load does not support loading a single file. To load a single HDF5 file without the _LOCALE#### suffix, use ak.read().
CSV files without the Arkouda header are not supported.
Examples
>>> import arkouda as ak
Loading from file without extension:
>>> obj = ak.load("path/prefix")
This loads the array from numLocales files with the name cwd/path/name_prefix_LOCALE####. The file type is inferred automatically.
Loading with an extension (HDF5):
>>> obj = ak.load("path/prefix.test")
This loads the object from numLocales files with the name cwd/path/name_prefix_LOCALE####.test, where #### corresponds to each locale number. Because the file type is inferred, the extension does not need to correspond to a specific format.
- arkouda.load_all(path_prefix: str, file_format: str = 'INFER', column_delim: str = ',', read_nested: bool = True) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]¶
Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().
- Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str, default="INFER") – ‘INFER’, ‘HDF5’, ‘Parquet’, or ‘CSV’. Indicates the format being loaded. When ‘INFER’, the processing will detect the format automatically.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Parquet files only
- Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]
- Raises:
TypeError – Raised if path_prefix is not a str
ValueError – Raised if a file_format/extension other than HDF5 or Parquet is encountered, if not all datasets are present in all HDF5/Parquet files, or if path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them
See also
to_parquet, to_hdf, load, read
Notes
This function determines the file extension based on the file_format variable
This function will be deprecated when glob flags are added to read_* methods
CSV files without the Arkouda Header are not supported.
- arkouda.load_checkpoint(name, path='.akdata')[source]¶
Load server’s state.
The server metadata must match the current configuration (e.g. same number of locales must be used).
- Parameters:
name (str) – Name of the checkpoint to load.
path (str, default=".akdata") – The directory to read the checkpoint from.
- Returns:
The checkpoint name, which will be the same as the name argument.
- Return type:
str
Examples
>>> import arkouda as ak
>>> arr = ak.zeros(10, int)
>>> arr[2] = 2
>>> arr[2]
np.int64(2)
>>> cp_name = ak.save_checkpoint()
>>> arr[2] = 3
>>> arr[2]
np.int64(3)
>>> ak.load_checkpoint(cp_name)
>>> arr[2]
np.int64(2)
See also
- arkouda.ls(filename: str, col_delim: str = ',', read_nested: bool = True) List[str][source]¶
List the contents of an HDF5 or Parquet file on the Arkouda server.
This function invokes the HDF5 h5ls utility on a file visible to the Arkouda server, or simulates a similar listing for Parquet files. For CSV files without headers, see ls_csv.
- Parameters:
filename (str) – Path to the file on the Arkouda server. Must be a non-empty string.
col_delim (str) – Delimiter to use when interpreting CSV files. Default is “,”.
read_nested (bool) – If True, include nested Parquet columns (e.g., SegArray). If False, nested columns are ignored. Only applies to Parquet files. Default is True.
- Returns:
A list of lines describing each dataset or column in the file.
- Return type:
List[str]
- Raises:
TypeError – If filename is not a string.
ValueError – If filename is empty or contains only whitespace.
RuntimeError – If an error occurs when running h5ls or simulating the Parquet listing.
Notes
Parquet support is limited and may change in future releases.
Output lines mirror the format of the HDF5 h5ls output.
For CSV files lacking headers, use ls_csv.
See also
ls_csv : List the contents of CSV files without headers.
- arkouda.ls_csv(filename: str, col_delim: str = ',') List[str][source]¶
List the datasets within a file when a CSV does not have a header.
- Parameters:
filename (str) – Path to the CSV file on the Arkouda server.
col_delim (str) – Delimiter used within the CSV file. Default is “,”.
- Returns:
The string output of the datasets from the server
- Return type:
List[str]
See also
- arkouda.merge(left: DataFrame, right: DataFrame, on: str | List[str] | None = None, left_on: str | List[str] | None = None, right_on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]¶
Merge Arkouda DataFrames with a database-style join.
The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).
Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
- Parameters:
left (DataFrame) – The Left DataFrame to be joined.
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.
left_on (str or List of str, optional) – Column name or names to join on in the left DataFrame. If this is not None, then right_on must also not be None, and this will override on.
right_on (str or List of str, optional) – Column name or names to join on in the right DataFrame. If this is not None, then left_on must also not be None, and this will override on.
how (str, default = "inner") – The merge condition. Must be one of “inner”, “left”, “right”, or “outer”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.
- Returns:
Joined Arkouda DataFrame.
- Return type:
Note
Multiple column joins are only supported for integer columns.
Examples
>>> import arkouda as ak
>>> from arkouda import merge
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> left_df
   col1  col2
0     0     0
1     1    -1
2     2    -2
3     3    -3
4     4    -4
 (5 rows x 2 columns)
>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> right_df
   col1  col2
0     0     0
1     2     2
2     4     4
3     6     6
4     8     8
 (5 rows x 2 columns)
>>> merge(left_df, right_df, on = "col1")
   col1  col2_x  col2_y
0     0       0       0
1     2      -2       2
2     4      -4       4
 (3 rows x 3 columns)
>>> merge(left_df, right_df, on = "col1", how = "left")
   col1  col2_x  col2_y
0     0       0     0.0
1     1      -1     NaN
2     2      -2     2.0
3     3      -3     NaN
4     4      -4     4.0
 (5 rows x 3 columns)
>>> merge(left_df, right_df, on = "col1", how = "right")
   col1  col2_x  col2_y
0     0     0.0       0
1     2    -2.0       2
2     4    -4.0       4
3     6     NaN       6
4     8     NaN       8
 (5 rows x 3 columns)
>>> merge(left_df, right_df, on = "col1", how = "outer")
   col1  col2_x  col2_y
0     0     0.0     0.0
1     1    -1.0     NaN
2     2    -2.0     2.0
3     3    -3.0     NaN
4     4    -4.0     4.0
5     6     NaN     6.0
6     8     NaN     8.0
 (7 rows x 3 columns)
- arkouda.power_divergence(f_obs, f_exp=None, ddof=0, lambda_=None)[source]¶
Compute the power divergence statistic and p-value.
- Parameters:
f_obs (pdarray) – The observed frequency.
f_exp (pdarray, default = None) – The expected frequency.
ddof (int) – The delta degrees of freedom.
lambda_ (str, default = "pearson") –
The power in the Cressie-Read power divergence statistic. Allowed values: “pearson”, “log-likelihood”, “freeman-tukey”, “mod-log-likelihood”, “neyman”, “cressie-read”
Powers correspond as follows:
“pearson”: 1
“log-likelihood”: 0
“freeman-tukey”: -0.5
“mod-log-likelihood”: -1
“neyman”: -2
“cressie-read”: 2/3
- Return type:
arkouda.akstats.Power_divergenceResult
Examples
>>> import arkouda as ak
>>> from arkouda.scipy import power_divergence
>>> x = ak.array([10, 20, 30, 10])
>>> y = ak.array([10, 30, 20, 10])
>>> power_divergence(x, y, lambda_="pearson")
Power_divergenceResult(statistic=np.float64(8.333333333333334), pvalue=np.float64(0.03960235520...))
>>> power_divergence(x, y, lambda_="log-likelihood")
Power_divergenceResult(statistic=np.float64(8.109302162163285), pvalue=np.float64(0.04380595350...))
See also
scipy.stats.power_divergence, arkouda.akstats.chisquare
Notes
This is a modified version of scipy.stats.power_divergence [2] in order to scale using arkouda pdarrays.
References
[1] “scipy.stats.power_divergence”, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.power_divergence.html
[2] Scipy contributors (2024) scipy (Version v1.12.0) [Source code]. https://github.com/scipy/scipy
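The two example statistics above can be checked by hand with NumPy: the Pearson case (lambda_ = 1) is the familiar chi-squared statistic, and the log-likelihood case (lambda_ = 0) is the G-test statistic.

```python
import numpy as np

f_obs = np.array([10.0, 20.0, 30.0, 10.0])
f_exp = np.array([10.0, 30.0, 20.0, 10.0])

# Pearson: sum((obs - exp)^2 / exp) -> 8.3333...
pearson = np.sum((f_obs - f_exp) ** 2 / f_exp)

# Log-likelihood: 2 * sum(obs * log(obs / exp)) -> 8.1093...
log_lik = 2.0 * np.sum(f_obs * np.log(f_obs / f_exp))
```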
- arkouda.pretty_print_information(names: List[str] | str = RegisteredSymbols) None[source]¶
Print verbose information for each object in names in a human readable format.
- Parameters:
names (Union[List[str], str]) – The name of an object or a list of object names to retrieve info for. If names is ak.AllSymbols, retrieves info for all symbols in the symbol table; if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry.
- Raises:
RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names
- arkouda.read(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, column_delim: str = ',', read_nested: bool = True, has_non_float_nulls: bool = False, fixed_len: int = -1) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]¶
Read datasets from files.
The file type is determined automatically.
- Parameters:
filenames (Union[str, List[str]]) – Either a list of filenames or a shell expression.
datasets (Optional[Union[str, List[str]]], default=None) – Name or list of names of datasets to read. If
None, all available datasets are read.iterative (bool, default=False) – If
True, make iterative function calls to the server. IfFalse, make a single function call to the server.strictTypes (bool, default=True) – If
True, require all dtypes of a given dataset to have the same precision and sign. IfFalse, allow dtypes of different precision and sign across different files. For example, if one file contains auint32dataset and another contains anint64dataset with the same name, the contents of both will be read into anint64pdarray.allow_errors (bool, default=False) – If
True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.calc_string_offsets (bool, default=False) – If
True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files.column_delim (str, default=",") – Column delimiter to use if the dataset is CSV. Otherwise unused.
read_nested (bool, default=True) – If
True,SegArrayobjects are read from the file. IfFalse,SegArrayobjects and other nested Parquet columns are ignored. Ignored ifdatasetsis notNone. Parquet only.has_non_float_nulls (bool, default=False) – Must be set to
Trueto read non-float Parquet columns that contain null values.fixed_len (int, default=-1) – Fixed string length to use when reading Parquet string columns if the length of each string is known at runtime. This can avoid byte calculation and may improve performance.
- Returns:
Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]
- Raises:
RuntimeError – Raised if an invalid file type is detected.
See also
Notes
If
filenamesis a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded withglobto read all matching files.If
iterative=True, each dataset name and filename is passed to the server independently in sequence. Ifiterative=False, all dataset names and filenames are passed to the server in a single string.If
datasetsisNone, dataset names are inferred from the first file and all datasets are read. Useget_datasetsto show the names of datasets in HDF5 or Parquet files.CSV files without the Arkouda header are not supported.
Examples
>>> import arkouda as ak
Read a file with an extension:
>>> x = ak.read("path/name_prefix.h5")
The file type is determined from file contents, not the extension.
Read a Parquet file:
>>> x = ak.read("path/name_prefix.parquet")
Read files matching a glob expression:
>>> x = ak.read("path/name_prefix*")
- arkouda.read_csv(filenames: str | List[str], datasets: str | List[str] | None = None, column_delim: str = ',', allow_errors: bool = False) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]¶
Read CSV file(s) into Arkouda objects.
If more than one dataset is found, the objects will be returned in a dictionary mapping the dataset name to the Arkouda object containing the data. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings object.
- Parameters:
filenames (str or List[str]) – The filenames to read data from
datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.
column_delim (str) – The delimiter for column names and data. Defaults to “,”.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
- Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray]]
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
See also
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing bytes as uint8.
- arkouda.read_hdf(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, tag_data: bool = False) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]¶
Read Arkouda objects from HDF5 files.
- Parameters:
filenames (Union[str, List[str]]) – Filename or list of filenames to read objects from.
datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read from the provided files. If
None, all datasets are read.iterative (bool, default=False) – If
True, make iterative function calls to the server. IfFalse, make a single function call to the server.strict_types (bool, default=True) – If
True, require all dtypes of a given dataset to have the same precision and sign. IfFalse, allow dtypes of different precision and sign across different files. For example, if one file contains auint32dataset and another contains anint64dataset with the same name, the contents of both will be read into anint64pdarray.allow_errors (bool, default=False) – If
True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.calc_string_offsets (bool, default=False) – If
True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files. In the future, this option may become the default.tag_data (bool, default=False) – If
True, tag the returned data with the code associated with the filename from which it was read.
- Returns:
Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]
- Raises:
ValueError – Raised if not all datasets are present in all HDF5 files or if one or more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened. If
allow_errorsisTrue, this may also be raised if no values are returned from the server.TypeError – Raised if an unknown Arkouda type is returned from the server.
Notes
If
filenamesis a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded withglobto read all matching files.If
iterative=True, each dataset name and filename is passed to the server independently in sequence. Ifiterative=False, all dataset names and filenames are passed to the server in a single string.If
datasetsisNone, dataset names are inferred from the first file and all datasets are read. Useget_datasetsto show dataset names in HDF5 files.See also
Examples
>>> import arkouda as ak
Read a file with an extension:
>>> x = ak.read_hdf("path/name_prefix.h5")
Read files matching a glob expression:
>>> x = ak.read_hdf("path/name_prefix*")
- arkouda.read_parquet(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, tag_data: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False, null_handling: str | None = None, fixed_len: int = -1) Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray | arkouda.pandas.categorical.Categorical | arkouda.pandas.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.numpy.timeclass.Datetime | arkouda.numpy.timeclass.Timedelta | arkouda.pandas.index.Index][source]¶
Read Arkouda objects from Parquet files.
- Parameters:
filenames (Union[str, List[str]]) – Filename or list of filenames to read objects from.
datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read from the provided files. If
None, all datasets are read.iterative (bool, default=False) – If
True, make iterative function calls to the server. IfFalse, make a single function call to the server.strict_types (bool, default=True) – If
True, require all dtypes of a given dataset to have the same precision and sign. IfFalse, allow dtypes of different precision and sign across different files. For example, if one file contains auint32dataset and another contains anint64dataset with the same name, the contents of both will be read into anint64pdarray.allow_errors (bool, default=False) – If
True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.tag_data (bool, default=False) – If
True, tag the data with the code associated with the filename from which the data was read.read_nested (bool, default=True) – If
True,SegArrayobjects are read from the file. IfFalse,SegArrayobjects and other nested Parquet columns are ignored. Ifdatasetsis notNone, this parameter is ignored.has_non_float_nulls (bool, default=False) –
Deprecated. Use
null_handlinginstead.This flag must be set to
Trueto read non-float Parquet columns that contain null values.null_handling (Optional[str], default=None) –
Null-handling mode. Supported values are
"none","only floats", and"all". IfNone, the default is"only floats".If
"none", the data is assumed to contain no nulls. This gives the best performance, but behavior is undefined if nulls are present.If
"only floats", only floating-point columns may contain nulls. This improves performance for other data types.If
"all", any column may contain nulls. This is the most general mode, but it is slower overall.fixed_len (int, default=-1) – Fixed string length to use when reading Parquet string columns if the length of each string is known at runtime. This can avoid byte calculation and may improve performance.
- Returns:
Dictionary mapping datasetName to the loaded object. The values may be pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, or Index.
- Return type:
Mapping[str, Union[pdarray, Strings, SegArray, Categorical, DataFrame, IPv4, Datetime, Timedelta, Index]]
- Raises:
ValueError – Raised if not all datasets are present in all Parquet files or if one or more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened. If
allow_errorsisTrue, this may also be raised if no values are returned from the server.TypeError – Raised if an unknown Arkouda type is returned from the server.
Notes
If
filenamesis a string, it is interpreted as a shell expression. A single filename is a valid expression, so it will also work. The expression is expanded withglobto read all matching files.If
iterative=True, each dataset name and filename is passed to the server independently in sequence. Ifiterative=False, all dataset names and filenames are passed to the server in a single string.If
datasetsisNone, dataset names are inferred from the first file and all datasets are read. Useget_datasetsto show the names of datasets in Parquet files.Parquet currently always recomputes offsets. This note should be updated when the Parquet workflow changes.
See also
Examples
>>> import arkouda as ak
Read a Parquet file:
>>> x = ak.read_parquet("path/name_prefix.parquet")
Read files matching a glob expression:
>>> x = ak.read_parquet("path/name_prefix*")
- arkouda.read_tagged_data(filenames: str | List[str], datasets: str | List[str] | None = None, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False)[source]¶
Read datasets from files and tag each record with the file it was read from.
The file type is determined automatically.
- Parameters:
filenames (Union[str, List[str]]) – Either a list of filenames or a shell expression.
datasets (Optional[Union[str, List[str]]], default=None) – Dataset name or list of dataset names to read. If None, all available datasets are read.
strictTypes (bool, default=True) – If True, require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.
allow_errors (bool, default=False) – If True, files with read errors may be skipped instead of causing the operation to fail. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool, default=False) – If True, instruct the server to calculate the offsets or segments array instead of loading it from HDF5 files. In the future, this option may become the default.
read_nested (bool, default=True) – If True, SegArray objects are read from the file. If False, SegArray objects and other nested Parquet columns are ignored. Ignored if datasets is not None. Parquet only.
has_non_float_nulls (bool, default=False) – Must be set to True to read non-float Parquet columns that contain null values.
Notes
This function is not currently supported for Categorical or GroupBy datasets.
Examples
>>> import arkouda as ak
Read files and return the data along with tagging information:
>>> data, cat = ak.read_tagged_data("path/name")
The codes in cat map each record in data to the file it came from. The returned data includes a "Filename_Codes" array.
>>> data
{"Filename_Codes": array([0 3 6 9 12]), "col_name": array([0 0 0 1])}
- arkouda.read_zarr(store_path: str, ndim: int, dtype)[source]¶
Read a Zarr store from disk into a pdarray.
Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.
- Parameters:
- Returns:
The pdarray read from the Zarr store.
- Return type:
- arkouda.receive(hostname: str, port)[source]¶
Receive a pdarray sent by pdarray.transfer().
- Parameters:
hostname (str) – The hostname of the server that sent the array.
port (int_scalars) – The port to send the array over. This must be an open port (i.e., not one that the Arkouda server is running on). Sending opens numLocales ports in succession starting at port, so ports in the range {port..(port+numLocales)} are used (e.g., with a 4-node Arkouda server and port=1234, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to pdarray.transfer().
- Returns:
The pdarray sent from the sending server to the current receiving server.
- Return type:
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
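Since the transfer opens one port per locale starting at port, the full set of ports that must be reachable can be computed ahead of time. A minimal sketch in plain Python (num_locales is an assumed stand-in for the server's numLocales; the consecutive-port layout follows the example above):

```python
def transfer_ports(port: int, num_locales: int) -> list[int]:
    """Ports used to transfer an array from a num_locales-node server.

    Locale i sends its chunk over port + i, so all ports from
    port through port + num_locales - 1 must be open.
    """
    return [port + i for i in range(num_locales)]

# A 4-node server with port=1234 uses ports 1234..1237.
print(transfer_ports(1234, 4))  # [1234, 1235, 1236, 1237]
```

A firewall check over this list before calling receive() can save a failed transfer.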
- arkouda.receive_dataframe(hostname: str, port)[source]¶
Receive a DataFrame sent by dataframe.transfer().
- Parameters:
hostname (str) – The hostname of the server that sent the dataframe.
port (int_scalars) – The port to send the dataframe over. This must be an open port (i.e., not one that the Arkouda server is running on). Sending opens numLocales ports in succession starting at port, so ports in the range {port..(port+numLocales)} are used (e.g., with a 4-node Arkouda server and port=1234, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to pdarray.send_array().
- Returns:
The dataframe sent from the sending server to the current receiving server.
- Return type:
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- arkouda.restore(filename)[source]¶
Return data saved using ak.snapshot.
- Parameters:
filename (str) – Name used to create snapshot to be read
- Return type:
Dict
Notes
Unlike other save/load methods, snapshot/restore saves DataFrames alongside other objects in HDF5, so they are returned within the dictionary as DataFrames.
- arkouda.save_checkpoint(name='', path='.akdata', mode: Literal['overwrite', 'preserve_previous', 'error'] = 'overwrite')[source]¶
Save the server’s state.
Records some metadata about the server and saves all pdarrays into Parquet files.
- Parameters:
name (str) – Name of the checkpoint. The default is the server session ID, which is typically in the format id_<hash>_. A directory with this name will be created in path.
path (str) – The directory in which to save the checkpoint. If the directory doesn't exist, it will be created. If it exists, a new directory for the checkpoint instance will be created inside it.
mode ({'overwrite', 'preserve_previous', 'error'}) – How to handle an existing checkpoint with the same name.
'overwrite' (default): overwrite the checkpoint files.
'preserve_previous': rename the existing checkpoint to <name>.prev, overwriting that if it exists.
'error': raise an error if the checkpoint exists.
Notes
Only pdarrays are saved. Other data structures will not be recorded. We expect to expand the coverage in the future.
- Returns:
The checkpoint name, which will be the same as the name argument if one was passed.
- Return type:
Examples
>>> import arkouda as ak
>>> arr = ak.zeros(10, int)
>>> arr[2] = 2
>>> arr[2]
np.int64(2)
>>> cp_name = ak.save_checkpoint()
>>> arr[2] = 3
>>> arr[2]
np.int64(3)
>>> ak.load_checkpoint(cp_name)
>>> arr[2]
np.int64(3)
See also
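The three mode values differ only in how they treat an already-existing checkpoint directory. A client-side sketch of those semantics (this illustrates the behavior described above using local directories; it is not Arkouda's server implementation, and the directory layout is an assumption):

```python
import shutil
from pathlib import Path


def prepare_checkpoint_dir(path: str, name: str, mode: str) -> Path:
    """Illustrate 'overwrite' / 'preserve_previous' / 'error' handling."""
    target = Path(path) / name
    if target.exists():
        if mode == "error":
            raise FileExistsError(f"checkpoint {name!r} already exists")
        elif mode == "preserve_previous":
            prev = target.with_name(name + ".prev")
            if prev.exists():
                shutil.rmtree(prev)  # overwrite an older .prev
            target.rename(prev)
        elif mode == "overwrite":
            shutil.rmtree(target)
    target.mkdir(parents=True)
    return target
```

With 'preserve_previous', one prior checkpoint survives as <name>.prev; older ones are discarded.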
- arkouda.snapshot(filename)[source]¶
Create a snapshot of the current Arkouda namespace.
All currently accessible variables containing Arkouda objects will be written to an HDF5 file.
Unlike other save/load functions, this maintains the integrity of dataframes.
Current variable names are used as the dataset names when saving.
- Parameters:
filename (str) – Name to use when storing file
See also
ak.restore
- arkouda.string_operators(cls)[source]¶
Add common string operation methods to a StringAccessor class.
This class decorator dynamically attaches string operations (contains, startswith, endswith) to the given class using the _make_op helper.
- Parameters:
cls (type) – The accessor class to decorate.
- Returns:
The accessor class with string methods added.
- Return type:
Notes
Used internally to implement the .str accessor API.
- arkouda.to_csv(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings], prefix_path: str, names: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]¶
Write Arkouda object(s) to CSV file(s).
All CSV Files written by Arkouda include a header denoting data types of the columns.
- Parameters:
columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None the keys of the mapping will be used as the dataset names.
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
names (List[str] (Optional)) – names of dataset to be written. Order should correspond to the order of data provided in columns.
col_delim (str) – Defaults to ",". Value to be used to separate columns within the file. Be sure that the value used does not appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Raises:
ValueError – Raised if any datasets are not present in all CSV files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
See also
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing bytes as uint(8).
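Because the notes above require that the column delimiter never appear in the data, a cheap client-side pre-flight check can catch a bad col_delim before writing. A sketch over ordinary Python strings (an Arkouda Strings object would instead be checked with its own contains method; this helper is hypothetical):

```python
def validate_delimiter(values: list[str], col_delim: str = ",") -> None:
    """Raise if the chosen column delimiter appears in any value."""
    offenders = [v for v in values if col_delim in v]
    if offenders:
        raise ValueError(
            f"delimiter {col_delim!r} found in {len(offenders)} value(s), "
            f"e.g. {offenders[0]!r}; choose a different col_delim"
        )

validate_delimiter(["alpha", "beta"], ",")  # passes silently
```

Running this over each string column before to_csv avoids producing files that cannot be parsed back.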
- arkouda.to_hdf(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: Literal['truncate', 'append'] = 'truncate', file_type: Literal['single', 'distribute'] = 'distribute') None[source]¶
Save multiple named pdarrays to HDF5 files.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({"truncate", "append"}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type ({"single", "distribute"}) – Default: distribute. "single" writes the dataset to a single file; "distribute" writes the dataset to one file per locale.
- Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
See also
Notes
Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at path_prefix will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> import arkouda as ak
>>> a = ak.arange(25)
>>> b = ak.arange(25)
Save with mapping defining dataset names
>>> ak.to_hdf({'a': a, 'b': b}, 'path/name_prefix')
Save using names instead of mapping
>>> ak.to_hdf([a, b], 'path/name_prefix', names=['a', 'b'])
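With the default file_type="distribute", each locale writes its own file by appending _LOCALE#### to the prefix. A sketch of the resulting file names (the four-digit zero-padding is an assumption read off the #### placeholder in the docs above):

```python
def expected_hdf_files(prefix_path: str, num_locales: int) -> list[str]:
    """File names produced by a distributed write, one per locale."""
    return [f"{prefix_path}_LOCALE{i:04d}" for i in range(num_locales)]

print(expected_hdf_files("path/name_prefix", 2))
# ['path/name_prefix_LOCALE0000', 'path/name_prefix_LOCALE0001']
```

This is handy for verifying that a write produced the expected number of files before a later append or read.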
- arkouda.to_parquet(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: Literal['truncate', 'append'] = 'truncate', compression: str | None = None, convert_categoricals: bool = False) None[source]¶
Save multiple named arrays to Parquet files.
- Parameters:
columns (Mapping[str, Union[pdarray, Strings, SegArray]] or List[Union[pdarray, Strings, SegArray]]) – Collection of arrays to save.
prefix_path (str) – Directory and filename prefix for the output files.
names (Optional[List[str]], default=None) – Dataset names for the arrays when columns is provided as a list.
mode (Literal["truncate", "append"], default="truncate") – If "truncate", overwrite any existing output files. If "append", attempt to create a new dataset in existing files. "append" is deprecated; use the multi-column write instead.
compression (Optional[str], default=None) – Compression type to use when writing the file. Supported values include "snappy", "gzip", "brotli", "zstd", and "lz4".
convert_categoricals (bool, default=False) – Parquet requires all columns to have the same size, and Categorical objects do not satisfy that requirement. If set to True, write the equivalent Strings in place of any Categorical columns.
- Raises:
ValueError – Raised if the lengths of columns and names differ, or if mode is not "truncate" or "append".
RuntimeError – Raised if a server-side error occurs while saving the arrays.
Notes
Creates one file per locale containing that locale’s chunk of each array.
If columns is a dictionary, its keys are used as the Parquet column names. Otherwise, if no names are supplied, integer names starting at 0 are used.
By default, any existing files at prefix_path are deleted, regardless of whether they would be overwritten. If mode="append", Arkouda attempts to add columns as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> import arkouda as ak
>>> a = ak.arange(25)
>>> b = ak.arange(25)
Save with a mapping defining dataset names:
>>> ak.to_parquet({"a": a, "b": b}, "path/name_prefix")
Save using names instead of a mapping:
>>> ak.to_parquet([a, b], "path/name_prefix", names=["a", "b"])
- arkouda.to_zarr(store_path: str, arr: arkouda.numpy.pdarrayclass.pdarray, chunk_shape)[source]¶
Write a pdarray to disk as a Zarr store.
Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.
- Parameters:
- Raises:
ValueError – Raised if the number of dimensions in the chunk shape does not match the number of dimensions in the array or if the array is not a 32 or 64 bit numeric type
- arkouda.update_hdf(columns: Mapping[str, arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray] | List[arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.numpy.segarray.SegArray], prefix_path: str, names: List[str] | None = None, repack: bool = True)[source]¶
Overwrite the datasets whose names appear in names, or the keys of columns if columns is a dictionary.
- Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains but is inaccessible. Setting to False yields better performance but causes file sizes to grow.
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the datasets
Notes
If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the datasets provided do not exist, they will be added
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
This workflow is slightly different from to_hdf to prevent reading and creating a copy of the file for each dataset
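The fallback check in the notes above, scanning the file name for _LOCALE#### when no File_Format attribute is present, can be sketched with a regular expression. The four-digit width is an assumption based on the #### placeholder; this helper is illustrative, not part of Arkouda's API:

```python
import re

_LOCALE_RE = re.compile(r"_LOCALE\d{4}")


def looks_distributed(filename: str) -> bool:
    """Guess whether a file was written one-file-per-locale."""
    return bool(_LOCALE_RE.search(filename))


print(looks_distributed("data_LOCALE0003.hdf5"))  # True
print(looks_distributed("data.hdf5"))             # False
```

Checking the File_Format attribute first remains the authoritative path; the name scan is only a heuristic.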
- arkouda.write_log(log_msg: str, tag: str = 'ClientGeneratedLog', log_lvl: LogLevel = LogLevel.INFO)[source]¶
Allow the user to write custom logs.
- Parameters:
log_msg (str) – The message to be added to the server log
tag (str) – The tag to use in the log. This takes the place of the server function name and allows for easy identification of custom logs. Defaults to "ClientGeneratedLog".
log_lvl (LogLevel) – The type of log to be written. Defaults to LogLevel.INFO.
See also
- arkouda.xlogy(x: arkouda.numpy.pdarrayclass.pdarray | numpy.float64, y: arkouda.numpy.pdarrayclass.pdarray)[source]¶
Computes x * log(y).
- Parameters:
- Return type:
Examples
>>> import arkouda as ak
>>> from arkouda.scipy.special import xlogy
>>> xlogy(ak.array([1, 2, 3, 4]), ak.array([5, 6, 7, 8]))
array([1.6094379124341003 3.5835189384561099 5.8377304471659395 8.317766166719343])
>>> xlogy(5.0, ak.array([1, 2, 3, 4]))
array([0.00000000000000000 3.4657359027997265 5.4930614433405491 6.9314718055994531])
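For intuition, the same elementwise computation on local Python lists. SciPy's xlogy defines 0 * log(0) as 0, and the assumption here is that Arkouda mirrors that convention; this pure-Python sketch is not Arkouda's implementation:

```python
import math


def xlogy(x, y):
    """Elementwise x * log(y); by convention 0 * log(0) is 0."""
    xs = x if isinstance(x, (list, tuple)) else [x] * len(y)
    return [0.0 if xi == 0 else xi * math.log(yi) for xi, yi in zip(xs, y)]


print(xlogy([1, 2, 3, 4], [5, 6, 7, 8]))
print(xlogy(5.0, [1, 2, 3, 4]))
```

The first result matches the Arkouda output above (e.g. 1 * log(5) ≈ 1.6094); a scalar x is broadcast across y as in the second example.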