arkouda.util

Attributes

Classes

BitVector

Represent integers as bit vectors, e.g. a set of flags.

Datetime

Represents a date and/or time.

GroupBy

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

IPv4

Represent integers as IPv4 addresses.

SegArray

Strings

Represents an array of strings whose data resides on the

Timedelta

Represents a duration, the difference between two dates or times.

pdarray

The basic arkouda array class. This class contains only the

Functions

BitVectorizer([width, reverse])

Make a callback (i.e. function) that can be called on an array to create a BitVector.

arange(→ arkouda.numpy.pdarrayclass.pdarray)

arange([start,] stop[, stride,] dtype=int64)

attach(name)

attach_all(names)

Attach to all objects registered with the names provided

broadcast(segments, values[, size, permutation])

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

broadcast_dims(→ Tuple[int, Ellipsis])

Algorithm to determine shape of broadcasted PD array given two array shapes

coargsort(→ arkouda.numpy.pdarrayclass.pdarray)

Return the permutation that groups the rows (left-to-right), if the

concatenate(items[, ordered])

convert_bytes(→ arkouda.numpy.dtypes.numeric_scalars)

Convert the number of bytes to KB, MB, or GB.

convert_if_categorical(values)

Convert a Categorical array to Strings for display

create_pdarray(→ pdarray)

Return a pdarray instance pointing to an array created by the arkouda server.

dtype(dtype)

Create a data type object.

enrich_inplace(data, keynames, aggregations, **kwargs)

expand(→ arkouda.numpy.pdarrayclass.pdarray)

Expand an array with values placed into the indicated segments.

generic_concat(items[, ordered])

get_callback(x)

identity(x)

invert_permutation(→ arkouda.numpy.pdarrayclass.pdarray)

Find the inverse of a permutation array.

is_float(→ bool)

Check if the dtype of the given array is float.

is_int(→ bool)

Check if the dtype of the given array is int.

is_numeric(→ bool)

Check if the dtype of the given array is numeric.

is_registered(→ bool)

Determine if the name provided is associated with a registered Object

list_registry([detailed])

Return a list containing the names of all registered objects

map(→ Union[arkouda.numpy.pdarrayclass.pdarray, ...)

Map values of an array according to an input mapping.

most_common(g, values)

register(obj, name)

Register an arkouda object with a user-specified name. Backwards compatible

register_all(data)

Register all objects in the provided dictionary

report_mem([pre])

sparse_sum_help(...)

Helper for summing two sparse matrices together

unique(→ Union[groupable, Tuple[groupable, pdarray, ...)

Find the unique elements of an array.

unregister(→ str)

unregister_all(names)

Unregister all names provided

Package Contents

class arkouda.util.BitVector(values, width=64, reverse=False)[source]

Bases: arkouda.numpy.pdarrayclass.pdarray

Represent integers as bit vectors, e.g. a set of flags.

Parameters:
  • values (pdarray, int64) – The integers to represent as bit vectors

  • width (int) – The number of bit fields in the vector

  • reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.

Returns:

bitvectors – The array of binary vectors

Return type:

BitVector

Notes

This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like a uint64 pdarray.
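
As an illustration of how width and reverse affect the displayed form, here is a minimal pure-Python sketch. It is not the arkouda implementation: the helper name format_bits is hypothetical, and the characters arkouda actually uses to show set and unset bits may differ from the '0' and '1' used here.

```python
def format_bits(x: int, width: int = 64, reverse: bool = False) -> str:
    """Render the low `width` bits of a non-negative integer as a string.

    Hypothetical helper for illustration only; masking to `width` bits
    is an assumption of this sketch.
    """
    # Zero-pad on the left so the most significant bit comes first.
    bits = format(x & ((1 << width) - 1), f"0{width}b")
    # reverse=True flips the string so the least significant bit is left-most.
    return bits[::-1] if reverse else bits


flags = [0, 1, 5, 8]
print([format_bits(v, width=4) for v in flags])
# ['0000', '0001', '0101', '1000']
print([format_bits(v, width=4, reverse=True) for v in flags])
# ['0000', '1000', '1010', '0001']
```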

conserves
format(x)[source]

Format a single binary vector as a string.

classmethod from_return_msg(rep_msg)[source]
opeq(other, op)[source]
register(user_defined_name)[source]

Register this BitVector object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – User-defined name the BitVector is to be registered under; this will be the root name for underlying components

Returns:

The same BitVector, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different BitVectors with the same name.

Return type:

BitVector

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the BitVector with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name = None
reverse = False
special_objType = 'BitVector'
to_list()[source]

Export data to a list of string-formatted bit vectors.

to_ndarray()[source]

Export data to a numpy array of string-formatted bit vectors.

values
width = 64
arkouda.util.BitVectorizer(width=64, reverse=False)[source]

Make a callback (i.e. function) that can be called on an array to create a BitVector.

Parameters:
  • width (int) – The number of bit fields in the vector

  • reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.

Returns:

bitvectorizer – A function that takes an array and returns a BitVector instance

Return type:

callable

class arkouda.util.Datetime(pda, unit: str = _BASE_UNIT)[source]

Bases: _AbstractBaseTime

Represents a date and/or time.

Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.

Parameters:
  • pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array)

  • unit (str, default 'ns') –

    For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like 'sec') are accepted.

    Possible values:

    • 'weeks' or 'w'

    • 'days' or 'd'

    • 'hours' or 'h'

    • 'minutes', 'm', or 't'

    • 'seconds' or 's'

    • 'milliseconds', 'ms', or 'l'

    • 'microseconds', 'us', or 'u'

    • 'nanoseconds', 'ns', or 'n'

    Unlike in pandas, units cannot be combined or mixed with integers.

Notes

The .values attribute is always in nanoseconds with int64 dtype.
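
The unit handling described above can be sketched in pure Python as a lookup from unit name to nanoseconds. This is only an illustration: the names NANOS_PER_UNIT, ALIASES, normalize_unit, and to_nanoseconds are hypothetical, and raising on an ambiguous prefix is an assumption of the sketch, not documented arkouda behavior.

```python
NANOS_PER_UNIT = {
    "weeks": 7 * 24 * 3600 * 10**9,
    "days": 24 * 3600 * 10**9,
    "hours": 3600 * 10**9,
    "minutes": 60 * 10**9,
    "seconds": 10**9,
    "milliseconds": 10**6,
    "microseconds": 10**3,
    "nanoseconds": 1,
}

# Short aliases from the list of possible values above.
ALIASES = {
    "w": "weeks", "d": "days", "h": "hours",
    "m": "minutes", "t": "minutes", "s": "seconds",
    "ms": "milliseconds", "l": "milliseconds",
    "us": "microseconds", "u": "microseconds",
    "ns": "nanoseconds", "n": "nanoseconds",
}


def normalize_unit(unit: str) -> str:
    """Map an alias or a prefix of a full unit name to the full name."""
    u = unit.lower()  # units are not case-sensitive
    if u in ALIASES:
        return ALIASES[u]
    matches = [name for name in NANOS_PER_UNIT if name.startswith(u)]
    if len(matches) != 1:
        # Assumption: ambiguous or unknown prefixes are rejected.
        raise ValueError(f"unrecognized or ambiguous unit: {unit!r}")
    return matches[0]


def to_nanoseconds(values, unit="ns"):
    """Convert integers in `unit` to integer nanoseconds."""
    factor = NANOS_PER_UNIT[normalize_unit(unit)]
    return [v * factor for v in values]


print(normalize_unit("sec"))          # seconds
print(to_nanoseconds([1, 2], "ms"))   # [1000000, 2000000]
```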

property date
property day
property day_of_week
property day_of_year
property dayofweek
property dayofyear
property hour
property is_leap_year
is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool_

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isocalendar()[source]
property microsecond
property millisecond
property minute
property month
property nanosecond
register(user_defined_name)[source]

Register this Datetime object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – User-defined name the Datetime is to be registered under; this will be the root name for underlying components

Returns:

The same Datetime, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different Datetimes with the same name.

Return type:

Datetime

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Datetime with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property second
special_objType = 'Datetime'
sum()[source]

Return the sum of all elements in the array.

supported_opeq
supported_with_datetime
supported_with_pdarray
supported_with_r_datetime
supported_with_r_pdarray
supported_with_r_timedelta
supported_with_timedelta
to_pandas()[source]

Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.

See also

to_ndarray

unregister()[source]

Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property week
property weekday
property weekofyear
property year
class arkouda.util.GroupBy[source]

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

size[source]

The length of the input array(s), i.e. number of rows

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

(list of) pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
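
To make the relationship between the permutation, unique_keys, segments, and ngroups attributes concrete, here is a minimal pure-Python sketch on plain lists. The function group_structure is hypothetical, and the use of a stable sort is an assumption of this sketch; the permutation arkouda computes on the server may order rows within a group differently.

```python
def group_structure(keys):
    """Return (permutation, unique_keys, segments) for a list of keys."""
    # Permutation that sorts the keys by value (stable, so ties keep
    # their original relative order).
    permutation = sorted(range(len(keys)), key=lambda i: keys[i])
    grouped = [keys[i] for i in permutation]

    unique_keys, segments = [], []
    for pos, key in enumerate(grouped):
        # A new group starts wherever the grouped key changes.
        if pos == 0 or key != grouped[pos - 1]:
            unique_keys.append(key)
            segments.append(pos)
    return permutation, unique_keys, segments


keys = [3, 1, 4, 4, 4, 1, 3, 3, 2, 2]
permutation, unique_keys, segments = group_structure(keys)
print(permutation)       # [1, 5, 8, 9, 0, 6, 7, 2, 3, 4]
print(unique_keys)       # [1, 2, 3, 4]
print(segments)          # [0, 2, 4, 7]
print(len(unique_keys))  # ngroups: 4
```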

AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise AND of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if AND is not supported for the values dtype

OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise OR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if OR is not supported for the values dtype

Reductions(*args, **kwargs)


XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise XOR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if XOR is not supported for the values dtype
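
The AND, OR, and XOR reductions documented above all follow the same pattern: collect each group's values, then fold them with one bitwise operator. A minimal pure-Python sketch of that pattern follows; group_reduce is a hypothetical helper operating on plain lists, not part of arkouda.

```python
import operator
from functools import reduce


def group_reduce(keys, values, op):
    """Fold each group's values with the binary operator `op`."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)
    return unique_keys, [reduce(op, groups[k]) for k in unique_keys]


keys = [0, 1, 0, 1, 0]
values = [0b110, 0b011, 0b101, 0b001, 0b100]

print(group_reduce(keys, values, operator.and_))  # ([0, 1], [4, 1])
print(group_reduce(keys, values, operator.or_))   # ([0, 1], [7, 3])
print(group_reduce(keys, values, operator.xor))   # ([0, 1], [7, 2])
```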

aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable][source]

Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – If True (default), NaN values are skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype
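
The ddof parameter can be illustrated with a per-group standard deviation in pure Python, where the sum of squared deviations is divided by N - ddof. This is a sketch under stated assumptions: group_std is a hypothetical helper, and returning NaN when a group has no more than ddof elements is a choice made here, not documented arkouda behavior.

```python
import math


def group_std(keys, values, ddof=1):
    """Per-group standard deviation with divisor N - ddof."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)

    result = []
    for k in unique_keys:
        vals = groups[k]
        n = len(vals)
        mean = sum(vals) / n
        squared_dev = sum((v - mean) ** 2 for v in vals)
        # Assumption: groups too small for the requested ddof yield NaN.
        result.append(math.sqrt(squared_dev / (n - ddof)) if n > ddof else math.nan)
    return unique_keys, result


keys = [0, 0, 0, 0, 1, 1]
vals = [2.0, 4.0, 4.0, 6.0, 1.0, 3.0]
print(group_std(keys, vals, ddof=1))  # sample std: divisor N - 1
print(group_std(keys, vals, ddof=0))  # population std: divisor N
```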

Examples

>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_all (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
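
The note above can be checked with a small pure-Python sketch that returns, for each group, the position of the first maximum in the original values list. The helper group_argmax is hypothetical and works on plain lists; it only mirrors the documented semantics.

```python
def group_argmax(keys, values):
    """Index (into the original `values`) of the first maximum per group."""
    best = {}
    for i, (k, v) in enumerate(zip(keys, values)):
        # Strict '>' keeps the earliest index when a maximum repeats.
        if k not in best or v > values[best[k]]:
            best[k] = i
    unique_keys = sorted(best)
    return unique_keys, [best[k] for k in unique_keys]


a = [3, 3, 4, 3, 3, 2, 3, 2, 4, 2]
b = [3, 3, 3, 4, 1, 1, 3, 3, 3, 4]
keys, idx = group_argmax(a, b)
print(keys, idx)            # [2, 3, 4] [9, 3, 2]
print([b[i] for i in idx])  # the per-group maxima: [4, 4, 3]
```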

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
argmin(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
attach(user_defined_name: str) GroupBy[source]

Return a GroupBy object attached to the name under which it was registered in the arkouda server using register()

Parameters:

user_defined_name (str) – user defined name which GroupBy object was registered under

Returns:

The GroupBy object created by re-attaching to the corresponding server components

Return type:

GroupBy

Raises:

RegistrationError – if user_defined_name is not registered

broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if values is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.
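
The effect of permute can be sketched in pure Python: each group's value is first written across that group's segment in grouped order, and then optionally scattered back to the original row order. The helper group_broadcast is hypothetical and uses a stable sort as the grouping permutation, which is an assumption of this sketch.

```python
def group_broadcast(keys, values, permute=True):
    """Fill each group's rows with that group's single value."""
    permutation = sorted(range(len(keys)), key=lambda i: keys[i])
    unique_keys = sorted(set(keys))
    if len(values) != len(unique_keys):
        raise ValueError("need exactly one value per group")
    value_of = dict(zip(unique_keys, values))

    # Grouped order: all rows of the first group, then the second, ...
    grouped = [value_of[keys[i]] for i in permutation]
    if not permute:
        return grouped

    # Scatter back so position i lines up with keys[i].
    out = [None] * len(keys)
    for pos, i in enumerate(permutation):
        out[i] = grouped[pos]
    return out


a = [0, 1, 0, 1, 0]
print(group_broadcast(a, [3, 5]))                 # [3, 5, 3, 5, 3]
print(group_broadcast(a, [3, 5], permute=False))  # [3, 3, 3, 5, 5]
```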

Examples

>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy[source]

Function to build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str, optional) – Passing a name will init the new GroupBy and assign it the given name

  • kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values: pdarray) Tuple[groupable, pdarray][source]

Count the number of elements in each group. NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be counted by group (excluding NaN values).

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).
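
The NaN exclusion described above can be sketched in pure Python. The helper group_count is hypothetical and works on plain lists; note that a group whose values are all NaN still appears in the keys, with a count of zero.

```python
import math


def group_count(keys, values):
    """Count the non-NaN values in each group."""
    counts = {k: 0 for k in keys}  # every key appears, even with count 0
    for k, v in zip(keys, values):
        if not (isinstance(v, float) and math.isnan(v)):
            counts[k] += 1
    unique_keys = sorted(counts)
    return unique_keys, [counts[k] for k in unique_keys]


a = [1, 0, -1, 1, 0, -1]
b = [1.0, math.nan, -1.0, math.nan, math.nan, -1.0]
print(group_count(a, b))  # ([-1, 0, 1], [2, 0, 1])
```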

Examples

>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
first(values: groupable_element_type) Tuple[groupable, groupable_element_type][source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

from_return_msg(rep_msg)[source]
head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default True) – If True, return the indices of the selected values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result contains indices; otherwise, it contains values.
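
The selection rule can be sketched in pure Python: walk the rows in grouped order and keep at most n rows per group, returning either their original indices or the corresponding values. The helper group_head is hypothetical and assumes a stable grouping order.

```python
def group_head(keys, values, n=5, return_indices=True):
    """Return the first `n` rows of each group, in grouped order."""
    permutation = sorted(range(len(keys)), key=lambda i: keys[i])
    unique_keys = sorted(set(keys))

    taken = {k: 0 for k in unique_keys}
    picked = []
    for i in permutation:
        # Groups smaller than n simply contribute all of their rows.
        if taken[keys[i]] < n:
            taken[keys[i]] += 1
            picked.append(i)

    result = picked if return_indices else [values[i] for i in picked]
    return unique_keys, result


a = [i % 3 for i in range(10)]
v2 = [-2 * i for i in range(10)]
print(group_head(a, v2, 2, return_indices=True))
# ([0, 1, 2], [0, 3, 1, 4, 2, 5])
print(group_head(a, v2, 2, return_indices=False))
# ([0, 1, 2], [0, -6, -2, -8, -4, -10])
```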

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = ak.GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
is_registered() bool[source]

Return True if the object is contained in the registry

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – If True (default), NaN values are skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – If True (default), NaN values are skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – If True (default), NaN values are skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.
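
A per-group median with optional NaN skipping can be sketched in pure Python as follows. The helper group_median is hypothetical; in particular, returning NaN for a group that contains a NaN when skipna is False, or for a group left empty after skipping, is an assumption of this sketch rather than documented behavior.

```python
import math
from statistics import median


def group_median(keys, values, skipna=True):
    """Median of each group's values, always returned as floats."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(float(v))
    unique_keys = sorted(groups)

    medians = []
    for k in unique_keys:
        vals = groups[k]
        if skipna:
            vals = [v for v in vals if not math.isnan(v)]
        elif any(math.isnan(v) for v in vals):
            vals = []  # assumption: a NaN poisons the group's median
        medians.append(float(median(vals)) if vals else math.nan)
    return unique_keys, medians


a = [4, 1, 4, 3, 2, 2, 2, 3, 3]
b = [-5 + 1.25 * i for i in range(9)]  # same values as linspace(-5, 5, 9)
print(group_median(a, b))
# ([1, 2, 3, 4], [-3.75, 1.25, 3.75, -3.75])
```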

Examples

>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – If True (default), NaN values are skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values: groupable) Tuple[groupable, groupable][source]

Most common value in each group. If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group
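
The tie-breaking rule can be sketched in pure Python. The helper group_mode is hypothetical, and it reads "occurs first" as "appears earliest in the original order of the group's values"; that reading is an assumption of this sketch.

```python
from collections import Counter


def group_mode(keys, values):
    """Most common value per group; ties go to the earliest value."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)

    modes = []
    for k in unique_keys:
        counts = Counter(groups[k])
        top = max(counts.values())
        # Scan in original order so the first modal value wins a tie.
        modes.append(next(v for v in groups[k] if counts[v] == top))
    return unique_keys, modes


keys = [0, 0, 0, 0, 1, 1, 1]
values = [7, 5, 5, 7, 2, 9, 9]
print(group_mode(keys, values))  # ([0, 1], [7, 9])
```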

most_common(values)[source]

(Deprecated) See GroupBy.mode().

nunique(values: groupable) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
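
The counts in the example above can be reproduced with a short pure-Python sketch that collects each group's values into a set. The helper group_nunique is hypothetical and works on plain lists.

```python
def group_nunique(keys, values):
    """Number of distinct values in each group."""
    seen = {}
    for k, v in zip(keys, values):
        seen.setdefault(k, set()).add(v)
    unique_keys = sorted(seen)
    return unique_keys, [len(seen[k]) for k in unique_keys]


data = [3, 4, 3, 1, 1, 4, 3, 4, 1, 4]
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]
print(group_nunique(labels, data))  # ([1, 2, 3, 4], [2, 2, 3, 1])
```
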
objType(*args, **kwargs)


prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – If True (default), NaN values are skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name: str) GroupBy[source]

Register this GroupBy object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components

Returns:

The same GroupBy, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group. You can specify either the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility.

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If a pdarray is passed, it must have the same length as the GroupBy keys; its values are used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – If True, permute the samples according to group. Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray
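
The n-per-group case without replacement can be sketched in plain Python as follows. This is an illustration only: `sample_per_group` is a hypothetical helper, it returns indices (as with return_indices=True), and it does not model frac, weights, replace, or permute_samples.

```python
import random
from collections import defaultdict

def sample_per_group(keys, n=1, seed=None):
    """Pick n distinct positions from each group, without replacement."""
    rng = random.Random(seed)  # an int seed makes the draw reproducible
    positions = defaultdict(list)
    for i, k in enumerate(keys):
        positions[k].append(i)
    picked = []
    for k in sorted(positions):
        if n > len(positions[k]):
            # mirrors the rule that n must not exceed the smallest group
            raise ValueError("n is larger than a group and replace is False")
        picked.extend(sorted(rng.sample(positions[k], n)))
    return picked

keys = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
idx = sample_per_group(keys, n=2, seed=42)
```

Calling the helper twice with the same seed yields the same indices, which is the role random_state plays in the description above.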

size() Tuple[groupable, pdarray][source]

Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).

Parameters:

none

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

See also

count

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – Whether NaN values should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
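
The role of ddof in the Notes can be made concrete with a plain-Python sketch that reproduces the numbers in this example. `grouped_std` is a hypothetical helper, not arkouda's implementation, and it assumes every group has more than ddof elements.

```python
import math
from collections import defaultdict

def grouped_std(keys, values, ddof=1):
    """Per-group standard deviation using the divisor N - ddof."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)
    stds = []
    for k in unique_keys:
        x = groups[k]
        mean = sum(x) / len(x)
        squared_dev = sum((v - mean) ** 2 for v in x)
        stds.append(math.sqrt(squared_dev / (len(x) - ddof)))
    return unique_keys, stds

a = [3, 3, 4, 3, 3, 2, 3, 2, 4, 2]
b = [3, 3, 3, 4, 1, 1, 3, 3, 3, 4]
unique_keys, stds = grouped_std(a, b, ddof=1)
```

For key 2 the values are 1, 3, 4, so the squared deviations sum to 14/3 and dividing by N - ddof = 2 gives a variance of 7/3, whose square root is about 1.5275.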
sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – Whether NaN values should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default True) – If True, return the indices of the selected values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result contains indices. Otherwise, the result contains values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = ak.GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
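
The selection rule behind this example can be sketched in plain Python. `group_tail` is a hypothetical helper that only mirrors the documented behaviour (keys in sorted order, last n positions of each group, n at least 1); it is not the server-side implementation.

```python
from collections import defaultdict

def group_tail(keys, values, n=5, return_indices=True):
    """Return the last n entries of each group, in grouped key order."""
    positions = defaultdict(list)
    for i, k in enumerate(keys):
        positions[k].append(i)
    unique_keys = sorted(positions)
    # Slicing with [-n:] keeps the whole group when it has fewer than n items.
    idx = [i for k in unique_keys for i in positions[k][-n:]]
    result = idx if return_indices else [values[i] for i in idx]
    return unique_keys, result

keys = [i % 3 for i in range(10)]
v2 = [-2 * i for i in range(10)]
unique_keys, idx2 = group_tail(keys, v2, 2, return_indices=True)
_, values2 = group_tail(keys, v2, 2, return_indices=False)
```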
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Returns:

None

Notes

GroupBy is not currently supported by Parquet.

unique(values: groupable)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values to unique

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

unregister_groupby_by_name(user_defined_name: str) None[source]

Function to unregister GroupBy object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the GroupBy object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)[source]
var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – Whether NaN values should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
class arkouda.util.IPv4(values)[source]

Bases: arkouda.numpy.pdarrayclass.pdarray

Represent integers as IPv4 addresses.

Parameters:

values (pdarray, int64) – The integer IP addresses

Returns:

The same IP addresses

Return type:

IPv4

Notes

This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like an int64 pdarray.

export_uint()[source]
format(x)[source]

Format a single integer IP address as a string.

normalize(x)[source]

Take in an IP address as a string, integer, or IPAddress object, and convert it to an integer.
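
The integer-to-string mapping that format and normalize describe can be illustrated with the standard-library ipaddress module. The sketch below is an assumption about the semantics, not arkouda's code; `format_ip` and `normalize_ip` are hypothetical stand-ins operating on single Python values.

```python
import ipaddress

def format_ip(x: int) -> str:
    """Render a single integer IP address as a dotted-quad string."""
    return str(ipaddress.IPv4Address(x))

def normalize_ip(x) -> int:
    """Accept a string, an integer, or an IPv4Address and return an integer."""
    return int(ipaddress.IPv4Address(x))

text = format_ip(3232235777)          # 192*2**24 + 168*2**16 + 1*2**8 + 1
number = normalize_ip("192.168.1.1")
```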

opeq(other, op)[source]
register(user_defined_name)[source]

Register this IPv4 object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the IPv4 is to be registered under, this will be the root name for underlying components

Returns:

The same IPv4, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different IPv4s with the same name.

Return type:

IPv4

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the IPv4 with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

special_objType = 'IPv4'
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute')[source]

Override of the pdarray to_hdf to store the special object type

to_list()[source]

Export array as a list of integers.

to_ndarray()[source]

Export array as a numpy array of integers.

update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Override the pdarray implementation so that the special object type will be used.

values
arkouda.util.Index
class arkouda.util.SegArray(segments, values, lengths=None, grouping=None)[source]
AND(x=None)[source]
OR(x=None)[source]
XOR(x=None)[source]
aggregate(op, x=None)[source]
all(x=None)[source]
any(x=None)[source]
append(other, axis=0)[source]

Append other to self, either vertically (axis=0, length of resulting SegArray increases), or horizontally (axis=1, each sub-array of other appends to the corresponding sub-array of self).

Parameters:
  • other (SegArray) – Array of sub-arrays to append

  • axis (0 or 1) – Whether to append vertically (0) or horizontally (1). If axis=1, other must be same size as self.

Returns:

axis=0: New SegArray containing all sub-arrays axis=1: New SegArray of same length, with pairs of sub-arrays concatenated

Return type:

SegArray
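
The difference between the two axes can be shown with nested Python lists standing in for SegArrays. This is a sketch of the described behaviour only; `append_rows` is a hypothetical helper.

```python
def append_rows(left, right, axis=0):
    """Append right to left vertically (axis=0) or horizontally (axis=1)."""
    if axis == 0:
        # The result has len(left) + len(right) sub-arrays.
        return [list(row) for row in left] + [list(row) for row in right]
    if axis == 1:
        if len(left) != len(right):
            raise ValueError("axis=1 requires the same number of sub-arrays")
        # Each sub-array of right extends the matching sub-array of left.
        return [list(l) + list(r) for l, r in zip(left, right)]
    raise ValueError("axis must be 0 or 1")

left = [[1, 2], [3]]
right = [[4], [5, 6]]
vertical = append_rows(left, right, axis=0)
horizontal = append_rows(left, right, axis=1)
```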

append_single(x, prepend=False)[source]

Append a single value to each sub-array.

Parameters:

x (pdarray or scalar) – Single value to append to each sub-array

Returns:

Copy of original SegArray with values from x appended to each sub-array

Return type:

SegArray

argmax(x=None)[source]
argmin(x=None)[source]
classmethod attach(user_defined_name)[source]

Using the defined name, attach to a SegArray that has been registered to the Symbol Table

Parameters:

user_defined_name (str) – user defined name which the SegArray object was registered under

Returns:

The resulting SegArray

Return type:

SegArray

Raises:

RuntimeError – Raised if the server could not attach to the SegArray object

classmethod concat(x, axis=0, ordered=True)[source]

Concatenate a sequence of SegArrays

Parameters:
  • x (sequence of SegArray) – The SegArrays to concatenate

  • axis (0 or 1) – Select vertical (0) or horizontal (1) concatenation. If axis=1, all SegArrays must have same size.

  • ordered (bool) – Must be True. This option is present for compatibility only, because unordered concatenation is not yet supported.

Returns:

The input arrays joined into one SegArray

Return type:

SegArray

copy()[source]

Return a deep copy.

dtype
filter(filter, discard_empty: bool = False)[source]

Filter values out of the SegArray object

Parameters:
  • filter (pdarray, list, or value) – The value(s) to be filtered out of the SegArray

  • discard_empty (bool) – Defaults to False. When True, empty segments are removed from the return SegArray

Return type:

SegArray

classmethod from_multi_array(m)[source]

Construct a SegArray from a list of columns. This essentially transposes the input, resulting in an array of rows.

Parameters:

m (list of pdarray or Strings) – List of columns, the rows of which will form the sub-arrays of the output

Returns:

Array of rows of input

Return type:

SegArray

classmethod from_parts(segments, values, lengths=None, grouping=None) SegArray[source]

DEPRECATED. Construct a SegArray object from its parts.

Parameters:
  • segments (pdarray, int64) – Start index of each sub-array in the flattened values array

  • values (pdarray) – The flattened values of all sub-arrays

  • lengths (pdarray) – The length of each segment

  • grouping (GroupBy) – grouping of segments

Returns:

Data structure representing an array whose elements are variable-length arrays.

Return type:

SegArray

Notes

Keyword args ‘lengths’ and ‘grouping’ are not user-facing. They are used by the attach method.
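
How the segments array (start offsets) and the flattened values array together define the sub-arrays can be sketched in plain Python. `split_segments` is a hypothetical helper for illustration and assumes the offsets are non-decreasing.

```python
def split_segments(segments, values):
    """Rebuild the list of sub-arrays from start offsets and flat values."""
    # Each sub-array runs from its own start to the next start;
    # the last one runs to the end of values.
    bounds = list(segments) + [len(values)]
    return [list(values[start:stop]) for start, stop in zip(bounds, bounds[1:])]

rows = split_segments([0, 4, 7], list(range(12)))
lengths = [len(r) for r in rows]
with_empty = split_segments([0, 2, 2], [1, 2, 3])
```

Two equal consecutive offsets produce an empty sub-array, as in the last call.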

classmethod from_return_msg(rep_msg) SegArray[source]
get_jth(j, return_origins=True, compressed=False, default=0)[source]

Select the j-th element of each sub-array, where possible.

Parameters:
  • j (int) – The index of the value to get from each sub-array. If j is negative, it counts backwards from the end of each sub-array.

  • return_origins (bool) – If True, return a logical index indicating where j is in bounds

  • compressed (bool) – If False, return array is same size as self, with default value where j is out of bounds. If True, the return array only contains values where j is in bounds.

  • default (scalar) – When compressed=False, the value to return when j is out of bounds for the sub-array

Returns:

  • val (pdarray) – compressed=False: The j-th value of each sub-array where j is in bounds and the default value where j is out of bounds. compressed=True: The j-th values of only the sub-arrays where j is in bounds

  • origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.

Notes

If values are Strings, only the compressed format is supported.
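
The interaction of compressed and default can be sketched with nested lists. `get_jth_sketch` is a hypothetical helper that follows the description above; unlike the real method it always returns the in-bounds mask.

```python
def get_jth_sketch(rows, j, compressed=False, default=0):
    """Select the j-th element of each sub-array where j is in bounds."""
    # Negative j counts from the end, so it is valid when -len(r) <= j.
    in_bounds = [-len(r) <= j < len(r) for r in rows]
    if compressed:
        vals = [r[j] for r, ok in zip(rows, in_bounds) if ok]
    else:
        vals = [r[j] if ok else default for r, ok in zip(rows, in_bounds)]
    return vals, in_bounds

rows = [[0, 1, 2, 3], [4], [], [7, 8]]
full, mask = get_jth_sketch(rows, 1, compressed=False, default=-1)
short, _ = get_jth_sketch(rows, 1, compressed=True)
last, last_mask = get_jth_sketch(rows, -1, default=-1)
```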

get_length_n(n, return_origins=True)[source]

Return all sub-arrays of length n, as a list of columns.

Parameters:
  • n (int) – Length of sub-arrays to select

  • return_origins (bool) – Return a logical index indicating which sub-arrays are length n

Returns:

  • columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long sub-arrays from the SegArray. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where sub-array has length n.

get_ngrams(n, return_origins=True)[source]

Return all n-grams from all sub-arrays.

Parameters:
  • n (int) – Length of n-gram

  • return_origins (bool) – If True, return an int64 array indicating which sub-array each returned n-gram came from.

Returns:

  • ngrams (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-gram.

  • origin_indices (pdarray, int) – The index of the sub-array from which the corresponding n-gram originated
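
The column-oriented return layout can be illustrated with a plain-Python sketch. `ngrams_sketch` is a hypothetical helper; it returns n columns so that row i of the implied table is the i-th n-gram, plus the index of the originating sub-array.

```python
def ngrams_sketch(rows, n):
    """Return n columns (one per n-gram position) and origin indices."""
    columns = [[] for _ in range(n)]
    origins = []
    for row_index, row in enumerate(rows):
        # Sub-arrays shorter than n contribute no n-grams.
        for start in range(len(row) - n + 1):
            for offset in range(n):
                columns[offset].append(row[start + offset])
            origins.append(row_index)
    return columns, origins

columns, origins = ngrams_sketch([[1, 2, 3], [4], [5, 6]], 2)
```

Here the bigrams are (1, 2), (2, 3) and (5, 6), so the first column holds 1, 2, 5 and the second holds 2, 3, 6.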

get_prefixes(n, return_origins=True, proper=True)[source]

Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)

Parameters:
  • n (int) – Length of prefix

  • return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a prefix.

Returns:

  • prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-prefix, False otherwise.
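
The effect of the proper flag can be sketched with nested lists. `prefixes_sketch` is a hypothetical helper that follows the description above: with proper=True a sub-array needs at least n+1 elements, otherwise n elements suffice.

```python
def prefixes_sketch(rows, n, proper=True):
    """Return n columns of prefix values and the mask of qualifying rows."""
    min_len = n + 1 if proper else n
    mask = [len(r) >= min_len for r in rows]
    # Column i holds the i-th element of every qualifying sub-array.
    columns = [[r[i] for r, ok in zip(rows, mask) if ok] for i in range(n)]
    return columns, mask

rows = [[1, 2, 3], [4, 5], [6]]
proper_cols, proper_mask = prefixes_sketch(rows, 2, proper=True)
any_cols, any_mask = prefixes_sketch(rows, 2, proper=False)
```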

get_suffixes(n, return_origins=True, proper=True)[source]

Return the n-long suffix of each sub-array, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a suffix.

Returns:

  • suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.

property grouping
hash() Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each segment.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]
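
Concatenating the two 64-bit halves into one 128-bit value can be sketched with Python integers. The order of the halves is an assumption here (first array as the upper 64 bits), since the text does not state it, and `combine_hash` is a hypothetical helper.

```python
def combine_hash(upper: int, lower: int) -> int:
    """Join two 64-bit values into one 128-bit integer (upper half first)."""
    mask = (1 << 64) - 1
    # Masking reinterprets negative int64 values as unsigned 64-bit patterns.
    return ((upper & mask) << 64) | (lower & mask)

h = combine_hash(1, 2)
```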

intersect(other)[source]

Computes the intersection of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d intersections of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.intersect1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.intersect(seg_b)
SegArray([
[1, 3],
[4]
])
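
The segment-wise nature of the operation can be checked with a plain-Python sketch that pairs up sub-arrays by position. `segment_intersect` is a hypothetical helper; sorting the result is an assumption made so the output matches the example above.

```python
def segment_intersect(left, right):
    """Intersect the i-th sub-array of left with the i-th sub-array of right."""
    return [sorted(set(l) & set(r)) for l, r in zip(left, right)]

seg_a = [[1, 2, 3, 1, 4], [3, 1, 4, 5]]
seg_b = [[1, 3, 3, 5], [2, 2, 4]]
result = segment_intersect(seg_a, seg_b)
```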
is_registered() bool[source]

Checks if the name of the SegArray object is registered in the Symbol Table

Returns:

True if SegArray is registered, false if not

Return type:

bool

classmethod load(prefix_path, dataset='segarray', segment_name='segments', value_name='values')[source]
logger
max(x=None)[source]
mean(x=None)[source]
min(x=None)[source]
property nbytes

The size of the segarray in bytes.

Returns:

The size of the segarray in bytes.

Return type:

int

property non_empty
nunique(x=None)[source]
objType = 'SegArray'
prepend_single(x)[source]
prod(x=None)[source]
classmethod read_hdf(prefix_path, dataset='segarray')[source]

Load a saved SegArray from HDF5. All arguments must match what was supplied to SegArray.save()

Parameters:
  • prefix_path (str) – Directory and filename prefix

  • dataset (str) – Name prefix for saved data within the HDF5 files

Return type:

SegArray

register(user_defined_name)[source]

Register this SegArray object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name which this SegArray object will be registered under

Returns:

The same SegArray, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different SegArrays with the same name.

Return type:

SegArray

Raises:

RegistrationError – Raised if the server could not register the SegArray object

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
remove_repeats(return_multiplicity=False)[source]

Condense sequences of repeated values within a sub-array to a single value.

Parameters:

return_multiplicity (bool) – If True, also return the number of times each value was repeated.

Returns:

  • norepeats (SegArray) – Sub-arrays with runs of repeated values replaced with single value

  • multiplicity (SegArray) – If return_multiplicity=True, this array contains the number of times each value in the returned SegArray was repeated in the original SegArray.
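
Run-length condensing within each sub-array can be sketched with itertools.groupby. `remove_repeats_sketch` is a hypothetical helper that always returns both results; it illustrates the semantics rather than the server-side algorithm.

```python
from itertools import groupby

def remove_repeats_sketch(rows):
    """Collapse runs of equal values and report each run's length."""
    norepeats, multiplicity = [], []
    for row in rows:
        # groupby yields one (value, run) pair per run of consecutive equals.
        runs = [(value, sum(1 for _ in run)) for value, run in groupby(row)]
        norepeats.append([value for value, _ in runs])
        multiplicity.append([count for _, count in runs])
    return norepeats, multiplicity

norepeats, multiplicity = remove_repeats_sketch([[1, 1, 2, 2, 2, 1], [3], []])
```

Only consecutive repeats are condensed, so the trailing 1 in the first sub-array is kept as its own run.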

save(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]

DEPRECATED. Save the SegArray to HDF5. The object can be saved to a collection of files or a single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

See also

to_hdf, load

segments
set_jth(i, j, v)[source]

Set the j-th element of each sub-array in a subset.

Parameters:
  • i (pdarray, int) – Indices of sub-arrays to set j-th element

  • j (int) – Index of value to set in each sub-array. If j is negative, it counts backwards from the end of the sub-array.

  • v (pdarray or scalar) – The value(s) to set. If v is a pdarray, it must have same length as i.

Raises:

ValueError – If j is out of bounds in any of the sub-arrays specified by i.

setdiff(other)[source]

Computes the set difference of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d set difference of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.setdiff1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setdiff(seg_b)
SegArray([
[2, 4],
[1, 3, 5]
])
setxor(other)[source]

Computes the symmetric difference of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d symmetric difference of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.setxor1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setxor(seg_b)
SegArray([
[2, 4, 5],
[1, 3, 5, 2]
])
size
sum(x=None)[source]
to_hdf(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]

Save the SegArray to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Return type:

None

See also

load

to_list()[source]

Convert the segarray into a list containing sub-arrays

Returns:

A list with the same sub-arrays (also list) as this segarray

Return type:

list

See also

to_ndarray

Examples

>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_list()
[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]
>>> type(segarr.to_list())
list
to_ndarray()[source]

Convert the array into a numpy.ndarray containing sub-arrays

Returns:

A numpy ndarray with the same sub-arrays (also numpy.ndarray) as this array

Return type:

np.ndarray

See also

array, to_list

Examples

>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_ndarray()
array([array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9, 10, 11])])
>>> type(segarr.to_ndarray())
numpy.ndarray
to_parquet(prefix_path, dataset='segarray', mode: str = 'truncate', compression: str | None = None)[source]

Save the SegArray object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the object to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str) – Deprecated. Parameter kept to maintain functionality of other calls. Only Truncate supported. By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – If write mode is not Truncate.

Notes

  • Append mode for Parquet has been deprecated. It was not implemented for SegArray.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Segmented Array to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Segmented Array is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, using ports in the range {port..(port+numLocales)} (e.g., when running an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
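
The ports that need to be free before calling transfer can be listed with a short sketch. It follows the worked example in the port description (4 locales starting at 1234 use 1234 through 1237); `transfer_ports` and its num_locales argument are hypothetical and not part of arkouda.

```python
from typing import List

def transfer_ports(port: int, num_locales: int) -> List[int]:
    """List the consecutive ports used, one per locale, starting at port."""
    return list(range(port, port + num_locales))

ports = transfer_ports(1234, 4)
```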

union(other)[source]

Computes the union of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d union of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.union1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.union(seg_b)
SegArray([
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5]
])
unique(x=None)[source]

Return sub-arrays of unique values.

Parameters:

x (pdarray) – The values to unique, per group. By default, the values of this SegArray’s sub-arrays.

Returns:

Same number of sub-arrays as original SegArray, but elements in sub-array are unique and in sorted order.

Return type:

SegArray

unregister()[source]

Unregister this SegArray object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table

Notes

Objects registered with the server are immune to deletion until they are unregistered.

static unregister_segarray_by_name(user_defined_name)[source]

Using the defined name, remove the registered SegArray object from the Symbol Table

Parameters:

user_defined_name (str) – user defined name which the SegArray object was registered under

Return type:

None

Raises:

RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table

update_hdf(prefix_path: str, dataset: str = 'segarray', repack: bool = True)[source]

Overwrite the dataset with the name provided with this SegArray object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the SegArray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

valsize
values
class arkouda.util.Strings(strings_pdarray: arkouda.numpy.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]

Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.

entry

Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of

  • offsets array: starting indices for each string

  • bytes array: raw bytes of all strings joined by nulls

Type:

pdarray

size

The number of strings in the array

Type:

int_scalars

nbytes

The total number of bytes in all strings

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The sizes of each dimension of the array

Type:

tuple

dtype

The dtype is ak.str

Type:

dtype

logger

Used for all logging operations

Type:

ArkoudaLogger

Notes

Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
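
The offsets/bytes layout described in these notes can be illustrated client-side with plain Python. This is a minimal sketch of the layout only, assuming UTF-8 encoding and a single null byte after each string; to_segmented and from_segmented are hypothetical helpers, not arkouda functions.

```python
def to_segmented(strings):
    """Build (offsets, raw bytes) with each string followed by a null byte."""
    offsets, raw = [], bytearray()
    for s in strings:
        offsets.append(len(raw))          # start index of this string
        raw += s.encode("utf-8") + b"\x00"
    return offsets, list(raw)


def from_segmented(offsets, raw):
    """Rebuild the Python strings from the two components."""
    bounds = offsets + [len(raw)]
    out = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        out.append(bytes(raw[start:end - 1]).decode("utf-8"))  # drop the null
    return out


offsets, raw = to_segmented(["one", "two", "three"])
print(offsets)  # [0, 4, 8]
print(raw)      # [111, 110, 101, 0, 116, 119, 111, 0, 116, 104, 114, 101, 101, 0]
```

The printed values agree with the get_offsets() and get_bytes() examples shown for this class.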

BinOps
astype(dtype: numpy.dtype | str) arkouda.numpy.pdarrayclass.pdarray[source]

Cast values of Strings object to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) Strings[source]

class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which the Strings object was registered under

Returns:

the Strings object registered with user_defined_name in the arkouda server

Return type:

Strings object

Raises:

TypeError – Raised if user_defined_name is not a str

See also

register, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

cached_regex_patterns() List[source]

Returns the regex patterns for which Match objects have been cached

capitalize() Strings[source]

Returns a new Strings in which each string has its first letter capitalized and the remaining letters lowercased.

Returns:

Strings from the original replaced with the capitalized equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, Strings.upper, Strings.title

Examples

>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3', 'StrINgS aRe Here 4'])
>>> strings.capitalize()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3', 'Strings are here 4'])
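
For comparison, Python's built-in str methods distinguish capitalize from title as shown below. This sketch assumes the Strings methods mirror the built-in behavior, which the descriptions suggest but which is not verified here.

```python
s = "StrINgS aRe Here 0"

# capitalize: first character upper-cased, everything else lower-cased
capitalized = s.capitalize()
# title: first character of every word upper-cased
titled = s.title()

print(capitalized)  # Strings are here 0
print(titled)       # Strings Are Here 0
```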
contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (bytes or str_scalars) – The substring in the form of string or byte array to search for

  • regex (bool, default=False) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads/lookbehinds are not supported).

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Raised if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True True True True True])
>>> strings.contains(r'string \d', regex=True)
array([True True True True True])
decode(fromEncoding: str, toEncoding: str = 'UTF-8') Strings[source]

Return a new Strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding

Parameters:
  • fromEncoding (str) – The current encoding of the strings object

  • toEncoding (str, default="UTF-8") – The encoding that the strings will be converted to, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

dtype
encode(toEncoding: str, fromEncoding: str = 'UTF-8') Strings[source]

Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding

Parameters:
  • toEncoding (str) – The encoding that the strings will be converted to

  • fromEncoding (str, default="UTF-8") – The current encoding of the strings object, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (bytes or str_scalars) – The suffix to search for

  • regex (bool, default=False) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads/lookbehinds are not supported).

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Raised if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True True True True True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith(r'ing \d', regex=True)
array([True True True True True])
entry: arkouda.numpy.pdarrayclass.pdarray
equals(other: Any) arkouda.numpy.dtypes.bool_scalars[source]

Whether Strings are the same size and all entries are equal.

Parameters:

other (Any) – object to compare.

Returns:

True if the Strings are the same, otherwise False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]

Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches

Parameters:

pattern (bytes or str_scalars) – The regex pattern used to find matches

Returns:

  • pdarray, int64 – For each original string, the number of pattern matches

  • pdarray, int64 – The start positions of pattern matches

  • pdarray, int64 – The lengths of pattern matches

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations(r'\d')
>>> num_matches
array([2 2 2 2 2])
>>> starts
array([0 9 0 9 0 9 0 9 0 9])
>>> lens
array([1 1 1 1 1 1 1 1 1 1])
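
The three return values can be reproduced for small inputs with Python's re module, which is handy for sanity-checking a pattern before sending it to the server. This is a client-side sketch only: the function below is a stand-in that shares the method's name, Python's re is not re2, and positions here are character indices (identical to byte indices for ASCII input).

```python
import re


def find_locations(strings, pattern):
    """Return (matches per string, flat start positions, flat match lengths)."""
    regex = re.compile(pattern)
    num_matches, starts, lens = [], [], []
    for s in strings:
        matches = list(regex.finditer(s))
        num_matches.append(len(matches))
        starts.extend(m.start() for m in matches)
        lens.extend(m.end() - m.start() for m in matches)
    return num_matches, starts, lens


strings = [f"{i} string {i}" for i in range(1, 6)]
num_matches, starts, lens = find_locations(strings, r"\d")
print(num_matches)  # [2, 2, 2, 2, 2]
print(starts)       # [0, 9, 0, 9, 0, 9, 0, 9, 0, 9]
print(lens)         # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```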
findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple[source]

Return a new Strings containing all non-overlapping matches of pattern

Parameters:
  • pattern (bytes or str_scalars) – Regex used to find matches

  • return_match_origins (bool, default=False) – If True, return a pdarray containing the index of the original string each pattern match is from

Returns:

  • Strings – Strings object containing only pattern matches

  • pdarray, int64 (optional) – The index of the original string each pattern match is from

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
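
The relationship between the matches and their origin indices can be sketched with Python's re module. This is an illustration of the documented output shape, not arkouda's implementation; findall_with_origins is a hypothetical helper.

```python
import re


def findall_with_origins(strings, pattern):
    """Flat list of matches plus the index of the string each came from."""
    regex = re.compile(pattern)
    matches, origins = [], []
    for i, s in enumerate(strings):
        for m in regex.finditer(s):
            matches.append(m.group(0))
            origins.append(i)
    return matches, origins


strings = ['1_2___', '____', '3', '__4___5____6___7', '']
matches, origins = findall_with_origins(strings, r'_+')
print(matches)  # ['_', '___', '____', '__', '___', '____', '___']
print(origins)  # [0, 0, 1, 3, 3, 3, 3]
```

Strings with no match (indices 2 and 4 here) simply do not appear in the origins list.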
flatten() Strings[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

Note

As multidimensional Strings are not currently supported, flatten on a Strings object will always return itself.

static from_parts(offset_attrib: arkouda.numpy.pdarrayclass.pdarray | str, bytes_attrib: arkouda.numpy.pdarrayclass.pdarray | str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.

Parameters:
  • offset_attrib (pdarray or str) – the array containing the offsets

  • bytes_attrib (pdarray or str) – the array containing the string values

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble them into a composite entity.

static from_return_msg(rep_msg: str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response message

Parameters:

rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.

fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the whole string matches the regular expression pattern

Parameters:

pattern (bytes or str_scalars) – Regex used to find matches

Returns:

Match object where elements match only if the whole string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=False; matched=False>
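
The difference between fullmatch and match can be checked with Python's re module on the same input. This sketch uses re rather than re2 and returns plain spans instead of an ak.Match object; the helper spans is hypothetical.

```python
import re


def spans(match_fn, pattern, strings):
    """Span of the match for each string, or None where there is no match."""
    out = []
    for s in strings:
        m = match_fn(pattern, s)
        out.append(m.span() if m else None)
    return out


strings = ['1_2___', '____', '3', '__4___5____6___7', '']
full = spans(re.fullmatch, r'_+', strings)    # whole string must match
prefix = spans(re.match, r'_+', strings)      # only the beginning must match
print(full)    # [None, (0, 4), None, None, None]
print(prefix)  # [None, (0, 4), None, (0, 2), None]
```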
get_bytes() arkouda.numpy.pdarrayclass.pdarray[source]

Getter for the bytes component (uint8 pdarray) of this Strings.

Returns:

Pdarray of bytes of the string accessed

Return type:

pdarray, uint8

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
get_lengths() arkouda.numpy.pdarrayclass.pdarray[source]

Return the length of each string in the array.

Returns:

The length of each string

Return type:

pdarray, int

Raises:

RuntimeError – Raised if there is a server-side error thrown

get_offsets() arkouda.numpy.pdarrayclass.pdarray[source]

Getter for the offsets component (int64 pdarray) of this Strings.

Returns:

Pdarray of offsets of the string accessed

Return type:

pdarray, int64

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.numpy.pdarrayclass.pdarray][source]

Return the n-long prefix of each string, where possible

Parameters:
  • n (int_scalars) – Length of prefix

  • return_origins (bool, default=True) – If True, return a logical index indicating which strings were long enough to return an n-prefix

  • proper (bool, default=True) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.

Returns:

  • prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
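
The interaction between n, proper, and the returned mask can be sketched in plain Python. The function below is a stand-in that follows the parameter descriptions above (at least n+1 characters when proper is True, at least n otherwise); it is not the arkouda implementation.

```python
def get_prefixes(strings, n, proper=True):
    """Return (n-character prefixes of qualifying strings, boolean mask)."""
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in strings]
    prefixes = [s[:n] for s, keep in zip(strings, mask) if keep]
    return prefixes, mask


words = ["ab", "abc", "abcd", "a"]
proper_prefixes, proper_mask = get_prefixes(words, 3)
any_prefixes, any_mask = get_prefixes(words, 3, proper=False)
print(proper_prefixes, proper_mask)  # ['abc'] [False, False, True, False]
print(any_prefixes, any_mask)        # ['abc', 'abc'] [False, True, True, False]
```

The number of returned prefixes always equals the number of True entries in the mask, so the mask can be used to align prefixes with the original array.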

get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.numpy.pdarrayclass.pdarray][source]

Return the n-long suffix of each string, where possible

Parameters:
  • n (int_scalars) – Length of suffix

  • return_origins (bool, default=True) – If True, return a logical index indicating which strings were long enough to return an n-suffix

  • proper (bool, default=True) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.

Returns:

  • suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.

group() arkouda.numpy.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.

Raises:

RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message

hash() Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each string.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
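
The "negligible" collision probability can be made concrete with the standard birthday-bound approximation, p ≈ n(n-1)/2^(b+1) for n items and b-bit hashes. The sketch below assumes ideally uniform hash values, which is an idealization of any real hash function.

```python
def collision_probability(num_items: int, hash_bits: int = 128) -> float:
    """Birthday-bound approximation of the chance of at least one collision."""
    return num_items * (num_items - 1) / 2 ** (hash_bits + 1)


p = collision_probability(10**15)
print(f"{p:.1e}")  # on the order of 1e-09
```

Even at 10**15 strings, the estimated chance of any collision stays around one in a billion.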

property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

isalnum() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.

Returns:

True for elements that are alphanumeric, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
isalpha() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.

Returns:

True for elements that are alphabetic, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA','StringB','StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA', 'StringB', 'StringC'])
>>> strings.isalpha()
array([False False False True True True])
isdecimal() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.

Returns:

True for elements that are decimals, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isdigit

Examples

>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])

Special Character Examples

>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
isdigit() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])

Special Character Examples

>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
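
The special-character results for isdecimal and isdigit match the behavior of Python's built-in str methods, so the distinction can be explored locally. This sketch uses only built-in str methods and assumes nothing about the server-side implementation.

```python
# "\u00b2" is superscript two, "\u00b3" superscript three, "\u2087" subscript seven
samples = ["3.14", "0", "\u00b2", "2\u00b3\u2087", "2\u00b3x\u2087"]

# isdecimal accepts only characters usable in base-10 numbers
decimal_flags = [s.isdecimal() for s in samples]
# isdigit also accepts superscript and subscript digits
digit_flags = [s.isdigit() for s in samples]

print(decimal_flags)  # [False, True, False, False, False]
print(digit_flags)    # [False, True, True, True, False]
```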
isempty() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.

Returns:

True for elements that are the empty string, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '', '', ''])
>>> strings.isempty()
array([False False False True True True])
islower() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase

Returns:

True for elements that are entirely lowercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isupper

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
isspace() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ’, ‘\t’, ‘\n’, ‘\v’, ‘\f’, ‘\r’).

Returns:

True for elements that are whitespace, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '\t', '\n', '\v', '\f', '\r', ' \t\n\v\f\r'])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ', '\u0009', '\n', '\u000B', '\u000C', '\u000D', ' \u0009\n\u000B\u000C\u000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
istitle() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase

Returns:

True for elements that are titlecase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
isupper() arkouda.numpy.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase

Returns:

True for elements that are entirely uppercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.islower

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
logger
lower() Strings[source]

Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent

Returns:

Strings with all uppercase characters from the original replaced with their lowercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings[source]

Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (bytes or str_scalars, default="") – String inserted between self and other

Returns:

The array of joined strings, as other + self

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance

  • RuntimeError – Raised if there is a server-side error thrown

See also

stick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the beginning of the string matches the regular expression pattern

Parameters:

pattern (bytes or str_scalars) – Regex used to find matches

Returns:

Match object where elements match only if the beginning of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=True, span=(0, 2); matched=False>
objType = 'Strings'
peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple[Strings, Strings][source]

Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (bytes or str_scalars) – The separator where the split will occur

  • times (int_scalars, default=1) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters

  • includeDelimiter (bool, default=False) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.

  • keepPartial (bool, default=False) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.

  • fromRight (bool, default=False) – If true, peel from the right instead of the left (see also rpeel)

  • regex (bool, default=False) – Indicates whether delimiter is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads/lookbehinds are not supported).

Returns:

left: Strings

The field(s) peeled from the beginning of each string (unless fromRight is true)

right: Strings

The remainder of each string after peeling (unless fromRight is true)

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

rpeel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
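
The way times, includeDelimiter, and keepPartial interact can be traced with a plain-Python emulation of a left peel on a single string. This is a sketch inferred from the examples above for a literal (non-regex) delimiter; peel_one is a hypothetical helper and not part of arkouda.

```python
def peel_one(s, delimiter, times=1, include_delimiter=False, keep_partial=False):
    """Split s at the times-th occurrence of delimiter, counting from the left."""
    pos, start = -1, 0
    for _ in range(times):
        pos = s.find(delimiter, start)
        if pos == -1:
            # Fewer than `times` delimiters: the whole string goes to one side.
            return (s, "") if keep_partial else ("", s)
        start = pos + len(delimiter)
    left = s[:start] if include_delimiter else s[:pos]
    return left, s[start:]


def peel(strings, delimiter, **kwargs):
    """Apply peel_one to every string and return (lefts, rights)."""
    pairs = [peel_one(s, delimiter, **kwargs) for s in strings]
    return [p[0] for p in pairs], [p[1] for p in pairs]


s = ['a.b', 'c.d', 'e.f.g']
print(peel(s, '.'))                              # (['a', 'c', 'e'], ['b', 'd', 'f.g'])
print(peel(s, '.', include_delimiter=True))      # (['a.', 'c.', 'e.'], ['b', 'd', 'f.g'])
print(peel(s, '.', times=2))                     # (['', '', 'e.f'], ['a.b', 'c.d', 'g'])
print(peel(s, '.', times=2, keep_partial=True))  # (['a.b', 'c.d', 'e.f'], ['', '', 'g'])
```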
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

purge_cached_regex_patterns() None[source]

purges cached regex patterns

regex_split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple[source]

Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur

Parameters:
  • pattern (bytes or str_scalars) – Regex used to split strings into substrings

  • maxsplit (int, default=0) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences

  • return_segments (bool, default=False) – If True, return mapping of original strings to first substring in return array.

Returns:

  • Strings – Substrings with pattern matches removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.regex_split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
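
The flattened substrings and the segments array can be reproduced with Python's re.split, which also treats maxsplit=0 as "split on every occurrence". This is a client-side sketch using re rather than re2; the function below is a stand-in that shares the method's name.

```python
import re


def regex_split(strings, pattern, maxsplit=0):
    """Return (flat list of substrings, index of each string's first substring)."""
    regex = re.compile(pattern)
    pieces, segments = [], []
    for s in strings:
        segments.append(len(pieces))  # where this string's pieces begin
        pieces.extend(regex.split(s, maxsplit=maxsplit))
    return pieces, segments


strings = ['1_2___', '____', '3', '__4___5____6___7', '']
pieces, segments = regex_split(strings, r'_+', maxsplit=2)
print(pieces)    # ['1', '2', '', '', '', '3', '', '4', '5____6___7', '']
print(segments)  # [0, 3, 5, 6, 9]
```

Slicing pieces between consecutive segment values recovers the substrings that belong to each original string.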
register(user_defined_name: str) Strings[source]

Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation; registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.

Parameters:

user_defined_name (str) – user defined name which the Strings object is to be registered under

Returns:

The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different objects with the same name.

Return type:

Strings

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – Raised if the server was unable to register the Strings object with the user_defined_name. If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.

See also

attach, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

registered_name: str | None = None
rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False) Tuple[Strings, Strings][source]

Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (bytes or str_scalars) – The separator where the split will occur

  • times (int_scalars, default=1) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters

  • includeDelimiter (bool, default=False) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.

  • keepPartial (bool, default=False) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.

  • regex (bool, default=False) – Indicates whether delimiter is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).

Returns:

left: Strings

The remainder of the string after peeling

right: Strings

The field(s) that were peeled from the right of each string

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

peel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))

Compared against peel

>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
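
The default behavior of rpeel can be approximated client-side with plain Python strings. The sketch below is not the arkouda implementation; rpeel_one and rpeel_all are hypothetical helper names, and it only models includeDelimiter=False, keepPartial=False, and regex=False as described in the parameter list above.

```python
def rpeel_one(s: str, delimiter: str, times: int = 1) -> tuple:
    """Peel `times` delimited fields off the right end of one string."""
    if times < 1:
        raise ValueError("times must be >= 1")
    # rsplit performs at most `times` splits, scanning from the right
    parts = s.rsplit(delimiter, times)
    if len(parts) <= times:
        # Fewer than `times` delimiters: with keepPartial=False the whole
        # string stays in the first (left) result, per the parameter docs.
        return s, ""
    return parts[0], delimiter.join(parts[1:])


def rpeel_all(strings, delimiter, times=1):
    """Apply rpeel_one element-wise and return (left, right) lists."""
    pairs = [rpeel_one(s, delimiter, times) for s in strings]
    return [p[0] for p in pairs], [p[1] for p in pairs]


left, right = rpeel_all(["a.b", "c.d", "e.f.g"], ".")
print(left, right)
```

With times=2 only 'e.f.g' contains enough delimiters, so in this sketch the other two strings are returned whole in the left result.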
save(prefix_path: str, dataset: str = 'strings_array', mode: Literal['truncate', 'append'] = 'truncate', save_offsets: bool = True, compression: Literal['snappy', 'gzip', 'brotli', 'zstd', 'lz4'] | None = None, file_format: Literal['HDF5', 'Parquet'] = 'HDF5', file_type: Literal['single', 'distribute'] = 'distribute') str[source]

DEPRECATED. Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 also supports single files, in which case the file name is exactly the one provided. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str, default="strings_array") – The name of the Strings dataset to be written, defaults to strings_array

  • mode ({"truncate", "append"}, default = "truncate") – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.

  • save_offsets (bool, default=True) – If True (the default), the server saves the offsets array to HDF5. If False, the offsets array is not saved and is derived from the string values upon load/read. This is not supported for Parquet files.

  • compression ({"snappy", "gzip", "brotli", "zstd", "lz4"}, optional) – Sets the compression type used with Parquet files

  • file_format ({"HDF5", "Parquet"}, default = "HDF5") – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type ({"single", "distribute"}, default = "distribute") – If ‘distribute’ (the default), the dataset is distributed over one file per locale. If ‘single’, the dataset is saved to one file.

Return type:

String message indicating result of save operation

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.

search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern

Parameters:

pattern (bytes or str_scalars) – Regex used to find matches

Returns:

Match object where elements match if any part of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4);
matched=False; matched=True, span=(0, 2); matched=False>
size: arkouda.numpy.dtypes.int_scalars
split(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple[source]

Unpack delimiter-joined substrings into a flat array.

Parameters:
  • delimiter (str) – Characters used to split strings into substrings

  • return_segments (bool, default=False) – If True, also return mapping of original strings to first substring in return array.

  • regex (bool, default=False) – Indicates whether delimiter is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).

Returns:

  • Strings – Flattened substrings with delimiters removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

See also

peel, rpeel

Examples

>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.split('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, mapping = orig.split('|', return_segments=True)
>>> mapping
array([0 2 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_split, under_map = under.split('_+', return_segments=True, regex=True)
>>> under_split
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0 2 5])
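
The relationship between the flattened substrings and the segments returned with return_segments=True can be sketched in plain Python. This is an illustration of the documented semantics only, not the server-side algorithm; split_flat is a hypothetical name and regex delimiters are not modeled.

```python
def split_flat(strings, delimiter):
    """Return (flat substrings, index of the first substring per original string)."""
    flat = []
    segments = []
    for s in strings:
        # The next substring appended belongs to this original string
        segments.append(len(flat))
        flat.extend(s.split(delimiter))
    return flat, segments


orig = ["one|two", "three|four|five", "six"]
flat, mapping = split_flat(orig, "|")

# Recover the substrings of original string i by slicing between segments
bounds = mapping + [len(flat)]
groups = [flat[bounds[i]:bounds[i + 1]] for i in range(len(orig))]
print(flat, mapping, groups)
```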
startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (bytes or str_scalars) – The prefix to search for

  • regex (bool, default=False) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Raised if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True True True True True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith(r'\d str', regex=True)
array([True True True True True])
stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings[source]

Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (bytes or str_scalars, default="") – String inserted between self and other

  • toLeft (bool, default=False) – If true, join other strings to the left of self. By default, other is joined to the right of self.

Returns:

The array of joined strings

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance

  • ValueError – Raised if times is < 1

  • RuntimeError – Raised if there is a server-side error thrown

See also

lstick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings[source]

Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.

Parameters:

chars (bytes or str_scalars, optional) – the set of characters to be removed

Returns:

Strings object with the leading and trailing characters matching the set of characters in the chars argument removed

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['Strings ', '  StringS  ', 'StringS   '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS  ', '  1StringS  12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
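
The "set of characters" semantics described above match those of Python's built-in str.strip, so the second example can be reproduced locally without a server. This is only a comparison with the Python built-in, not a statement about how the server implements strip.

```python
strings = ["Strings 1", "1 StringS  ", "  1StringS  12 "]

# chars is treated as a set: any mix of ' ', '1', and '2' is removed from both ends
stripped = [s.strip(" 12") for s in strings]
print(stripped)

# The order of characters in chars is irrelevant; it is not a prefix or suffix
mixed = "12ab21".strip("12")
print(mixed)
```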
sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings[source]

Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur in each element.

Parameters:
  • pattern (bytes or str_scalars) – The regex to substitute

  • repl (bytes or str_scalars) – The substring to replace pattern matches with

  • count (int, default=0) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl

Returns:

Strings with pattern matches replaced

Return type:

Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.subn

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple[source]

Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)

Parameters:
  • pattern (bytes or str_scalars) – The regex to substitute

  • repl (bytes or str_scalars) – The substring to replace pattern matches with

  • count (int, default=0) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl

Returns:

  • Strings – Strings with pattern matches replaced

  • pdarray, int64 – The number of substitutions made for each element of Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.sub

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
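
For simple patterns, the per-element behavior of sub and subn mirrors Python's re.subn applied to each string. The sketch below uses the standard re module rather than re2, so it is only a local approximation (the two engines differ on some syntax); subn_each is a hypothetical helper name.

```python
import re


def subn_each(strings, pattern, repl, count=0):
    """Apply re.subn element-wise; count=0 means replace every match."""
    compiled = re.compile(pattern)
    results = [compiled.subn(repl, s, count=count) for s in strings]
    new_strings = [r[0] for r in results]
    num_subs = [r[1] for r in results]
    return new_strings, num_subs


data = ["1_2___", "____", "3", "__4___5____6___7", ""]
replaced, counts = subn_each(data, "_+", "-", count=2)
print(replaced, counts)
```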
title() Strings[source]

Returns a new Strings with each element of the original converted to its titlecase equivalent.

Returns:

Strings with each element of the original converted to its titlecase equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, Strings.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False) str[source]

Write Strings to CSV file(s). Each file contains a single column with the Strings data. All CSV files written by Arkouda include a header denoting the data types of the columns. Unlike other file formats, CSV files store Strings in their UTF-8 form instead of storing bytes as uint(8).

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str, default="strings_array") – Column name to save the Strings under. Defaults to “strings_array”.

  • col_delim (str, default=",") – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool, default=False) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Returns:

response message

Return type:

str

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: Literal['truncate', 'append'] = 'truncate', save_offsets: bool = True, file_type: Literal['single', 'distribute'] = 'distribute') str[source]

Save the Strings object to HDF5. The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str, default="strings_array") – The name of the Strings dataset to be written, defaults to strings_array

  • mode ({"truncate", "append"}, default = "truncate") – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.

  • save_offsets (bool, default=True) – If True (the default), the server saves the offsets array to HDF5. If False, the offsets array is not saved and is derived from the string values upon load/read.

  • file_type ({"single", "distribute"}, default = "distribute") – If ‘distribute’ (the default), the dataset is distributed over one file per locale. If ‘single’, the dataset is saved to one file.

Return type:

String message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • Parquet files do not store the segments, only the values.

  • Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string

  • the hdf5 group is named via the dataset parameter.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales - 1 for file_type=‘distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
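
The naming rule in the notes above can be sketched as a small helper that lists the files a save would be expected to produce. This is an illustration only: expected_output_files is a hypothetical name, and the four-digit zero padding is an assumption based on the _LOCALE#### suffix described for to_csv, not something confirmed here for HDF5 output.

```python
def expected_output_files(prefix_path, num_locales, file_type="distribute"):
    """List the file names a save is expected to produce (assumed naming)."""
    if file_type == "single":
        # A single-file save uses prefix_path unchanged
        return [prefix_path]
    if file_type != "distribute":
        raise ValueError("file_type must be 'single' or 'distribute'")
    # One file per locale, numbered 0 .. num_locales - 1
    # ASSUMPTION: the locale index is zero-padded to four digits
    return [f"{prefix_path}_LOCALE{i:04d}" for i in range(num_locales)]


print(expected_output_files("/tmp/strs", 2))
```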

See also

to_hdf

to_list() list[source]

Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list with the same strings as this SegString

Return type:

list

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
<class 'list'>
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray with the same strings as this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
<class 'numpy.ndarray'>
to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: Literal['truncate', 'append'] = 'truncate', compression: Literal['snappy', 'gzip', 'brotli', 'zstd', 'lz4'] | None = None) str[source]

Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str, default="strings_array") – Name of the dataset to create in files (must not already exist)

  • mode ({"truncate", "append"}, default = "truncate") – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • compression ({"snappy", "gzip", "brotli", "zstd", "lz4"}, optional) – Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales - 1 for file_type=‘distribute’.

  • ‘append’ write mode is supported, but is not efficient.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars) str | memoryview[source]

Sends a Strings object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.

  • port (int_scalars) – The port to send the array over. This must be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, so it uses the ports in the range {port..(port+numLocales-1)}. For example, when running an Arkouda server of 4 nodes and passing 1234 as port, Arkouda uses ports 1234, 1235, 1236, and 1237 to send the array data. This port must match the port passed to the call to ak.receive_array().
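
The port arithmetic from the 4-node example can be written out as follows. transfer_ports is a hypothetical helper used only to make the range explicit, for instance when checking firewall rules before a transfer; it does not call arkouda.

```python
def transfer_ports(port, num_locales):
    """Ports used by a transfer: one per locale, starting at `port`."""
    if num_locales < 1:
        raise ValueError("num_locales must be >= 1")
    return list(range(port, port + num_locales))


print(transfer_ports(1234, 4))
```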

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

See also

register, attach

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

static unregister_strings_by_name(user_defined_name: str) None[source]

Unregister a Strings object in the arkouda server previously registered via register()

Parameters:

user_defined_name (str) – The registered name of the Strings object

update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True) str[source]

Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist, it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str, default="strings_array") – Name of the dataset to create in files

  • save_offsets (bool, default=True) – If True (the default), the server saves the offsets array to HDF5. If False, the offsets array is not saved and is derived from the string values upon load/read.

  • repack (bool, default=True) – HDF5 does not release memory on delete. When True (the default), the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting this to False yields better performance, but causes file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Strings object

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

upper() Strings[source]

Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent

Returns:

Strings with all lowercase characters from the original replaced with their uppercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.lower

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
class arkouda.util.Timedelta(pda, unit: str = _BASE_UNIT)[source]

Bases: _AbstractBaseTime

Represents a duration, the difference between two dates or times.

Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.

Parameters:
  • pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array)

  • unit (str, default 'ns') –

    For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.

    Possible values:

    • ‘weeks’ or ‘w’

    • ‘days’ or ‘d’

    • ‘hours’ or ‘h’

    • ‘minutes’, ‘m’, or ‘t’

    • ‘seconds’ or ‘s’

    • ‘milliseconds’, ‘ms’, or ‘l’

    • ‘microseconds’, ‘us’, or ‘u’

    • ‘nanoseconds’, ‘ns’, or ‘n’

    Unlike in pandas, units cannot be combined or mixed with integers.

Notes

The .values attribute is always in nanoseconds with int64 dtype.
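
The unit handling described above (case-insensitive names, short aliases, prefixes of full names, values normalized to nanoseconds) can be sketched in plain Python. This is not the arkouda implementation: normalize_unit and to_nanoseconds are hypothetical names, and rejecting ambiguous prefixes such as ‘mi’ is a choice made for this sketch.

```python
_NS_PER_UNIT = {
    "weeks": 7 * 24 * 60 * 60 * 10**9,
    "days": 24 * 60 * 60 * 10**9,
    "hours": 60 * 60 * 10**9,
    "minutes": 60 * 10**9,
    "seconds": 10**9,
    "milliseconds": 10**6,
    "microseconds": 10**3,
    "nanoseconds": 1,
}

# Short aliases taken from the list of possible values above
_ALIASES = {
    "w": "weeks", "d": "days", "h": "hours",
    "m": "minutes", "t": "minutes", "s": "seconds",
    "ms": "milliseconds", "l": "milliseconds",
    "us": "microseconds", "u": "microseconds",
    "ns": "nanoseconds", "n": "nanoseconds",
}


def normalize_unit(unit):
    """Map an alias or a prefix of a full unit name to the full name."""
    key = unit.lower()
    if key in _ALIASES:
        return _ALIASES[key]
    matches = [name for name in _NS_PER_UNIT if name.startswith(key)]
    if len(matches) != 1:
        # ASSUMPTION: this sketch rejects unknown or ambiguous prefixes
        raise ValueError(f"unknown or ambiguous unit: {unit!r}")
    return matches[0]


def to_nanoseconds(values, unit="ns"):
    """Convert integer durations in `unit` to integer nanoseconds."""
    factor = _NS_PER_UNIT[normalize_unit(unit)]
    return [v * factor for v in values]


print(to_nanoseconds([1, 2], "sec"))
```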

abs()[source]

Absolute value of time interval.

property components
property days
is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property microseconds
property nanoseconds
register(user_defined_name)[source]

Register this Timedelta object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the timedelta is to be registered under, this will be the root name for underlying components

Returns:

The same Timedelta, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different Timedeltas with the same name.

Return type:

Timedelta

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the timedelta with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property seconds
special_objType = 'Timedelta'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0)[source]

Returns the standard deviation as a pd.Timedelta object

sum()[source]

Return the sum of all elements in the array.

supported_opeq
supported_with_datetime
supported_with_pdarray
supported_with_r_datetime
supported_with_r_pdarray
supported_with_r_timedelta
supported_with_timedelta
to_pandas()[source]

Convert array to a pandas TimedeltaIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.

See also

to_ndarray

total_seconds()[source]
unregister()[source]

Unregister this timedelta object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

arkouda.util.arange(*args, **kwargs) arkouda.numpy.pdarrayclass.pdarray[source]

arange([start,] stop[, stride,] dtype=int64)

Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.

The return value is cast to type dtype

Parameters:
  • start (int_scalars, optional) – Starting value (inclusive)

  • stop (int_scalars) – Stopping value (exclusive)

  • stride (int_scalars, optional) – The difference between consecutive elements. The default stride is 1. If stride is specified, then start must also be specified.

  • dtype (np.dtype, type, or str) – The target dtype to cast values to

  • max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays

Returns:

Integers from start (inclusive) to stop (exclusive) by stride

Return type:

pdarray, dtype

Raises:
  • TypeError – Raised if start, stop, or stride is not an int object

  • ZeroDivisionError – Raised if stride == 0

See also

linspace, zeros, ones, randint

Notes

Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.

Examples

>>> ak.arange(0, 5, 1)
array([0 1 2 3 4])
>>> ak.arange(5, 0, -1)
array([5 4 3 2 1])
>>> ak.arange(0, 10, 2)
array([0 2 4 6 8])
>>> ak.arange(-5, -10, -1)
array([-5 -6 -7 -8 -9])
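
The positional-argument rules above follow the same convention as Python's built-in range, which makes it easy to predict the values locally. The sketch below is a client-side illustration only; arange_values is a hypothetical name, and it raises ZeroDivisionError for a zero stride to match the Raises section (the built-in range raises ValueError instead).

```python
def arange_values(*args):
    """Resolve (stop), (start, stop), or (start, stop, stride) and list the values."""
    if len(args) == 1:
        start, stop, stride = 0, args[0], 1
    elif len(args) == 2:
        start, stop = args
        stride = 1
    elif len(args) == 3:
        start, stop, stride = args
    else:
        raise TypeError("expected 1 to 3 positional arguments")
    if stride == 0:
        raise ZeroDivisionError("stride cannot be zero")
    # range uses the same half-open interval [start, stop)
    return list(range(start, stop, stride))


print(arange_values(5))
print(arange_values(5, 0, -1))
```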
arkouda.util.attach(name: str)[source]
arkouda.util.attach_all(names: list)[source]

Attach to all objects registered with the names provided

Parameters:

names (list) – List of names to attach to

Return type:

dict

arkouda.util.broadcast(segments: pdarray, values: pdarray | Strings, size: int | np.int64 | np.uint64 = -1, permutation: pdarray | None = None)[source]

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

Parameters:
  • segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.

  • values (pdarray, Strings) – The values to broadcast, one per row (or group)

  • size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.

  • permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.

Returns:

The broadcast values, one per nonzero

Return type:

pdarray, Strings

Raises:

ValueError

  • If segments and values are different sizes

  • If segments are empty

  • If number of nonzeros (either user-specified or inferred from permutation) is less than one

Examples

>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
>>> # Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
>>> # If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
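
The semantics of segments, size, and permutation can be modeled with plain Python lists. This is a sequential sketch of the documented behavior, not the distributed server algorithm; broadcast_sketch is a hypothetical name.

```python
def broadcast_sketch(segments, values, size=-1, permutation=None):
    """Repeat values[i] across group i, then undo the grouping permutation."""
    if len(segments) != len(values):
        raise ValueError("segments and values must be the same size")
    if len(segments) == 0:
        raise ValueError("segments cannot be empty")
    if permutation is not None:
        # size is ignored and inferred from the permutation
        size = len(permutation)
    if size < 1:
        raise ValueError("number of nonzeros must be at least one")

    # Group i spans [segments[i], segments[i + 1]); the last group ends at size
    ends = list(segments[1:]) + [size]
    grouped = []
    for start, end, value in zip(segments, ends, values):
        grouped.extend([value] * (end - start))

    if permutation is None:
        return grouped

    # permutation maps grouped position -> original position, so invert it
    result = [None] * size
    for grouped_pos, original_pos in enumerate(permutation):
        result[original_pos] = grouped[grouped_pos]
    return result


print(broadcast_sketch([0, 2, 5], [0, 1, 2], 7))
print(broadcast_sketch([0, 2, 5], [0, 1, 2], permutation=list(range(6, -1, -1))))
```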
arkouda.util.broadcast_dims(sa: Sequence[int], sb: Sequence[int]) Tuple[int, Ellipsis][source]

Determine the shape of the broadcast pdarray given two array shapes.

see: https://data-apis.org/array-api/latest/API_specification/broadcasting.html#algorithm
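
A minimal sketch of the broadcasting algorithm from the linked array API specification is given below, written in plain Python. It is an illustration of that algorithm under the name broadcast_shapes_sketch (hypothetical), not a copy of the arkouda source, and raising ValueError for incompatible shapes is an assumption of this sketch.

```python
def broadcast_shapes_sketch(sa, sb):
    """Compute the broadcast shape of two shapes, aligning from the right."""
    ndim = max(len(sa), len(sb))
    result = []
    for offset in range(1, ndim + 1):
        # Missing leading dimensions are treated as size 1
        da = sa[-offset] if offset <= len(sa) else 1
        db = sb[-offset] if offset <= len(sb) else 1
        if da == 1:
            result.append(db)
        elif db == 1 or da == db:
            result.append(da)
        else:
            raise ValueError(f"shapes {tuple(sa)} and {tuple(sb)} cannot be broadcast")
    # Dimensions were collected right-to-left
    return tuple(reversed(result))


print(broadcast_shapes_sketch((8, 1, 6, 1), (7, 1, 5)))
```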

arkouda.util.coargsort(arrays: Sequence[arkouda.numpy.strings.Strings | arkouda.numpy.pdarrayclass.pdarray | arkouda.categorical.Categorical], algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD) arkouda.numpy.pdarrayclass.pdarray[source]

Return the permutation that groups the rows (left-to-right), if the input arrays are treated as columns. The permutation sorts numeric columns, but not strings/Categoricals – strings/Categoricals are grouped, but not ordered.

Parameters:
  • arrays (Sequence of Strings, pdarray, or Categorical) – The columns (int64, uint64, float64, Strings, or Categorical) to sort by row

  • algorithm (SortingAlgorithm, default=SortingAlgorithm.RadixSortLSD) – The algorithm to be used for sorting the arrays.

Returns:

The indices that permute the rows to grouped order

Return type:

pdarray of int64

Raises:

ValueError – Raised if the pdarrays are not of the same size or if the parameter is not an Iterable containing pdarrays, Strings, or Categoricals

See also

argsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive. Starts with the last array and moves forward. This sort operates directly on numeric types, but for Strings, it operates on a hash. Thus, while grouping of equivalent strings is guaranteed, lexicographic ordering of the groups is not. For Categoricals, coargsort sorts based on Categorical.codes which guarantees grouping of equivalent categories but not lexicographic ordering of those groups.

Examples

>>> a = ak.array([0, 1, 0, 1])
>>> b = ak.array([1, 1, 0, 0])
>>> perm = ak.coargsort([a, b])
>>> perm
array([2 0 3 1])
>>> a[perm]
array([0 0 1 1])
>>> b[perm]
array([0 1 0 1])
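
For numeric columns, the resulting permutation matches a stable argsort on row tuples, which can be checked locally. The sketch below uses Python's built-in sorted rather than a radix sort, and coargsort_sketch is a hypothetical name; it does not model the hash-based grouping of Strings or Categoricals.

```python
def coargsort_sketch(columns):
    """Return indices that sort rows by the first column, then the next, and so on."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same size")
    # sorted is stable, so equal rows keep their original relative order
    return sorted(range(n), key=lambda i: tuple(col[i] for col in columns))


a = [0, 1, 0, 1]
b = [1, 1, 0, 0]
perm = coargsort_sketch([a, b])
print(perm, [a[i] for i in perm], [b[i] for i in perm])
```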
arkouda.util.concatenate(items, ordered=True)[source]
arkouda.util.convert_bytes(nbytes: arkouda.numpy.dtypes.int_scalars, unit: Literal['B', 'KB', 'MB', 'GB'] = 'B') arkouda.numpy.dtypes.numeric_scalars[source]

Convert the number of bytes to KB, MB, or GB.

Parameters:
  • nbytes (int_scalars) – The number of bytes to convert

  • unit ({"B", "KB", "MB", "GB"}, default="B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Return type:

int
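
The conversion can be sketched as follows. The factor of 1024 between units is an assumption of this sketch (binary units); the text above does not state whether the function uses 1024 or 1000, and convert_bytes_sketch is a hypothetical name.

```python
def convert_bytes_sketch(nbytes, unit="B"):
    """Convert a byte count to the requested unit."""
    # ASSUMPTION: binary units, i.e. 1 KB = 1024 B
    factors = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
    if unit not in factors:
        raise ValueError("unit must be one of 'B', 'KB', 'MB', 'GB'")
    if unit == "B":
        # Bytes are returned unchanged, so the result stays an int
        return nbytes
    return nbytes / factors[unit]


print(convert_bytes_sketch(2048, "KB"))
```

Note that for units other than ‘B’ the division yields a float in this sketch.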

arkouda.util.convert_if_categorical(values)[source]

Convert a Categorical array to Strings for display

arkouda.util.create_pdarray(repMsg: str, max_bits=None) pdarray[source]

Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.

Parameters:

repMsg (str) – Space-delimited string containing the pdarray name, datatype, size, dimension, shape, and itemsize

Returns:

A pdarray instance pointing to the server-side array described by repMsg

Return type:

pdarray

Raises:
  • ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance

  • RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance

arkouda.util.dtype(dtype)[source]

Create a data type object.

Parameters:

dtype (object) – Object to be converted to a data type object.

Return type:

type

arkouda.util.enrich_inplace(data, keynames, aggregations, **kwargs)[source]
arkouda.util.expand(size: int | numpy.int64 | numpy.uint64, segs: arkouda.numpy.pdarrayclass.pdarray, vals: arkouda.numpy.pdarrayclass.pdarray) arkouda.numpy.pdarrayclass.pdarray[source]

Expand an array with values placed into the indicated segments.

Parameters:
  • size (int, np.int64, or np.uint64) – The size of the array to be expanded

  • segs (pdarray) – The indices where the values should be placed

  • vals (pdarray) – The values to be placed in each segment

Returns:

The expanded array.

Return type:

pdarray

Notes

This function (with different order of arguments) is now in arkouda proper as ak.broadcast. It is retained here for backwards compatibility.
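The sketch below illustrates one plausible reading of expand in plain Python, assuming that segs holds ascending segment start offsets beginning at 0 and that vals holds one value per segment. `expand_sketch` is a hypothetical helper working on lists, not the server-side routine.

```python
def expand_sketch(size, segs, vals):
    # Assumption: segs are ascending start offsets with segs[0] == 0,
    # and vals[i] fills positions segs[i] .. segs[i+1]-1 (or up to size).
    out = []
    for i, start in enumerate(segs):
        end = segs[i + 1] if i + 1 < len(segs) else size
        out.extend([vals[i]] * (end - start))
    return out


print(expand_sketch(7, [0, 2, 5], [10, 20, 30]))
# [10, 10, 20, 20, 20, 30, 30]
```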

arkouda.util.float_scalars
arkouda.util.generic_concat(items, ordered=True)[source]
arkouda.util.get_callback(x)[source]
arkouda.util.identity(x)[source]
arkouda.util.int_scalars
arkouda.util.invert_permutation(perm: arkouda.numpy.pdarrayclass.pdarray) arkouda.numpy.pdarrayclass.pdarray[source]

Find the inverse of a permutation array.

Parameters:

perm (pdarray) – The permutation array.

Returns:

The inverse of the permutation array.

Return type:

pdarray
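For intuition, the inverse of a permutation can be built by scattering positions: if perm[i] == j, then inverse[j] == i. The following plain-Python sketch (hypothetical name `invert_permutation_sketch`, lists instead of pdarrays) shows that relationship.

```python
def invert_permutation_sketch(perm):
    n = len(perm)
    if sorted(perm) != list(range(n)):
        raise ValueError("input is not a permutation of 0..n-1")
    inverse = [0] * n
    for i, j in enumerate(perm):
        inverse[j] = i  # position i of perm holds j, so j maps back to i
    return inverse


perm = [2, 0, 3, 1]
inv = invert_permutation_sketch(perm)
print(inv)                        # [1, 3, 0, 2]
print([perm[i] for i in inv])     # [0, 1, 2, 3]
```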

arkouda.util.is_float(arry: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical | arkouda.series.Series | arkouda.index.Index) bool[source]

Check if the dtype of the given array is float.

Parameters:

arry (pdarray, Strings, Categorical, Series, or Index) – The input pdarray, Strings, Categorical, Series, or Index object.

Returns:

True if the dtype of arry is float, False otherwise.

Return type:

bool

Examples

>>> data = ak.array([1.0, 2, 3, 4, np.nan])
>>> ak.util.is_float(data)
True
>>> data2 = ak.arange(5)
>>> ak.util.is_float(data2)
False
arkouda.util.is_int(arry: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical | arkouda.series.Series | arkouda.index.Index) bool[source]

Check if the dtype of the given array is int.

Parameters:

arry (pdarray, Strings, Categorical, Series, or Index) – The input pdarray, Strings, Categorical, Series, or Index object.

Returns:

True if the dtype of arry is int, False otherwise.

Return type:

bool

Examples

>>> data = ak.array([1.0, 2, 3, 4, np.nan])
>>> ak.util.is_int(data)
False
>>> data2 = ak.arange(5)
>>> ak.util.is_int(data2)
True
arkouda.util.is_numeric(arry: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical | arkouda.series.Series | arkouda.index.Index) bool[source]

Check if the dtype of the given array is numeric.

Parameters:

arry (pdarray, Strings, Categorical, Series, or Index) – The input pdarray, Strings, Categorical, Series, or Index object.

Returns:

True if the dtype of arry is numeric, False otherwise.

Return type:

bool

Examples

>>> data = ak.array([1, 2, 3, 4, 5])
>>> ak.util.is_numeric(data)
True
>>> strings = ak.array(["a", "b", "c"])
>>> ak.util.is_numeric(strings)
False
arkouda.util.is_registered(name: str, as_component: bool = False) bool[source]

Determine if the name provided is associated with a registered Object

Parameters:
  • name (str) – The name to check for in the registry

  • as_component (bool, default=False) – When True, the name will be checked to determine if it is registered as a component of a registered object

Return type:

bool

arkouda.util.list_registry(detailed: bool = False)[source]

Return a list containing the names of all registered objects

Parameters:

detailed (bool, default=False) – If True, return details of the registry objects. Currently this includes the object type of each object.

Returns:

Dict containing keys “Components” and “Objects”.

Return type:

dict

Raises:

RuntimeError – Raised if there’s a server-side error thrown

arkouda.util.map(values: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical, mapping: dict | arkouda.Series) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings[source]

Map values of an array according to an input mapping.

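Parameters:
  • values (pdarray, Strings, or Categorical) – The values to be mapped.

  • mapping (dict or arkouda.Series) – The mapping correspondence to apply to values.
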
Returns:

A new array with the values mapped by the mapping correspondence. When the input has Categorical values, the returned array will have Strings values. Otherwise, the return type will match the input type.

Return type:

arkouda.numpy.pdarrayclass.pdarray or arkouda.numpy.strings.Strings

Raises:

TypeError – Raised if mapping is not of type dict or arkouda.Series, or if values is not of type pdarray, Categorical, or Strings.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.numpy.util import map
>>> a = ak.array([2, 3, 2, 3, 4])
>>> a
array([2 3 2 3 4])
>>> ak.util.map(a, {4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})
array([30.00000000000000000 5.00000000000000000 30.00000000000000000
5.00000000000000000 25.00000000000000000])
>>> s = ak.Series(ak.array(["a","b","c","d"]), index = ak.array([4,2,1,3]))
>>> ak.util.map(a, s)
array(['b', 'd', 'b', 'd', 'a'])
arkouda.util.most_common(g, values)[source]
arkouda.util.numeric_scalars
class arkouda.util.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The number of elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

itemsize

The size in bytes of each element

Type:

int_scalars

BinOps
OpEqOps
all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff all elements of the array evaluate to True.

any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff any element of the array evaluates to True.

argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array max value.

argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values, sorted

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray
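To make the relationship between argmaxk and maxk concrete, here is a plain-Python sketch. It assumes the results are ordered by ascending value, which is one reading of "sorted" above and is not confirmed by this page; `argmaxk_sketch` and `maxk_sketch` are hypothetical names.

```python
def argmaxk_sketch(values, k):
    # Indices of the k largest values.
    # Assumption: results are ordered by ascending value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[-k:] if k > 0 else []


def maxk_sketch(values, k):
    # The k largest values themselves, in the same order as argmaxk_sketch.
    return [values[i] for i in argmaxk_sketch(values, k)]


data = [5, 1, 9, 3, 7]
print(argmaxk_sketch(data, 2))  # [4, 2]
print(maxk_sketch(data, 2))     # [7, 9]
```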

argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array min value

argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Find the indices corresponding to the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of minimum values to be returned by the output.

Returns:

Indices corresponding to the minimum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

astype(dtype) pdarray[source]

Cast values of pdarray to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) pdarray[source]

Return a pdarray attached to the registered name in the arkouda server, which was registered using register().

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

See also

register, unregister, is_registered, unregister_pdarray_by_name, list_registry

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = ak.zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
bigint_to_uint_arrays() List[pdarray][source]

Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Returns:

A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Return type:

List[pdarray]

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

pdarraycreation.bigint_from_uint_arrays

Examples

>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618"
"18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
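The split into 64-bit pieces can be mimicked with Python integers. This sketch assumes the number of pieces is derived from the largest value present (the server may instead use max_bits); `to_uint64_limbs` is a hypothetical helper.

```python
def to_uint64_limbs(values):
    """Split non-negative ints into 64-bit limbs, highest limb first."""
    mask = (1 << 64) - 1
    # Assumption: limb count comes from the widest value in the input.
    nlimbs = max(1, max(((v.bit_length() + 63) // 64 for v in values), default=1))
    return [
        [(v >> (64 * j)) & mask for v in values]
        for j in reversed(range(nlimbs))
    ]


vals = [2**64 + i for i in range(5)]
print(to_uint64_limbs(vals))  # [[1, 1, 1, 1, 1], [0, 1, 2, 3, 4]]
```

The output matches the two arrays shown in the example above.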
clz() pdarray[source]

Count the number of leading zeros in each element. See ak.clz.

corr(y: pdarray) numpy.float64[source]

Compute the correlation between self and y using the Pearson correlation coefficient.

Parameters:

y (pdarray) – Other pdarray used to calculate correlation

Returns:

The scalar correlation of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown
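As a reference point, the Pearson correlation coefficient can be computed directly from its definition. The sketch below works on plain Python lists; `pearson_corr` is a hypothetical name and it does not guard against constant inputs (which would divide by zero).

```python
import math


def pearson_corr(x, y):
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


print(pearson_corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```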

cov(y: pdarray) numpy.float64[source]

Compute the covariance between self and y.

Parameters:

y (pdarray) – Other pdarray used to calculate covariance

Returns:

The scalar covariance of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

ctz() pdarray[source]

Count the number of trailing zeros in each element. See ak.ctz.

dtype
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether pdarrays are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the pdarrays are the same, otherwise False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5])
>>> a.equals(a2)
False
fill(value: arkouda.numpy.dtypes.numeric_scalars) None[source]

Fill the array (in place) with a constant value.

Parameters:

value (numeric_scalars)

Raises:

TypeError – Raised if value is not an int, int64, float, or float64

flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

format_other(other) str[source]

Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.

Parameters:

other (object) – The scalar to be cast to the pdarray.dtype

Return type:

string representation of np.dtype corresponding to the other parameter

Raises:

TypeError – Raised if the other parameter cannot be converted to Numpy dtype

property inferred_type: str | None

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

Note

This will return True if the object is registered itself or as a component of another object

is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff the array is monotonically non-decreasing.

Parameters:

None

Returns:

Indicates if the array is monotonically non-decreasing

Return type:

bool

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown
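"Monotonically non-decreasing" means every element is less than or equal to its successor, so repeated values are allowed. A plain-Python sketch of that check (hypothetical name `is_sorted_sketch`, for a one-dimensional list):

```python
def is_sorted_sketch(values):
    # Compare each element with its successor; equal neighbours are allowed.
    return all(a <= b for a, b in zip(values, values[1:]))


print(is_sorted_sketch([1, 2, 2, 3]))  # True
print(is_sorted_sketch([1, 3, 2]))     # False
```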

itemsize
max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the maximum value of the array.

property max_bits
maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

mean() numpy.float64[source]

Return the mean of the array.

min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the minimum value of the array.

mink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of minimum values to be returned by the output.

Returns:

The minimum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

name
property nbytes

The size of the pdarray in bytes.

Returns:

The size of the pdarray in bytes.

Return type:

int

ndim
objType = 'pdarray'
opeq(other, op)[source]
parity() pdarray[source]

Find the parity (XOR of all bits) in each element. See ak.parity.

popcount() pdarray[source]

Find the population (number of bits set) in each element. See ak.popcount.
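The two bit-level quantities described above (parity and popcount) are closely related: parity is the popcount modulo 2. The sketch below shows both for a single value, assuming 64-bit elements; `popcount64` and `parity64` are hypothetical helpers operating on Python ints.

```python
MASK64 = (1 << 64) - 1  # assumption: elements are 64 bits wide


def popcount64(x):
    # Number of set bits in the 64-bit two's-complement representation of x.
    return bin(x & MASK64).count("1")


def parity64(x):
    # XOR of all bits equals the popcount modulo 2.
    return popcount64(x) & 1


print(popcount64(0b1011), parity64(0b1011))  # 3 1
print(popcount64(-1), parity64(-1))          # 64 0
```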

pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the product of all elements in the array. Return value is always a np.float64 or np.int64.

register(user_defined_name: str) pdarray[source]

Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.

Parameters:

user_defined_name (str) – user defined name array is to be registered under

Returns:

The same pdarray, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different pdarrays with the same name.

Return type:

pdarray

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the pdarray with the user_defined_name. If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.

See also

attach, unregister, is_registered, list_registry, unregister_pdarray_by_name

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = ak.zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
registered_name: str | None = None
reshape(*shape)[source]

Gives a new shape to an array without changing its data.

Parameters:

shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.

Returns:

a pdarray with the same data, reshaped to the new shape

Return type:

pdarray

rotl(other) pdarray[source]

Rotate bits left by <other>.

rotr(other) pdarray[source]

Rotate bits right by <other>.
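A bit rotation shifts bits in one direction and wraps the bits that fall off back in at the other end. The sketch below shows this for a single unsigned value, assuming a 64-bit width; `rotl64` and `rotr64` are hypothetical helpers, not the pdarray methods.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit unsigned elements


def rotl64(x, n):
    n %= 64
    x &= MASK64
    # Bits shifted out on the left re-enter on the right.
    return ((x << n) | (x >> (64 - n))) & MASK64


def rotr64(x, n):
    n %= 64
    x &= MASK64
    # Bits shifted out on the right re-enter on the left.
    return ((x >> n) | (x << (64 - n))) & MASK64


print(rotl64(1, 1))              # 2
print(rotl64(1 << 63, 1))        # 1
print(rotr64(1, 1) == 1 << 63)   # True
```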

save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED. Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string

See also

save_all, load, read, to_parquet, to_hdf

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
property shape

Return the shape of an array.

Returns:

The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of int

size
slice_bits(low, high) pdarray[source]

Returns a pdarray containing only bits from low to high of self.

This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)

Parameters:
  • low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0

  • high (int) – The highest bit included in the slice (inclusive)

Returns:

A new pdarray containing the bits of self from low to high

Return type:

pdarray

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
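The inclusive, zero-indexed slicing described above amounts to a right shift followed by a mask. The following sketch applies it to a single Python integer; `slice_bits_sketch` is a hypothetical helper, not the pdarray method.

```python
def slice_bits_sketch(x, low, high):
    if low < 0 or high < low:
        raise ValueError("require 0 <= low <= high")
    width = high - low + 1          # inclusive on both ends
    return (x >> low) & ((1 << width) - 1)


p = 2**65 + (2**64 - 1)
print(bin(slice_bits_sketch(p, 64, 65)))          # 0b10
print(slice_bits_sketch(p, 0, 63) == 2**64 - 1)   # True
```

The first result matches the example above: bit 65 is set and bit 64 is clear.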
std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the standard deviation. See arkouda.std for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown
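The role of ddof can be illustrated with the usual definition, in which the sum of squared deviations is divided by N - ddof. This is a sketch of that conventional formula on a plain list (hypothetical name `std_sketch`); see arkouda.std for the authoritative behaviour.

```python
import math


def std_sketch(values, ddof=0):
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("ddof must be smaller than the number of elements")
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    # ddof=0 gives the population value, ddof=1 the sample value.
    return math.sqrt(squared_dev / (n - ddof))


data = [1, 2, 3, 4]
print(std_sketch(data))          # sqrt(1.25)
print(std_sketch(data, ddof=1))  # sqrt(5/3)
```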

sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the sum of all elements in the array.

to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]

Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str) – Column name to save the pdarray under. Defaults to “array”.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Returns:

Response message

Return type:

str

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

to_cuda()[source]

Convert the array to a Numba DeviceNDArray, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A Numba ndarray with the same attributes and data as the pdarray; on GPU

Return type:

numba.DeviceNDArray

Raises:
  • ImportError – Raised if CUDA is not available

  • ModuleNotFoundError – Raised if Numba is either not installed or not enabled

  • RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the pdarray to HDF5. The object can be saved to a collection of files or a single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array into a single HDF5 file on the root node.
``cwd/path/name_prefix.hdf5``
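The naming pattern in the examples above (a _LOCALE#### suffix inserted before any extension) can be sketched as follows. The four-digit zero padding is inferred from the "####" placeholder and is an assumption, as is numbering from 0; `locale_filenames` is a hypothetical helper, not part of arkouda.

```python
import os


def locale_filenames(prefix_path, num_locales, width=4):
    # Assumption: the suffix goes before the extension, zero-padded to `width`
    # digits, with locales numbered from 0.
    root, ext = os.path.splitext(prefix_path)
    return [f"{root}_LOCALE{i:0{width}d}{ext}" for i in range(num_locales)]


print(locale_filenames("path/prefix.h5", 2))
# ['path/prefix_LOCALE0000.h5', 'path/prefix_LOCALE0001.h5']
```

A helper like this can be handy for checking which files a distributed write is expected to produce.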
to_list() List[source]

Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A list with the same data as the pdarray

Return type:

list

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.

  • ‘append’ write mode is supported, but is not efficient.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a pdarray to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, so it will use ports in the range {port..(port+numLocales)} (e.g., when running an Arkouda server of 4 nodes with port 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
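As a rough illustration of the port usage described for the port parameter of transfer, the following plain-Python sketch lists the ports a transfer would occupy. The helper name `transfer_ports` is hypothetical and not part of Arkouda; it simply reproduces the worked example of 4 locales starting at port 1234.

```python
def transfer_ports(port, num_locales):
    """Hypothetical helper: ports used when sending from `num_locales` locales."""
    # One consecutive port per locale, starting at the port passed to transfer().
    return list(range(port, port + num_locales))


print(transfer_ports(1234, 4))  # [1234, 1235, 1236, 1237]
```

A check like this can help confirm that every port in the range is free before starting a transfer.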

unregister() None[source]

Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

See also

register, unregister, is_registered, unregister_pdarray_by_name, list_registry

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = ak.zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting this to False yields better performance, but causes file sizes to grow.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added
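The file-name check mentioned in the first note can be pictured with a small regular expression. This is a hedged sketch, not the check Arkouda performs: the function name `looks_distributed` is hypothetical, and the pattern assumes the suffix is `_LOCALE` followed by exactly four digits and an optional extension.

```python
import re

# Assumed pattern: "_LOCALE" plus four digits, optionally followed by an extension.
_LOCALE_SUFFIX = re.compile(r"_LOCALE\d{4}(\.[^./\\]+)?$")


def looks_distributed(filename):
    """Hypothetical check: does the file name carry a per-locale suffix?"""
    return _LOCALE_SUFFIX.search(filename) is not None


print(looks_distributed("data/prefix_LOCALE0000"))     # True
print(looks_distributed("data/prefix_LOCALE0003.h5"))  # True
print(looks_distributed("data/prefix.h5"))             # False
```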

value_counts()[source]

Count the occurrences of the unique values of self.

Returns:

  • unique_values (pdarray) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Examples

>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
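For readers who want to see the semantics without a running server, the same result can be reproduced on a plain Python list. This is only a client-side sketch of what the return values mean; the name `value_counts_sketch` is hypothetical and the code says nothing about how the server computes the counts.

```python
from collections import Counter


def value_counts_sketch(values):
    """Hypothetical pure-Python analogue of pdarray.value_counts()."""
    counts = Counter(values)
    # Unique values in ascending order, matching the documented return order.
    unique_values = sorted(counts)
    # The count at position k belongs to the unique value at position k.
    return unique_values, [counts[v] for v in unique_values]


print(value_counts_sketch([2, 0, 2, 4, 0, 0]))  # ([0, 2, 4], [3, 2, 1])
```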
var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the variance. See arkouda.var for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown
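The role of ddof can be illustrated with a plain-Python sketch that uses the conventional definition, where the sum of squared deviations is divided by N - ddof. The function name `var_sketch` is hypothetical, and the assumption that Arkouda follows this same convention is based on the ddof description and the ValueError condition above, not on the server code.

```python
def var_sketch(values, ddof=0):
    """Hypothetical pure-Python variance with "Delta Degrees of Freedom"."""
    n = len(values)
    # Mirrors the documented error condition: ddof must be smaller than the size.
    if ddof >= n:
        raise ValueError("ddof must be smaller than the number of elements")
    mean = sum(values) / n
    # Divide the sum of squared deviations by (n - ddof).
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


print(var_sketch([1, 2, 3, 4]))          # 1.25
print(var_sketch([1, 2, 3, 4], ddof=1))
```

With ddof=0 the result is the population variance; ddof=1 gives the usual sample variance.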

arkouda.util.register(obj, name)[source]

Register an arkouda object with a user-specified name. Backwards compatible with earlier arkouda versions.

arkouda.util.register_all(data: dict)[source]

Register all objects in the provided dictionary

Parameters:

data (dict) – Maps each name to the object to register under that name. For example, {“MyArray”: ak.array([0, 1, 2])}

Return type:

None

arkouda.util.report_mem(pre='')[source]
arkouda.util.sparse_sum_help(idx1: arkouda.numpy.pdarrayclass.pdarray, idx2: arkouda.numpy.pdarrayclass.pdarray, val1: arkouda.numpy.pdarrayclass.pdarray, val2: arkouda.numpy.pdarrayclass.pdarray, merge: bool = True, percent_transfer_limit: int = 100) Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray][source]

Helper for summing two sparse matrices together

Return is equivalent to ak.GroupBy(ak.concatenate([idx1, idx2])).sum(ak.concatenate((val1, val2)))

Parameters:
  • idx1 (pdarray) – indices for the first sparse matrix

  • idx2 (pdarray) – indices for the second sparse matrix

  • val1 (pdarray) – values for the first sparse matrix

  • val2 (pdarray) – values for the second sparse matrix

  • merge (bool, default=True) – If True, the indices are combined using a merge-based workflow; otherwise they are combined using a sort-based workflow.

  • percent_transfer_limit (int, default=100) – Only used when merge is true. This is the maximum percentage of the data allowed to be moved between locales during the merge workflow. If we would exceed this percentage, we fall back to using the sort based workflow.

Returns:

indices and values for the summed sparse matrix

Return type:

(pdarray, pdarray)

Examples

>>> idx1 = ak.array([0, 1, 3, 4, 7, 9])
>>> idx2 = ak.array([0, 1, 3, 6, 9])
>>> vals1 = idx1
>>> vals2 = ak.array([10, 11, 13, 16, 19])
>>> ak.util.sparse_sum_help(idx1, idx2, vals1, vals2)
(array([0 1 3 4 6 7 9]), array([10 12 16 4 16 7 28]))
>>> ak.GroupBy(ak.concatenate([idx1, idx2])).sum(ak.concatenate((vals1, vals2)))
(array([0 1 3 4 6 7 9]), array([10 12 16 4 16 7 28]))
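The equivalence stated above (group the concatenated indices, then sum the matching values) can also be checked by hand with a plain-Python sketch. The name `sparse_sum_sketch` is hypothetical; it models only the result, not the merge-based or sort-based workflows that the server chooses between.

```python
def sparse_sum_sketch(idx1, idx2, val1, val2):
    """Hypothetical pure-Python analogue: sum values that share an index."""
    totals = {}
    # Walk both (index, value) lists as if they had been concatenated.
    for i, v in zip(list(idx1) + list(idx2), list(val1) + list(val2)):
        totals[i] = totals.get(i, 0) + v
    # Return indices in ascending order with their summed values.
    indices = sorted(totals)
    return indices, [totals[i] for i in indices]


idx1 = [0, 1, 3, 4, 7, 9]
idx2 = [0, 1, 3, 6, 9]
print(sparse_sum_sketch(idx1, idx2, idx1, [10, 11, 13, 16, 19]))
# ([0, 1, 3, 4, 6, 7, 9], [10, 12, 16, 4, 16, 7, 28])
```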
arkouda.util.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, pdarray, pdarray, int][source]

Find the unique elements of an array.

Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.

Parameters:
  • pda ((list of) pdarray, Strings, or Categorical) – Input array.

  • return_groups (bool, optional) – If True, also return grouping information for the array.

  • assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step

  • return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups

Returns:

  • unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.

  • permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)

  • segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)

Raises:
  • TypeError – Raised if pda is not a pdarray or Strings object

  • RuntimeError – Raised if the pdarray or Strings dtype is unsupported

Notes

For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.

Examples

>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
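The extra outputs returned with return_groups=True can be pictured with a plain-Python sketch. The name `unique_with_groups` is hypothetical, and the use of a stable sort to build the permutation is an assumption; the order of elements within a group in Arkouda's actual permutation may differ.

```python
def unique_with_groups(values):
    """Hypothetical pure-Python analogue of unique(..., return_groups=True)."""
    # Permutation that groups equal values together (stable sort by value).
    permutation = sorted(range(len(values)), key=lambda i: values[i])
    unique_vals, segments = [], []
    for pos, i in enumerate(permutation):
        # A new group starts wherever the value changes in the permuted order.
        if not unique_vals or values[i] != unique_vals[-1]:
            unique_vals.append(values[i])
            segments.append(pos)
    return unique_vals, permutation, segments


print(unique_with_groups([3, 2, 1, 1, 2, 3]))
# ([1, 2, 3], [2, 3, 1, 4, 0, 5], [0, 2, 4])
```

Each entry of segments is the offset in the permuted array at which the corresponding unique value's group begins.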
arkouda.util.unregister(name: str) str[source]
arkouda.util.unregister_all(names: list)[source]

Unregister all names provided

Parameters:

names (list) – List of names used to register objects to be unregistered

Return type:

None