arkouda.util¶
Attributes¶
Classes¶
Represent integers as bit vectors, e.g. a set of flags.
Represents a date and/or time.
Group an array or list of arrays by value, usually in preparation
Represent integers as IPv4 addresses.
Represents an array of strings whose data resides on the
Represents a duration, the difference between two dates or times.
The basic arkouda array class. This class contains only the
Functions¶
Make a callback (i.e. function) that can be called on an
arange([start,] stop[, stride,] dtype=int64)
Attach to all objects registered with the names provided
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
Algorithm to determine shape of broadcasted PD array given two array shapes
Return the permutation that groups the rows (left-to-right), if the
Convert the number of bytes to KB, MB, or GB.
Convert a Categorical array to Strings for display
Return a pdarray instance pointing to an array created by the arkouda server.
Create a data type object.
Expand an array with values placed into the indicated segments.
Find the inverse of a permutation array.
Check if the dtype of the given array is float.
Check if the dtype of the given array is int.
Check if the dtype of the given array is numeric.
Determine if the name provided is associated with a registered Object
Return a list containing the names of all registered objects
Map values of an array according to an input mapping.
Register an arkouda object with a user-specified name. Backwards compatible
Register all objects in the provided dictionary
Helper for summing two sparse matrices together
Find the unique elements of an array.
Unregister all names provided
Package Contents¶
- class arkouda.util.BitVector(values, width=64, reverse=False)[source]¶
Bases:
arkouda.numpy.pdarrayclass.pdarray
Represent integers as bit vectors, e.g. a set of flags.
- Parameters:
values (pdarray, int64) – The integers to represent as bit vectors
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.
- Returns:
bitvectors – The array of binary vectors
- Return type:
Notes
This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like a uint64 pdarray.
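The effect of width and reverse on display can be pictured with a small pure-Python sketch. The helper format_bits is hypothetical and not part of Arkouda, and the characters Arkouda actually prints for set and unset bits may differ; the sketch only illustrates the bit ordering described above.

```python
def format_bits(value: int, width: int = 64, reverse: bool = False) -> str:
    """Render the low `width` bits of a non-negative integer as 0s and 1s."""
    if value < 0:
        raise ValueError("value must be non-negative")
    # Most significant bit first, zero-padded to `width` characters.
    bits = format(value & ((1 << width) - 1), f"0{width}b")
    # With reverse=True the least significant bit is shown on the left.
    return bits[::-1] if reverse else bits


print(format_bits(5, width=8))                # 00000101
print(format_bits(5, width=8, reverse=True))  # 10100000
```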
- conserves¶
- register(user_defined_name)[source]¶
Register this BitVector object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – User-defined name the BitVector is to be registered under. This will be the root name for underlying components.
- Returns:
The same BitVector, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different BitVectors with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the BitVector with the user_defined_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- registered_name = None¶
- reverse = False¶
- special_objType = 'BitVector'¶
- values¶
- width = 64¶
- arkouda.util.BitVectorizer(width=64, reverse=False)[source]¶
Make a callback (i.e. function) that can be called on an array to create a BitVector.
- Parameters:
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.
- Returns:
bitvectorizer – A function that takes an array and returns a BitVector instance
- Return type:
callable
- class arkouda.util.Datetime(pda, unit: str = _BASE_UNIT)[source]¶
Bases:
_AbstractBaseTime
Represents a date and/or time.
Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.
- Parameters:
pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array)
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.
Possible values:
'weeks' or 'w'
'days' or 'd'
'hours' or 'h'
'minutes', 'm', or 't'
'seconds' or 's'
'milliseconds', 'ms', or 'l'
'microseconds', 'us', or 'u'
'nanoseconds', 'ns', or 'n'
Unlike in pandas, units cannot be combined or mixed with integers.
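As a rough illustration of the unit handling listed above, the following pure-Python sketch resolves a unit string and converts an integer to nanoseconds. The names normalize_unit and to_nanoseconds are hypothetical and not Arkouda functions, and rejecting ambiguous prefixes such as 'mi' is an assumption of this sketch rather than documented behavior.

```python
# Nanoseconds per unit, keyed by full unit name.
_NS_PER_UNIT = {
    "weeks": 7 * 24 * 3600 * 10**9,
    "days": 24 * 3600 * 10**9,
    "hours": 3600 * 10**9,
    "minutes": 60 * 10**9,
    "seconds": 10**9,
    "milliseconds": 10**6,
    "microseconds": 10**3,
    "nanoseconds": 1,
}

# Short aliases taken from the list of possible values.
_ALIASES = {
    "w": "weeks", "d": "days", "h": "hours",
    "m": "minutes", "t": "minutes", "s": "seconds",
    "ms": "milliseconds", "l": "milliseconds",
    "us": "microseconds", "u": "microseconds",
    "ns": "nanoseconds", "n": "nanoseconds",
}


def normalize_unit(unit: str) -> str:
    """Map an alias or a prefix of a full name to the full unit name."""
    key = unit.lower()  # units are not case-sensitive
    if key in _ALIASES:
        return _ALIASES[key]
    matches = [name for name in _NS_PER_UNIT if name.startswith(key)]
    if len(matches) != 1:
        # Assumption: unknown or ambiguous prefixes are rejected.
        raise ValueError(f"unrecognized or ambiguous unit: {unit!r}")
    return matches[0]


def to_nanoseconds(value: int, unit: str = "ns") -> int:
    """Convert an integer count of `unit` to nanoseconds."""
    return value * _NS_PER_UNIT[normalize_unit(unit)]


print(to_nanoseconds(2, "sec"))  # 2000000000
```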
Notes
The .values attribute is always in nanoseconds with int64 dtype.
- property date¶
- property day¶
- property day_of_week¶
- property day_of_year¶
- property dayofweek¶
- property dayofyear¶
- property hour¶
- property is_leap_year¶
- is_registered() numpy.bool_ [source]¶
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool_
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- property microsecond¶
- property millisecond¶
- property minute¶
- property month¶
- property nanosecond¶
- register(user_defined_name)[source]¶
Register this Datetime object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – User-defined name the Datetime is to be registered under. This will be the root name for underlying components.
- Returns:
The same Datetime, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different Datetimes with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Datetime with the user_defined_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- property second¶
- special_objType = 'Datetime'¶
- supported_opeq¶
- supported_with_datetime¶
- supported_with_pdarray¶
- supported_with_r_datetime¶
- supported_with_r_pdarray¶
- supported_with_r_timedelta¶
- supported_with_timedelta¶
- to_pandas()[source]¶
Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.
See also
to_ndarray
- unregister()[source]¶
Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- property week¶
- property weekday¶
- property weekofyear¶
- property year¶
- class arkouda.util.GroupBy[source]¶
Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.
- Parameters:
keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row
assume_sorted (bool) – If True, assume keys is already sorted (Default: False)
- nkeys¶
The number of key arrays (columns)
- Type:
int
- unique_keys¶
The unique values of the keys array(s), in grouped order
- Type:
(list of) pdarray, Strings, or Categorical
- ngroups¶
The length of the unique_keys array(s), i.e. number of groups
- Type:
int
- logger¶
Used for all logging operations
- Type:
ArkoudaLogger
- dropna¶
If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.
- Type:
bool (default=True)
- Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:
1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.
2. (Optional) a .group() method that returns the permutation that groups the array
If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.
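To make the notion of "the permutation that groups the array" concrete, here is a pure-Python sketch of the three pieces a grouping yields: a permutation, the unique keys, and the offset at which each group's segment starts. The function group_components is hypothetical, and the stable sort is only one way to obtain such a permutation; the server-side algorithm may differ.

```python
def group_components(keys):
    """Return (permutation, unique_keys, segments) for a list of keys."""
    # Stable argsort: indices ordered by key, ties kept in original order.
    permutation = sorted(range(len(keys)), key=lambda i: keys[i])
    unique_keys, segments = [], []
    for pos, idx in enumerate(permutation):
        # A new segment starts wherever the key changes in grouped order.
        if not unique_keys or keys[idx] != unique_keys[-1]:
            unique_keys.append(keys[idx])
            segments.append(pos)
    return permutation, unique_keys, segments


keys = [3, 3, 4, 3, 3, 2, 3, 2, 4, 2]
permutation, unique_keys, segments = group_components(keys)
print(permutation)  # [5, 7, 9, 0, 1, 3, 4, 6, 2, 8]
print(unique_keys)  # [2, 3, 4]
print(segments)     # [0, 3, 8]
```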
- AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray] [source]¶
Bitwise AND of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with AND
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if AND is not supported for the values dtype
- OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray] [source]¶
Bitwise OR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with OR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if OR is not supported for the values dtype
- Reductions(*args, **kwargs)¶
frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object
Build an immutable unordered collection of unique elements.
- XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray] [source]¶
Bitwise XOR of values in each segment.
Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.
- Parameters:
values (pdarray, int64) – The values to group and reduce with XOR
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if XOR is not supported for the values dtype
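AND, OR, and XOR all follow the same pattern: collect the values belonging to each key, then fold them with a bitwise operator. The sketch below mimics that pattern on plain Python lists. The helper group_bitwise is hypothetical and is not how the Arkouda server computes these reductions.

```python
from functools import reduce
import operator


def group_bitwise(keys, values, op):
    """Fold the values of each group with a binary bitwise operator."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)  # grouped (sorted) key order
    return unique_keys, [reduce(op, groups[k]) for k in unique_keys]


keys = [1, 1, 2, 2, 2]
values = [0b1100, 0b1010, 0b0111, 0b0101, 0b0001]
print(group_bitwise(keys, values, operator.and_))  # ([1, 2], [8, 1])
print(group_bitwise(keys, values, operator.or_))   # ([1, 2], [14, 7])
print(group_bitwise(keys, values, operator.xor))   # ([1, 2], [6, 3])
```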
- aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.
- Parameters:
values (pdarray) – The values to group and reduce
operator (str) – The name of the reduction operator to use
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
aggregates (groupable) – One aggregate value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if the requested operator is not supported for the values dtype
Examples
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768, -0.55555555555555536, -0.33333333333333348, -0.11111111111111116, 0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768, 1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779, -0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116, 0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
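The dispatch-by-name behavior of aggregate, including the role of ddof for 'std', can be sketched in pure Python. The REDUCTIONS table here is a hypothetical three-entry stand-in for GroupBy.Reductions, and returning NaN when a group has no more than ddof elements is an assumption of this sketch.

```python
import math


def _std(xs, ddof):
    """Standard deviation with divisor N - ddof."""
    n = len(xs)
    if n - ddof <= 0:
        return math.nan  # assumption: too few elements yields NaN
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - ddof))


# Hypothetical subset of supported reduction names.
REDUCTIONS = {
    "sum": lambda xs, ddof: sum(xs),
    "mean": lambda xs, ddof: sum(xs) / len(xs),
    "std": _std,
}


def aggregate(keys, values, operator, ddof=1):
    """Group values by key and apply the named reduction to each group."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if operator not in REDUCTIONS:
        raise ValueError(f"unsupported reduction: {operator!r}")
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)
    return unique_keys, [REDUCTIONS[operator](groups[k], ddof) for k in unique_keys]


keys = [0, 0, 0, 1, 1]
vals = [1.0, 2.0, 3.0, 4.0, 8.0]
print(aggregate(keys, vals, "sum"))          # ([0, 1], [6.0, 12.0])
print(aggregate(keys, vals, "std", ddof=0))  # population standard deviation
```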
- all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “and”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_all (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if all is not supported for the values dtype
- any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.
- Parameters:
values (pdarray, bool) – The values to group and reduce with “or”
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_any (pdarray, bool) – One bool per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
- argmax(values: pdarray) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmax
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
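The note that returned indices refer to the original values array can be checked with a pure-Python sketch run on the same data as the example above. The helper group_argmax is hypothetical and only mirrors the documented semantics (first maximum per group, original indexing).

```python
def group_argmax(keys, values):
    """Index (into the original array) of the first maximum in each group."""
    best = {}
    for i, (k, v) in enumerate(zip(keys, values)):
        # Strict '>' keeps the earliest index when several values tie.
        if k not in best or v > values[best[k]]:
            best[k] = i
    unique_keys = sorted(best)
    return unique_keys, [best[k] for k in unique_keys]


a = [3, 3, 4, 3, 3, 2, 3, 2, 4, 2]
b = [3, 3, 3, 4, 1, 1, 3, 3, 3, 4]
print(group_argmax(a, b))  # ([2, 3, 4], [9, 3, 2])
```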
- argmin(values: pdarray) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find argmin
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
- attach(user_defined_name: str) GroupBy [source]¶
Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which GroupBy object was registered under
- Returns:
The GroupBy object created by re-attaching to the corresponding server components
- Return type:
- Raises:
RegistrationError – if user_defined_name is not registered
See also
register, is_registered, unregister, unregister_groupby_by_name
- broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings [source]¶
Fill each group’s segment with a constant value.
- Parameters:
- Returns:
The broadcasted values
- Return type:
- Raises:
TypeError – Raised if values is not a pdarray object
ValueError – Raised if the values array does not have one value per segment
Notes
This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
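The difference between the two orderings can be reproduced with a small pure-Python sketch on the first example's data. The function below is a hypothetical stand-in for the method, written against plain lists, and it assumes one value per unique key in sorted key order.

```python
def broadcast(keys, values, permute=True):
    """Replicate one value per group to every element of that group."""
    unique_keys = sorted(set(keys))
    if len(values) != len(unique_keys):
        raise ValueError("values must have one entry per group")
    lookup = dict(zip(unique_keys, values))
    if permute:
        # Original order: each element receives its own group's value.
        return [lookup[k] for k in keys]
    # Grouped order: elements are laid out group by group.
    return [lookup[k] for k in sorted(keys)]


a = [0, 1, 0, 1, 0]
print(broadcast(a, [3, 5]))                 # [3, 5, 3, 5, 3]
print(broadcast(a, [3, 5], permute=False))  # [3, 3, 3, 5, 5]
```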
- build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy [source]¶
Build a new GroupBy object from component keys and permutation.
- Parameters:
user_defined_name (str, optional) – Passing a name will init the new GroupBy and assign it the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
- Returns:
The GroupBy object created by using the given components
- Return type:
- count(values: pdarray) Tuple[groupable, pdarray] [source]¶
Count the number of elements in each group. NaN values will be excluded from the total.
- Parameters:
values (pdarray) – The values to be counted by group (excluding NaN values).
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).
Examples
>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
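The distinction between count (NaN excluded) and size (every row counted) can be seen side by side in a pure-Python sketch using the example's data. Both helpers are hypothetical and operate on plain lists rather than pdarrays.

```python
import math


def group_count(keys, values):
    """Number of non-NaN values per group."""
    counts = {k: 0 for k in keys}
    for k, v in zip(keys, values):
        if not (isinstance(v, float) and math.isnan(v)):
            counts[k] += 1
    unique_keys = sorted(counts)
    return unique_keys, [counts[k] for k in unique_keys]


def group_size(keys):
    """Number of rows per group, NaN or not."""
    sizes = {}
    for k in keys:
        sizes[k] = sizes.get(k, 0) + 1
    unique_keys = sorted(sizes)
    return unique_keys, [sizes[k] for k in unique_keys]


a = [1, 0, -1, 1, 0, -1]
b = [1.0, math.nan, -1.0, math.nan, math.nan, -1.0]
print(group_count(a, b))  # ([-1, 0, 1], [2, 0, 1])
print(group_size(a))      # ([-1, 0, 1], [2, 2, 2])
```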
- first(values: groupable_element_type) Tuple[groupable, groupable_element_type] [source]¶
First value in each group.
- Parameters:
values (pdarray-like) – The values from which to take the first of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first value of each group
- head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type] [source]¶
Return the first n values from each group.
- Parameters:
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.
return_indices (bool, default True) – If True, return the indices of the selected values. Otherwise, return the selected values.
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first n items of each group. If return_indices is True, the result contains indices; otherwise, it contains values.
Examples
>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = ak.GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 = -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
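The relationship between the returned indices and the returned values can be reproduced with a pure-Python sketch on the same data. The helper group_head is hypothetical; it takes up to n elements per group in original order and concatenates them in grouped key order.

```python
def group_head(keys, values, n=5, return_indices=True):
    """First n elements of each group, as indices or as values."""
    picked = {}
    for i, k in enumerate(keys):
        bucket = picked.setdefault(k, [])
        if len(bucket) < n:  # groups smaller than n contribute everything
            bucket.append(i)
    unique_keys = sorted(picked)
    idx = [i for k in unique_keys for i in picked[k]]
    if return_indices:
        return unique_keys, idx
    return unique_keys, [values[i] for i in idx]


a = [i % 3 for i in range(10)]
v2 = [-2 * i for i in range(10)]
print(group_head(a, v2, 2, return_indices=True))   # ([0, 1, 2], [0, 3, 1, 4, 2, 5])
print(group_head(a, v2, 2, return_indices=False))  # ([0, 1, 2], [0, -6, -2, -8, -4, -10])
```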
- is_registered() bool [source]¶
Return True if the object is contained in the registry
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mismatch of registered components
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find maxima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_maxima (pdarray) – One maximum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if max is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
- mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.
- Parameters:
values (pdarray) – The values to group and average
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
- median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.
- Parameters:
values (pdarray) – The values to group and find median
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
- min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.
- Parameters:
values (pdarray) – The values to group and find minima
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_minima (pdarray) – One minimum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if min is not supported for the values dtype
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
- mode(values: groupable) Tuple[groupable, groupable] [source]¶
Most common value in each group. If a group is multi-modal, return the modal value that occurs first.
- Parameters:
values ((list of) pdarray-like) – The values from which to take the mode of each group
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) pdarray-like) – The most common value of each group
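The tie-breaking rule for multi-modal groups can be illustrated with a pure-Python sketch. The helper group_mode is hypothetical, and it reads "occurs first" as "appears earliest in the original order within the group", which is an assumption about the intended meaning.

```python
from collections import Counter


def group_mode(keys, values):
    """Most common value per group; ties go to the earliest occurrence."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)
    modes = []
    for k in unique_keys:
        counts = Counter(groups[k])
        top = max(counts.values())
        # Scan in original order so the first modal value wins a tie.
        modes.append(next(v for v in groups[k] if counts[v] == top))
    return unique_keys, modes


keys = [1, 1, 1, 1, 2, 2, 2]
values = [7, 5, 5, 7, 9, 8, 8]
print(group_mode(keys, values))  # ([1, 2], [7, 8])
```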
- nunique(values: groupable) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.
- Parameters:
values (pdarray, int64) – The values to group and find unique values
- Returns:
unique_keys (groupable) – The unique keys, in grouped order
group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if nunique is not supported for the values dtype
Examples
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
# Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
# Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
# Group (3,3,3) has values [3,4,1] -> 3 unique values
# Group (4) has values [4] -> 1 unique value
- objType(*args, **kwargs)¶
str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str
Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.
- prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.
- Parameters:
values (pdarray) – The values to group and multiply
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_products (pdarray, float64) – One product per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if prod is not supported for the values dtype
Notes
The return dtype is always float64.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
- register(user_defined_name: str) GroupBy [source]¶
Register this GroupBy object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – User-defined name the GroupBy is to be registered under. This will be the root name for underlying components.
- Returns:
The same GroupBy, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Note that you cannot register two different GroupBys with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the GroupBy with the user_defined_name
See also
unregister, attach, unregister_groupby_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]¶
Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility
- Parameters:
values ((list of) pdarray-like) – The values from which to sample, according to their group membership.
n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the value more than once.
weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, it must have the same length as the groupby keys, and its values will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.
return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.
permute_samples (bool, default False) – If True, permute the samples according to group. Otherwise, keep samples in original order.
- Returns:
if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.
- Return type:
- size() Tuple[groupable, pdarray] [source]¶
Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).
- Parameters:
none
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
counts (pdarray, int64) – The number of times each unique key appears
See also
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
- std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.
- Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).
The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
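The role of ddof in the divisor can be checked with a small pure-Python sketch. This is an illustration of the formula in the Notes, not arkouda's implementation; grouped_std is a hypothetical helper, and it does not handle groups with ddof or fewer elements (the divisor would be zero or negative).

```python
import math
from collections import defaultdict


def grouped_std(keys, values, ddof=1):
    """Per-group standard deviation using the divisor N - ddof (illustrative only)."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    unique_keys = sorted(groups)
    stds = []
    for k in unique_keys:
        xs = groups[k]
        mean = sum(xs) / len(xs)
        # Sum of squared deviations from the group mean.
        sq_dev = sum((x - mean) ** 2 for x in xs)
        stds.append(math.sqrt(sq_dev / (len(xs) - ddof)))
    return unique_keys, stds


# Same keys and values as the example above.
a = [3, 3, 4, 3, 3, 2, 3, 2, 4, 2]
b = [3, 3, 3, 4, 1, 1, 3, 3, 3, 4]
keys, stds = grouped_std(a, b, ddof=1)
print(keys, stds)
```

With ddof=1 the three groups reproduce the values shown in the example; passing ddof=0 divides by N instead and gives smaller values for the non-constant groups.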
- sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.
- Parameters:
values (pdarray) – The values to group and sum
skipna (bool) – boolean which determines if NANs should be skipped
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_sums (pdarray) – One sum per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The grouped sum of a boolean pdarray returns integers.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
- tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type] [source]¶
Return the last n values from each group.
- Parameters:
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.
return_indices (bool, default True) – If True, return the indices of the selected values. Otherwise, return the selected values.
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The last n items of each group. If return_indices is True, the result contains indices; otherwise, it contains values.
Examples
>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = ak.GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 = -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
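The selection rule behind tail can be sketched in plain Python. This is only an illustration of the documented behavior on ordinary lists; group_tail is a hypothetical helper, not an arkouda function.

```python
from collections import defaultdict


def group_tail(keys, values, n=5, return_indices=True):
    """Return the last n positions (or values) of each group, in key order."""
    positions = defaultdict(list)
    for i, k in enumerate(keys):
        positions[k].append(i)
    unique_keys = sorted(positions)
    picked = []
    for k in unique_keys:
        idx = positions[k]
        # Groups shorter than n contribute all of their elements.
        picked.extend(idx[max(len(idx) - n, 0):])
    result = picked if return_indices else [values[i] for i in picked]
    return unique_keys, result


a = [i % 3 for i in range(10)]
v2 = [-2 * i for i in range(10)]
unique_keys, idx2 = group_tail(a, v2, 2, return_indices=True)
_, values2 = group_tail(a, v2, 2, return_indices=False)
print(unique_keys, idx2, values2)
```

The indices are positions in the original array, so they are the same for any values argument, while the returned values depend on the array passed in.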
- to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]¶
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When set to “distribute”, the dataset is written to one file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Returns:
None
GroupBy is not currently supported by Parquet
- unique(values: groupable)[source]¶
Return the set of unique values in each group, as a SegArray.
- Parameters:
values ((list of) pdarray-like) – The values to unique
- Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result ((list of) SegArray) – The unique values of each group
- Raises:
TypeError – Raised if values is or contains Strings or Categorical
- unregister()[source]¶
Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- unregister_groupby_by_name(user_defined_name: str) None [source]¶
Function to unregister GroupBy object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the GroupBy object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray] [source]¶
Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.
- Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
- Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).
The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
Examples
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
- class arkouda.util.IPv4(values)[source]¶
Bases:
arkouda.numpy.pdarrayclass.pdarray
Represent integers as IPv4 addresses.
- Parameters:
values (pdarray, int64) – The integer IP addresses
- Returns:
The same IP addresses
- Return type:
Notes
This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like an int64 pdarray.
- normalize(x)[source]¶
Take in an IP address as a string, integer, or IPAddress object, and convert it to an integer.
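The kind of conversion described here can be sketched with Python's standard ipaddress module. normalize_ip is a hypothetical stand-in written for illustration; it is not the arkouda method and may differ from it in validation details.

```python
import ipaddress


def normalize_ip(x):
    """Convert a dotted-quad string, an int, or an IPv4Address to an int."""
    if isinstance(x, ipaddress.IPv4Address):
        return int(x)
    # IPv4Address accepts both dotted-quad strings and integers.
    return int(ipaddress.IPv4Address(x))


from_str = normalize_ip("192.168.1.1")
from_int = normalize_ip(3232235777)
from_obj = normalize_ip(ipaddress.IPv4Address("10.0.0.1"))
print(from_str, from_int, from_obj)
```

All three input forms map to the same integer representation, which is what makes the integer a convenient storage format for an IPv4 array.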
- register(user_defined_name)[source]¶
Register this IPv4 object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the IPv4 is to be registered under, this will be the root name for underlying components
- Returns:
The same IPv4, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Note that you cannot register two different IPv4s with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the IPv4 with the user_defined_name
See also
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- special_objType = 'IPv4'¶
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute')[source]¶
Override of the pdarray to_hdf to store the special object type
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]¶
Override the pdarray implementation so that the special object type will be used.
- values¶
- arkouda.util.Index¶
- class arkouda.util.SegArray(segments, values, lengths=None, grouping=None)[source]¶
- append(other, axis=0)[source]¶
Append other to self, either vertically (axis=0, length of resulting SegArray increases), or horizontally (axis=1, each sub-array of other appends to the corresponding sub-array of self).
- Parameters:
other (SegArray) – Array of sub-arrays to append
axis (0 or 1) – Whether to append vertically (0) or horizontally (1). If axis=1, other must be same size as self.
- Returns:
axis=0: New SegArray containing all sub-arrays. axis=1: New SegArray of same length, with pairs of sub-arrays concatenated.
- Return type:
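The difference between the two axes can be sketched with plain lists of lists standing in for SegArrays. append_segments is a hypothetical helper used only for illustration; the error raised for mismatched sizes is an assumption.

```python
def append_segments(left, right, axis=0):
    """Append lists of sub-arrays vertically (axis=0) or horizontally (axis=1)."""
    if axis == 0:
        # Vertical: the result has len(left) + len(right) sub-arrays.
        return left + right
    if axis == 1:
        if len(left) != len(right):
            raise ValueError("axis=1 requires the same number of sub-arrays")
        # Horizontal: each sub-array of right extends the matching one in left.
        return [l + r for l, r in zip(left, right)]
    raise ValueError("axis must be 0 or 1")


x = [[1, 2], [3]]
y = [[4], [5, 6]]
vertical = append_segments(x, y, axis=0)
horizontal = append_segments(x, y, axis=1)
print(vertical, horizontal)
```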
- classmethod attach(user_defined_name)[source]¶
Using the defined name, attach to a SegArray that has been registered to the Symbol Table
- Parameters:
user_defined_name (str) – user defined name which the SegArray object was registered under
- Returns:
The resulting SegArray
- Return type:
- Raises:
RuntimeError – Raised if the server could not attach to the SegArray object
See also
- classmethod concat(x, axis=0, ordered=True)[source]¶
Concatenate a sequence of SegArrays
- Parameters:
x (sequence of SegArray) – The SegArrays to concatenate
axis (0 or 1) – Select vertical (0) or horizontal (1) concatenation. If axis=1, all SegArrays must have same size.
ordered (bool) – Must be True. This option is present for compatibility only, because unordered concatenation is not yet supported.
- Returns:
The input arrays joined into one SegArray
- Return type:
- dtype¶
- classmethod from_multi_array(m)[source]¶
Construct a SegArray from a list of columns. This essentially transposes the input, resulting in an array of rows.
- classmethod from_parts(segments, values, lengths=None, grouping=None) SegArray [source]¶
DEPRECATED Construct a SegArray object from its parts
- Parameters:
- Returns:
Data structure representing an array whose elements are variable-length arrays.
- Return type:
Notes
Keyword args ‘lengths’ and ‘grouping’ are not user-facing. They are used by the attach method.
- get_jth(j, return_origins=True, compressed=False, default=0)[source]¶
Select the j-th element of each sub-array, where possible.
- Parameters:
j (int) – The index of the value to get from each sub-array. If j is negative, it counts backwards from the end of each sub-array.
return_origins (bool) – If True, return a logical index indicating where j is in bounds
compressed (bool) – If False, return array is same size as self, with default value where j is out of bounds. If True, the return array only contains values where j is in bounds.
default (scalar) – When compressed=False, the value to return when j is out of bounds for the sub-array
- Returns:
val (pdarray) – compressed=False: The j-th value of each sub-array where j is in bounds and the default value where j is out of bounds. compressed=True: The j-th values of only the sub-arrays where j is in bounds
origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.
Notes
If values are Strings, only the compressed format is supported.
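The compressed and uncompressed return formats can be compared in a small pure-Python sketch over lists of lists. get_jth_sketch is a hypothetical helper that always returns the in-bounds mask; it illustrates the documented semantics rather than the arkouda implementation.

```python
def get_jth_sketch(subarrays, j, compressed=False, default=0):
    """Select element j of each sub-array; negative j counts from the end."""
    in_bounds = [-len(s) <= j < len(s) for s in subarrays]
    if compressed:
        # Only sub-arrays where j is in bounds contribute a value.
        vals = [s[j] for s, ok in zip(subarrays, in_bounds) if ok]
    else:
        # Same length as the input, with default where j is out of bounds.
        vals = [s[j] if ok else default for s, ok in zip(subarrays, in_bounds)]
    return vals, in_bounds


segs = [[0, 1, 2, 3], [4, 5, 6], [], [7]]
full, mask = get_jth_sketch(segs, 1)
short, _ = get_jth_sketch(segs, 1, compressed=True)
last, last_mask = get_jth_sketch(segs, -1)
print(full, mask, short, last, last_mask)
```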
- get_length_n(n, return_origins=True)[source]¶
Return all sub-arrays of length n, as a list of columns.
- Parameters:
n (int) – Length of sub-arrays to select
return_origins (bool) – Return a logical index indicating which sub-arrays are length n
- Returns:
columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long sub-arrays from the SegArray. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where sub-array has length n.
- get_ngrams(n, return_origins=True)[source]¶
Return all n-grams from all sub-arrays.
- Parameters:
n (int) – Length of n-gram
return_origins (bool) – If True, return an int64 array indicating which sub-array each returned n-gram came from.
- Returns:
ngrams (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-gram.
origin_indices (pdarray, int) – The index of the sub-array from which the corresponding n-gram originated
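The column-oriented return format can be sketched in plain Python. get_ngrams_sketch is a hypothetical helper operating on lists of lists; it shows how each n-gram becomes one row spread across n columns, with an origin index per row.

```python
def get_ngrams_sketch(subarrays, n):
    """Return n columns (one row per n-gram) and the origin index of each row."""
    rows, origins = [], []
    for idx, s in enumerate(subarrays):
        # Sub-arrays shorter than n yield no n-grams.
        for start in range(len(s) - n + 1):
            rows.append(s[start:start + n])
            origins.append(idx)
    columns = [[row[c] for row in rows] for c in range(n)]
    return columns, origins


ngram_cols, ngram_origins = get_ngrams_sketch([[1, 2, 3], [4], [5, 6]], 2)
print(ngram_cols, ngram_origins)
```

Reading across the columns at a fixed row position recovers one n-gram; here the rows are (1, 2), (2, 3), and (5, 6).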
- get_prefixes(n, return_origins=True, proper=True)[source]¶
Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a prefix.
- Returns:
prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-prefix, False otherwise.
- get_suffixes(n, return_origins=True, proper=True)[source]¶
Return the n-long suffix of each sub-array, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a suffix.
- Returns:
suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.
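The effect of the proper flag on both get_prefixes and get_suffixes can be sketched with lists of lists. The two helpers below are hypothetical illustrations of the documented rule (at least n+1 elements when proper=True, at least n otherwise), not the arkouda code.

```python
def _columns(rows, n):
    """Transpose a list of n-long rows into n columns."""
    return [[row[c] for row in rows] for c in range(n)]


def prefixes_sketch(subarrays, n, proper=True):
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in subarrays]
    rows = [s[:n] for s, ok in zip(subarrays, mask) if ok]
    return _columns(rows, n), mask


def suffixes_sketch(subarrays, n, proper=True):
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in subarrays]
    rows = [s[len(s) - n:] for s, ok in zip(subarrays, mask) if ok]
    return _columns(rows, n), mask


segs = [[1, 2, 3], [4, 5], [6]]
pre_proper = prefixes_sketch(segs, 2, proper=True)
pre_any = prefixes_sketch(segs, 2, proper=False)
suf_proper = suffixes_sketch(segs, 2, proper=True)
suf_any = suffixes_sketch(segs, 2, proper=False)
print(pre_proper, pre_any, suf_proper, suf_any)
```

With proper=False the two-element sub-array [4, 5] is returned whole as both its own prefix and suffix; with proper=True it is excluded.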
- property grouping¶
- hash() Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray] [source]¶
Compute a 128-bit hash of each segment.
- intersect(other)[source]¶
Computes the intersection of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d intersections of the segments of self and other
- Return type:
See also
pdarraysetops.intersect1d
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.intersect(seg_b)
SegArray([ [1, 3], [4] ])
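The segment-wise nature of the operation can be sketched with Python sets. segment_intersect is a hypothetical helper; returning each result as a sorted list of unique values is an assumption modeled on the example output above.

```python
def segment_intersect(left, right):
    """Intersect corresponding sub-arrays, returning sorted unique values."""
    return [sorted(set(l) & set(r)) for l, r in zip(left, right)]


# Same sub-arrays as the example above: seg_a = [a, b], seg_b = [c, d].
a = [1, 2, 3, 1, 4]
b = [3, 1, 4, 5]
c = [1, 3, 3, 5]
d = [2, 2, 4]
intersection = segment_intersect([a, b], [c, d])
print(intersection)
```

Each pair of sub-arrays is combined independently, so the first result depends only on a and c and the second only on b and d.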
- is_registered() bool [source]¶
Checks if the name of the SegArray object is registered in the Symbol Table
- Returns:
True if SegArray is registered, False if not
- Return type:
bool
See also
- classmethod load(prefix_path, dataset='segarray', segment_name='segments', value_name='values')[source]¶
- logger¶
- property nbytes¶
The size of the segarray in bytes.
- Returns:
The size of the segarray in bytes.
- Return type:
int
- property non_empty¶
- objType = 'SegArray'¶
- classmethod read_hdf(prefix_path, dataset='segarray')[source]¶
Load a saved SegArray from HDF5. All arguments must match what was supplied to SegArray.save()
- Parameters:
prefix_path (str) – Directory and filename prefix
dataset (str) – Name prefix for saved data within the HDF5 files
- Return type:
- register(user_defined_name)[source]¶
Register this SegArray object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name which this SegArray object will be registered under
- Returns:
The same SegArray, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluid programming style. Note that you cannot register two different SegArrays with the same name.
- Return type:
- Raises:
RegistrationError – Raised if the server could not register the SegArray object
Notes
Objects registered with the server are immune to deletion until they are unregistered.
See also
- registered_name: str | None = None¶
- remove_repeats(return_multiplicity=False)[source]¶
Condense sequences of repeated values within a sub-array to a single value.
- Parameters:
return_multiplicity (bool) – If True, also return the number of times each value was repeated.
- Returns:
norepeats (SegArray) – Sub-arrays with runs of repeated values replaced with single value
multiplicity (SegArray) – If return_multiplicity=True, this array contains the number of times each value in the returned SegArray was repeated in the original SegArray.
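Run condensing within each sub-array can be sketched with itertools.groupby from the standard library. remove_repeats_sketch is a hypothetical helper on lists of lists; it always returns the multiplicities, whereas the method returns them only on request.

```python
from itertools import groupby


def remove_repeats_sketch(subarrays):
    """Collapse runs of equal adjacent values; also return each run's length."""
    norepeats, multiplicity = [], []
    for s in subarrays:
        runs = [(value, len(list(run))) for value, run in groupby(s)]
        norepeats.append([value for value, _ in runs])
        multiplicity.append([count for _, count in runs])
    return norepeats, multiplicity


condensed, counts = remove_repeats_sketch([[1, 1, 2, 2, 2, 1], [3], []])
print(condensed, counts)
```

Only adjacent repeats are merged, so the value 1 appears twice in the first condensed sub-array because its two runs are separated by a run of 2s.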
- save(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]¶
DEPRECATED Save the SegArray to HDF5. The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When set to “distribute”, the dataset is written to one file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- segments¶
- set_jth(i, j, v)[source]¶
Set the j-th element of each sub-array in a subset.
- Parameters:
- Raises:
ValueError – If j is out of bounds in any of the sub-arrays specified by i.
- setdiff(other)[source]¶
Computes the set difference of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d set difference of the segments of self and other
- Return type:
See also
pdarraysetops.setdiff1d
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setdiff(seg_b)
SegArray([ [2, 4], [1, 3, 5] ])
- setxor(other)[source]¶
Computes the symmetric difference of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d symmetric difference of the segments of self and other
- Return type:
See also
pdarraysetops.setxor1d
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setxor(seg_b)
SegArray([ [2, 4, 5], [1, 3, 5, 2] ])
- size¶
- to_hdf(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]¶
Save the SegArray to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When set to “distribute”, the dataset is written to one file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
None
See also
- to_list()[source]¶
Convert the segarray into a list containing sub-arrays
- Returns:
A list with the same sub-arrays (also list) as this segarray
- Return type:
list
See also
Examples
>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_list()
[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]
>>> type(segarr.to_list())
list
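How the segments and values arrays determine the sub-arrays can be sketched in plain Python. segments_to_list is a hypothetical helper; it assumes each entry of segments is the starting offset of one sub-array in values, with the last sub-array running to the end.

```python
def segments_to_list(segments, values):
    """Split a flat values list into sub-lists at the given start offsets."""
    # Each sub-array ends where the next one starts; the last ends at len(values).
    ends = segments[1:] + [len(values)]
    return [values[start:end] for start, end in zip(segments, ends)]


nested = segments_to_list([0, 4, 7], list(range(12)))
print(nested)
```

Repeated offsets produce empty sub-arrays under this layout, for example segments [0, 2, 2] over three values gives a middle sub-array of length zero.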
- to_ndarray()[source]¶
Convert the array into a numpy.ndarray containing sub-arrays
- Returns:
A numpy ndarray with the same sub-arrays (also numpy.ndarray) as this array
- Return type:
np.ndarray
See also
array, to_list
Examples
>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_ndarray()
array([array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9, 10, 11])])
>>> type(segarr.to_ndarray())
numpy.ndarray
- to_parquet(prefix_path, dataset='segarray', mode: str = 'truncate', compression: str | None = None)[source]¶
Save the SegArray object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the object to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – Deprecated. Parameter kept to maintain functionality of other calls. Only Truncate supported. By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – If write mode is not Truncate.
Notes
Append mode for Parquet has been deprecated. It was not implemented for SegArray.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]¶
Sends a Segmented Array to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Segmented Array is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, using the range {port..(port+numLocales-1)} (e.g., when running an Arkouda server on 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
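The port arithmetic from the port parameter's example can be sketched in plain Python. transfer_ports is a hypothetical helper, not part of arkouda; it assumes one port per locale, counting up from the port passed in, as in the 4-node example.

```python
def transfer_ports(port, num_locales):
    """List the consecutive ports needed for a server with num_locales locales."""
    return list(range(port, port + num_locales))


ports = transfer_ports(1234, 4)
print(ports)
```

A check like this can help confirm in advance that every port in the range is free on the sending and receiving hosts.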
- union(other)[source]¶
Computes the union of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d union of the segments of self and other
- Return type:
See also
pdarraysetops.union1d
Examples
>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.union(seg_b)
SegArray([ [1, 2, 3, 4, 5], [1, 2, 3, 4, 5] ])
- unregister()[source]¶
Unregister this SegArray object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table
Notes
Objects registered with the server are immune to deletion until they are unregistered.
See also
- static unregister_segarray_by_name(user_defined_name)[source]¶
Using the defined name, remove the registered SegArray object from the Symbol Table
- Parameters:
user_defined_name (str) – user defined name which the SegArray object was registered under
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table
See also
- update_hdf(prefix_path: str, dataset: str = 'segarray', repack: bool = True)[source]¶
Overwrite the dataset with the name provided with this SegArray object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the SegArray
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, this will create a copy of the file with the new data
- valsize¶
- values¶
- class arkouda.util.Strings(strings_pdarray: arkouda.numpy.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]¶
Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.
- entry¶
Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of
offsets array: starting indices for each string
bytes array: raw bytes of all strings joined by nulls
- Type:
- size¶
The number of strings in the array
- Type:
int_scalars
- nbytes¶
The total number of bytes in all strings
- Type:
int_scalars
- ndim¶
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape¶
The sizes of each dimension of the array
- Type:
tuple
- dtype¶
The dtype is ak.str
- Type:
dtype
- logger¶
Used for all logging operations
- Type:
ArkoudaLogger
Notes
Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
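The offsets-plus-bytes layout can be sketched in plain Python. pack_strings and unpack_strings are hypothetical helpers written for illustration; they assume UTF-8 encoding and that every string, including the last, is followed by a null byte.

```python
def pack_strings(strings):
    """Build (offsets, data): start index of each string and null-delimited bytes."""
    offsets, buf = [], bytearray()
    for s in strings:
        offsets.append(len(buf))
        buf += s.encode("utf-8") + b"\x00"
    return offsets, bytes(buf)


def unpack_strings(offsets, data):
    """Recover the strings by slicing between offsets and dropping the null."""
    ends = offsets[1:] + [len(data)]
    return [data[start:end - 1].decode("utf-8") for start, end in zip(offsets, ends)]


offsets, data = pack_strings(["ab", "", "cde"])
roundtrip = unpack_strings(offsets, data)
print(offsets, data, roundtrip)
```

Because the offsets mark where each string starts, an empty string still occupies one byte (its terminator) and remains addressable.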
- BinOps¶
- astype(dtype: numpy.dtype | str) arkouda.numpy.pdarrayclass.pdarray [source]¶
Cast values of Strings object to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- static attach(user_defined_name: str) Strings [source]¶
class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which the Strings object was registered under
- Returns:
the Strings object registered with user_defined_name in the arkouda server
- Return type:
Strings object
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- cached_regex_patterns() List [source]¶
Returns the regex patterns for which Match objects have been cached
- capitalize() Strings [source]¶
Returns a new Strings in which each string has its first letter capitalized and the remaining letters lowercased.
- Returns:
Strings from the original replaced with the capitalized equivalent.
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown.
See also
Strings.lower, Strings.upper, Strings.title
Examples
>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3', 'StrINgS aRe Here 4'])
>>> strings.capitalize()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3', 'Strings are here 4'])
- contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray [source]¶
Check whether each element contains the given substring.
- Parameters:
substr (bytes or str_scalars) – The substring in the form of string or byte array to search for
regex (bool, default=False) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True True True True True])
>>> strings.contains('string \d', regex=True)
array([True True True True True])
- decode(fromEncoding: str, toEncoding: str = 'UTF-8') Strings [source]¶
Return a new Strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
fromEncoding (str) – The current encoding of the strings object
toEncoding (str, default="UTF-8") – The encoding that the strings will be converted to, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- dtype¶
- encode(toEncoding: str, fromEncoding: str = 'UTF-8') Strings [source]¶
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
toEncoding (str) – The encoding that the strings will be converted to
fromEncoding (str, default="UTF-8") – The current encoding of the strings object, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray [source]¶
Check whether each element ends with the given substring.
- Parameters:
substr (bytes or str_scalars) – The suffix to search for
regex (bool, default=False) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True True True True True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True True True True True])
- equals(other: Any) arkouda.numpy.dtypes.bool_scalars [source]¶
Whether Strings are the same size and all entries are equal.
- Parameters:
other (Any) – object to compare.
- Returns:
True if the Strings are the same, otherwise False.
- Return type:
bool
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
- find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray] [source]¶
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches
- Parameters:
pattern (bytes or str_scalars) – The regex pattern used to find matches
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positions of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2 2 2 2 2])
>>> starts
array([0 9 0 9 0 9 0 9 0 9])
>>> lens
array([1 1 1 1 1 1 1 1 1 1])
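The relationship between the three returned arrays may be easier to see in a client-side analogue. The following is only a sketch using Python's standard re module on a plain list; it does not call arkouda, the helper name find_locations_py is hypothetical, and Python's regex dialect is not identical to re2.

```python
import re

def find_locations_py(strings, pattern):
    """Pure-Python analogue: per-string match counts plus flat starts and lengths."""
    regex = re.compile(pattern)
    num_matches, starts, lens = [], [], []
    for s in strings:
        matches = list(regex.finditer(s))
        num_matches.append(len(matches))                    # one entry per input string
        starts.extend(m.start() for m in matches)           # one entry per match
        lens.extend(m.end() - m.start() for m in matches)   # one entry per match
    return num_matches, starts, lens

strings = [f"{i} string {i}" for i in range(1, 6)]
num_matches, starts, lens = find_locations_py(strings, r"\d")
```

Note that num_matches has one entry per input string, while starts and lens have one entry per match, so their length equals the sum of num_matches.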
- findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple [source]¶
Return a new Strings containing all non-overlapping matches of pattern
- Parameters:
pattern (bytes or str_scalars) – Regex used to find matches
return_match_origins (bool, default=False) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
- flatten() Strings [source]¶
Return a copy of the array collapsed into one dimension.
- Return type:
A copy of the input array, flattened to one dimension.
Note
As multidimensional Strings are not currently supported, flatten on a Strings object will always return itself.
- static from_parts(offset_attrib: arkouda.numpy.pdarrayclass.pdarray | str, bytes_attrib: arkouda.numpy.pdarrayclass.pdarray | str) Strings [source]¶
Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.
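- Parameters:
offset_attrib (pdarray or str) – The array containing the offsets
bytes_attrib (pdarray or str) – The array containing the string values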
- Returns:
object representing a segmented strings array on the server
- Return type:
Strings
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets and bytes separately to the server. This results in two entries in the symbol table, and we need to instruct the server to assemble them into a composite entity.
- static from_return_msg(rep_msg: str) Strings [source]¶
Factory method for creating a Strings object from an Arkouda server response message
- Parameters:
rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234
- Returns:
object representing a segmented strings array on the server
- Return type:
Strings
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.
- fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match [source]¶
Returns a match object where elements match only if the whole string matches the regular expression pattern
- Parameters:
pattern (bytes or str_scalars) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Match
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
- get_bytes() arkouda.numpy.pdarrayclass.pdarray [source]¶
Getter for the bytes component (uint8 pdarray) of this Strings.
- Returns:
Pdarray of bytes of the string accessed
- Return type:
pdarray, uint8
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
- get_lengths() arkouda.numpy.pdarrayclass.pdarray [source]¶
Return the length of each string in the array.
- Returns:
The length of each string
- Return type:
pdarray, int
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- get_offsets() arkouda.numpy.pdarrayclass.pdarray [source]¶
Getter for the offsets component (int64 pdarray) of this Strings.
- Returns:
Pdarray of offsets of the string accessed
- Return type:
pdarray, int64
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
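The get_bytes and get_offsets examples suggest a segmented layout: one flat byte array in which every string is followed by a 0 byte, plus an array holding the start position of each string. The sketch below reproduces that layout in plain Python for illustration only. The helper names to_segments and from_segments are hypothetical, and UTF-8 encoding is an assumption (the documented examples are ASCII only).

```python
def to_segments(strings):
    """Build (offsets, data): encoded bytes with a 0 terminator after each string."""
    offsets, data = [], bytearray()
    for s in strings:
        offsets.append(len(data))        # start position of this string
        data.extend(s.encode("utf-8"))   # assumption: UTF-8 encoded bytes
        data.append(0)                   # null terminator
    return offsets, list(data)

def from_segments(offsets, data):
    """Invert to_segments: slice between consecutive offsets, dropping the terminator."""
    out = []
    for i, start in enumerate(offsets):
        end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
        out.append(bytes(data[start:end - 1]).decode("utf-8"))
    return out

offsets, data = to_segments(["one", "two", "three"])
round_trip = from_segments(offsets, data)
```

Because of the terminator, each string occupies one more byte than its encoded length, which is why the second offset is 4 rather than 3.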
- get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.numpy.pdarrayclass.pdarray] [source]¶
Return the n-long prefix of each string, where possible
- Parameters:
n (int_scalars) – Length of prefix
return_origins (bool, default=True) – If True, return a logical index indicating which strings were long enough to return an n-prefix
proper (bool, default=True) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.
- Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
- get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.numpy.pdarrayclass.pdarray] [source]¶
Return the n-long suffix of each string, where possible
- Parameters:
n (int_scalars) – Length of suffix
return_origins (bool, default=True) – If True, return a logical index indicating which strings were long enough to return an n-suffix
proper (bool, default=True) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.
- Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.
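The interaction between n, proper, and the returned mask can be sketched in plain Python. This is an illustrative analogue on a Python list, not arkouda code; the helper names are hypothetical, and lengths are counted in Python characters here.

```python
def get_prefixes_py(strings, n, proper=True):
    """Return (prefixes, mask); mask[i] is True where strings[i] is long enough."""
    min_len = n + 1 if proper else n     # proper prefixes need at least n+1 characters
    mask = [len(s) >= min_len for s in strings]
    prefixes = [s[:n] for s, keep in zip(strings, mask) if keep]
    return prefixes, mask

def get_suffixes_py(strings, n, proper=True):
    """Return (suffixes, mask) with the same length rule as get_prefixes_py."""
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in strings]
    suffixes = [s[len(s) - n:] for s, keep in zip(strings, mask) if keep]
    return suffixes, mask

words = ["ab", "abc", "a", "abcd"]
proper_prefixes, proper_mask = get_prefixes_py(words, 2)
any_prefixes, any_mask = get_prefixes_py(words, 2, proper=False)
proper_suffixes, _ = get_suffixes_py(words, 2)
```

The returned prefix or suffix list has exactly as many elements as there are True values in the mask, so the mask can be used to align results with the original array.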
- group() arkouda.numpy.pdarrayclass.pdarray [source]¶
Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
pdarray
Notes
If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.
- Raises:
RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message
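To illustrate what "groups but does not sort" means, here is a small pure-Python sketch that builds a grouping permutation with a hash table (a dict). It is not the server's algorithm, only an analogue of the hash-based behavior described in the notes; group_permutation is a hypothetical name.

```python
def group_permutation(strings):
    """Return a permutation that places equal strings in contiguous blocks.

    Blocks appear in order of first occurrence, not in sorted order.
    """
    buckets = {}
    for i, s in enumerate(strings):
        buckets.setdefault(s, []).append(i)   # dict lookup is hash-based
    return [i for indices in buckets.values() for i in indices]

values = ["pear", "apple", "pear", "fig", "apple"]
perm = group_permutation(values)
grouped = [values[i] for i in perm]
```

Applying the permutation yields contiguous blocks of equal strings, but "pear" precedes "apple", so the result is grouped without being sorted.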
- hash() Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray] [source]¶
Compute a 128-bit hash of each string.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Tuple[pdarray, pdarray]
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- property inferred_type: str¶
Return a string of the type inferred from the values.
- info() str [source]¶
Returns a JSON formatted string containing information about all components of self
- Parameters:
None
- Returns:
JSON string containing information about all components of self
- Return type:
str
- is_registered() numpy.bool_ [source]¶
Return True iff the object is contained in the registry
- Parameters:
None
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- isalnum() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.
- Returns:
True for elements that are alphanumeric, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
- isalpha() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.
- Returns:
True for elements that are alphabetic, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.islower, Strings.isupper, Strings.istitle, Strings.isalnum
Examples
>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA','StringB','StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA', 'StringB', 'StringC'])
>>> strings.isalpha()
array([False False False True True True])
- isdecimal() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.
- Returns:
True for elements that are decimals, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
- isdigit() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.
- Returns:
True for elements that are digits, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
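The difference between isdecimal and isdigit on the special-character inputs mirrors the behavior of Python's built-in str methods of the same names. The snippet below runs those built-ins on the same five strings; it uses plain Python only and is offered as a comparison point, not as a statement about how the server implements the checks.

```python
samples = ["3.14", "0", "²", "2³₇", "2³x₇"]

# str.isdecimal accepts only characters usable to form base-10 numbers.
decimal_flags = [s.isdecimal() for s in samples]

# str.isdigit additionally accepts digit-like characters such as
# superscript and subscript digits.
digit_flags = [s.isdigit() for s in samples]
```

Superscript and subscript digits count as digits but not as decimals, which is why "²" and "2³₇" differ between the two results.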
- isempty() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.
- Returns:
True for elements that are the empty string, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '', '', ''])
>>> strings.isempty()
array([False False False True True True])
- islower() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase
- Returns:
True for elements that are entirely lowercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
- isspace() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ’, ‘\t’, ‘\n’, ‘\v’, ‘\f’, ‘\r’).
- Returns:
True for elements that are whitespace, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '\t', '\n', '\v', '\f', '\r', ' \t\n\v\f\r'])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ', '\u0009', '\n', '\u000B', '\u000C', '\u000D', ' \u0009\n\u000B\u000C\u000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
- istitle() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase
- Returns:
True for elements that are titlecase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
- isupper() arkouda.numpy.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase
- Returns:
True for elements that are entirely uppercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
- logger¶
- lower() Strings [source]¶
Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Returns:
Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
- lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings [source]¶
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (bytes or str_scalars, default="") – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
Strings
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
- match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match [source]¶
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern
- Parameters:
pattern (bytes or str_scalars) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Match
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- objType = 'Strings'¶
- peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple[Strings, Strings] [source]¶
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (bytes or str_scalars) – The separator where the split will occur
times (int_scalars, default=1) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool, default=False) – If true, append the delimiter to the end of the first return array. By default, the delimiter at the split point is omitted from both return arrays (see the examples).
keepPartial (bool, default=False) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool, default=False) – If true, peel from the right instead of the left (see also rpeel)
regex (bool, default=False) – Indicates whether delimiter is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
- Returns:
- left: Strings
The field(s) peeled from the start of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
Tuple[Strings, Strings]
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
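How times, includeDelimiter, and keepPartial combine can be sketched in plain Python. The function below is a hypothetical client-side analogue (peel_py is not an arkouda function) that handles only literal delimiters and peeling from the left; it was written to reproduce the four example outputs above, not to describe the server implementation.

```python
def peel_py(strings, delimiter, times=1, include_delimiter=False, keep_partial=False):
    """Split each string at the times-th occurrence of a literal delimiter."""
    left, right = [], []
    for s in strings:
        start, pos = 0, -1
        for _ in range(times):
            pos = s.find(delimiter, start)
            if pos == -1:                 # fewer than `times` delimiters present
                break
            start = pos + len(delimiter)  # continue searching after this delimiter
        if pos == -1:
            # Unsplittable strings go wholly to one side.
            l, r = (s, "") if keep_partial else ("", s)
        else:
            l = s[:start] if include_delimiter else s[:pos]
            r = s[start:]
        left.append(l)
        right.append(r)
    return left, right

s = ["a.b", "c.d", "e.f.g"]
default_result = peel_py(s, ".")
with_delim = peel_py(s, ".", include_delimiter=True)
twice = peel_py(s, ".", times=2)
twice_partial = peel_py(s, ".", times=2, keep_partial=True)
```

With times=2, only 'e.f.g' contains enough delimiters to be split; keepPartial decides which output array receives the other two strings intact.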
- pretty_print_info() None [source]¶
Prints information about all components of self in a human readable format
- Parameters:
None
- Return type:
None
- regex_split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple [source]¶
Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur
- Parameters:
pattern (bytes or str_scalars) – Regex used to split strings into substrings
maxsplit (int, default=0) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences
return_segments (bool, default=False) – If True, return mapping of original strings to first substring in return array.
- Returns:
Strings – Substrings with pattern matches removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.regex_split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
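The flat result and the segments array in the example can be reproduced with Python's standard re.split, which uses the same maxsplit convention (0 means no limit). This is a sketch on a plain Python list; regex_split_py is a hypothetical helper and Python's regex dialect differs from re2 in places.

```python
import re

def regex_split_py(strings, pattern, maxsplit=0):
    """Return (flat, segments): all substrings, plus each string's first index in flat."""
    flat, segments = [], []
    for s in strings:
        segments.append(len(flat))   # index of this string's first substring
        flat.extend(re.split(pattern, s, maxsplit=maxsplit))
    return flat, segments

strings = ["1_2___", "____", "3", "__4___5____6___7", ""]
flat, segments = regex_split_py(strings, "_+", maxsplit=2)
```

A match at the very start or end of a string produces an empty substring, which is why '____' contributes two empty strings to the flat result.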
- register(user_defined_name: str) Strings [source]¶
Register this Strings object with a user-defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation: registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.
- Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
- Returns:
The same Strings object, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different objects with the same name.
- Return type:
Strings
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – Raised if the server was unable to register the Strings object with the user_defined_name. If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- registered_name: str | None = None¶
- rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False) Tuple[Strings, Strings] [source]¶
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (bytes or str_scalars) – The separator where the split will occur
times (int_scalars, default=1) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool, default=False) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool, default=False) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool, default=False) – Indicates whether delimiter is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
Tuple[Strings, Strings]
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- save(prefix_path: str, dataset: str = 'strings_array', mode: Literal['truncate', 'append'] = 'truncate', save_offsets: bool = True, compression: Literal['snappy', 'gzip', 'brotli', 'zstd', 'lz4'] | None = None, file_format: Literal['HDF5', 'Parquet'] = 'HDF5', file_type: Literal['single', 'distribute'] = 'distribute') str [source]¶
DEPRECATED. Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. HDF5 also supports single files, in which case the file name will be exactly the one provided.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str, default="strings_array") – The name of the Strings dataset to be written, defaults to strings_array
mode ({"truncate", "append"}, default = "truncate") – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool, default=True) – If True, instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read. This is not supported for Parquet files.
compression ({"snappy", "gzip", "brotli", "zstd", "lz4"}, optional) – Sets the compression type used with Parquet files
file_format ({"HDF5", "Parquet"}, default = "HDF5") – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type ({"single", "distribute"}, default = "distribute") – If "distribute", the dataset is distributed over one file per locale. If "single", the dataset is saved to one file.
- Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.
- search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match [source]¶
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern
- Parameters:
pattern (bytes or str_scalars) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Match
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
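The distinction between match, search, and fullmatch follows the same convention as Python's standard re module. The sketch below applies the three re methods to the same inputs as the examples, element by element, on a plain Python list. It is a comparison aid only and does not call arkouda; the spans helper is hypothetical.

```python
import re

strings = ["1_2___", "____", "3", "__4___5____6___7", ""]
pattern = re.compile("_+")

def spans(matcher):
    """Apply a bound regex method to every string; return its span or None."""
    results = []
    for s in strings:
        m = matcher(s)
        results.append(m.span() if m else None)
    return results

match_spans = spans(pattern.match)          # must match at the start of the string
search_spans = spans(pattern.search)        # first match anywhere in the string
fullmatch_spans = spans(pattern.fullmatch)  # must match the entire string
```

The first string illustrates the difference: it does not start with an underscore, so match fails, while search finds the underscore at position 1.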
- size: arkouda.numpy.dtypes.int_scalars¶
- split(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple [source]¶
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool, default=False) – If True, also return mapping of original strings to first substring in return array.
regex (bool, default=False) – Indicates whether delimiter is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.split('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, mapping = orig.split('|', return_segments=True)
>>> mapping
array([0 2 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_split, under_map = under.split('_+', return_segments=True, regex=True)
>>> under_split
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0 2 5])
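The segments mapping is what lets a caller recover which substrings came from which original string. As a minimal illustration in plain Python (the unflatten helper is hypothetical and operates on ordinary lists copied from the example output, not on arkouda objects):

```python
def unflatten(flat, segments):
    """Regroup a flat list using the start index of each original element."""
    bounds = list(segments) + [len(flat)]   # add the end of the last group
    return [flat[bounds[i]:bounds[i + 1]] for i in range(len(segments))]

flat = ["one", "two", "three", "four", "five", "six"]
mapping = [0, 2, 5]
nested = unflatten(flat, mapping)
```

Each entry of mapping marks where a new original string begins in the flat array, so consecutive entries delimit one group.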
- startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.numpy.pdarrayclass.pdarray [source]¶
Check whether each element starts with the given substring.
- Parameters:
substr (bytes or str_scalars) – The prefix to search for
regex (bool, default=False) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not a bytes ior str_scalars
ValueError – Rasied if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True True True True True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith(r'\d str', regex = True)
array([True True True True True])
- stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings [source]¶
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (bytes or str_scalars, default="") – String inserted between self and other
toLeft (bool, default=False) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
Strings
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
ValueError – Raised if times is < 1
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
- strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings [source]¶
Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
- Parameters:
chars (bytes or str_scalars, optional) – the set of characters to be removed
- Returns:
Strings object with the leading and trailing characters matching the set of characters in the chars argument removed
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['Strings ', ' StringS ', 'StringS '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS ', ' 1StringS 12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
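The "set of characters" behaviour described above matches Python's built-in str.strip, so it can be checked locally without a server. This is a minimal sketch; strip_all is a hypothetical helper, and treating an empty chars value as "strip whitespace" is an assumption taken from the description above.

```python
def strip_all(strings, chars=''):
    """Pure-Python model of Strings.strip applied element-wise."""
    # chars is a set of characters, not a prefix or suffix:
    # any combination of them is removed from both ends.
    return [s.strip(chars) if chars else s.strip() for s in strings]

print(strip_all(['Strings ', ' StringS ', 'StringS ']))
# ['Strings', 'StringS', 'StringS']
print(strip_all(['Strings 1', '1 StringS ', ' 1StringS 12 '], ' 12'))
# ['Strings', 'StringS', 'StringS']
```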
- sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings [source]¶
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur.
- Parameters:
pattern (bytes or str_scalars) – The regex to substitute
repl (bytes or str_scalars) – The substring to replace pattern matches with
count (int, default=0) – The maximum number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
- subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple [source]¶
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
- Parameters:
pattern (bytes or str_scalars) – The regex to substitute
repl (bytes or str_scalars) – The substring to replace pattern matches with
count (int, default=0) – The maximum number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
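The per-element count semantics of sub() and subn() can be reproduced locally with Python's re.subn, which is a convenient way to sanity-check a pattern before sending it to the server. This is a sketch under the assumption that the pattern uses only syntax common to re and re2; subn_all is a hypothetical helper name.

```python
import re

def subn_all(strings, pattern, repl, count=0):
    """Pure-Python model of Strings.subn: count limits substitutions per element."""
    results = [re.subn(pattern, repl, s, count=count) for s in strings]
    new_strings = [text for text, _ in results]
    num_subs = [n for _, n in results]
    return new_strings, num_subs

strings = ['1_2___', '____', '3', '__4___5____6___7', '']
print(subn_all(strings, '_+', '-', count=2))
# (['1-2-', '-', '3', '-4-5____6___7', ''], [2, 1, 0, 2, 0])
```

As in the example above, count applies to each element independently, and count=0 replaces every match.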
- title() Strings [source]¶
Return a new Strings in which each element of the original is converted to title case.
- Returns:
Strings with each element of the original converted to title case.
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown.
See also
Strings.lower, Strings.upper
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
- to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False) str [source]¶
Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str, default="strings_array") – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str, default=",") – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, default=False) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Returns:
response message
- Return type:
str
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
- to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: Literal['truncate', 'append'] = 'truncate', save_offsets: bool = True, file_type: Literal['single', 'distribute'] = 'distribute') str [source]¶
Save the Strings object to HDF5. The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str, default="strings_array") – The name of the Strings dataset to be written, defaults to strings_array
mode ({"truncate", "append"}, default = "truncate") – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool, default=True) – If True, instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
file_type ({"single", "distribute"}, default = "distribute") – With "distribute", the dataset is distributed over one file per locale. With "single", the dataset is saved to one file.
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string.
The hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- to_list() list [source]¶
Convert the Strings object to a list, transferring data from the arkouda server to Python. If the Strings object exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list with the same strings as this Strings object
- Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
<class 'list'>
- to_ndarray() numpy.ndarray [source]¶
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
array, to_list
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
<class 'numpy.ndarray'>
- to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: Literal['truncate', 'append'] = 'truncate', compression: Literal['snappy', 'gzip', 'brotli', 'zstd', 'lz4'] | None = None) str [source]¶
Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str, default="strings_array") – Name of the dataset to create in files (must not already exist)
mode ({"truncate", "append"}, default = "truncate") – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression ({"snappy", "gzip", "brotli", "zstd", "lz4"}, optional) – Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.
‘append’ write mode is supported, but is not efficient.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars) str | memoryview [source]¶
Sends a Strings object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, using the range port through port + numLocales - 1 (e.g., when an Arkouda server runs on 4 nodes and 1234 is passed as port, Arkouda uses ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
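When opening firewall rules for transfer(), it can help to enumerate the ports implied by the port parameter description. The sketch below follows the 4-node example given there (one port per locale, starting at port); transfer_ports is a hypothetical helper and is not part of Arkouda.

```python
def transfer_ports(port, num_locales):
    """List the ports used for a transfer: one per locale, starting at port."""
    return list(range(port, port + num_locales))

print(transfer_ports(1234, 4))  # [1234, 1235, 1236, 1237]
```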
- unregister() None [source]¶
Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- static unregister_strings_by_name(user_defined_name: str) None [source]¶
Unregister a Strings object in the arkouda server previously registered via register()
- Parameters:
user_defined_name (str) – The registered name of the Strings object
- update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True) str [source]¶
Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str, default="strings_array") – Name of the dataset to create in files
save_offsets (bool, default=True) – If True, instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to False will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- upper() Strings [source]¶
Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Returns:
Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
- class arkouda.util.Timedelta(pda, unit: str = _BASE_UNIT)[source]¶
Bases:
_AbstractBaseTime
Represents a duration, the difference between two dates or times.
Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.
- Parameters:
pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array)
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.
Possible values:
‘weeks’ or ‘w’
‘days’ or ‘d’
‘hours’ or ‘h’
‘minutes’, ‘m’, or ‘t’
‘seconds’ or ‘s’
‘milliseconds’, ‘ms’, or ‘l’
‘microseconds’, ‘us’, or ‘u’
‘nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers
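Since the underlying values are always held in nanoseconds, the unit argument effectively selects a multiplier. The sketch below models that conversion for the units listed above. The helper to_nanoseconds and its two lookup tables are hypothetical, and rejecting ambiguous prefixes (such as 'mi') is an assumption, since the text does not say how Arkouda resolves them.

```python
# Nanoseconds per unit, keyed by full unit name.
NANOS_PER_UNIT = {
    'weeks': 7 * 24 * 3600 * 10**9,
    'days': 24 * 3600 * 10**9,
    'hours': 3600 * 10**9,
    'minutes': 60 * 10**9,
    'seconds': 10**9,
    'milliseconds': 10**6,
    'microseconds': 10**3,
    'nanoseconds': 1,
}

# Short aliases from the list of possible values.
ALIASES = {
    'w': 'weeks', 'd': 'days', 'h': 'hours',
    'm': 'minutes', 't': 'minutes', 's': 'seconds',
    'ms': 'milliseconds', 'l': 'milliseconds',
    'us': 'microseconds', 'u': 'microseconds',
    'ns': 'nanoseconds', 'n': 'nanoseconds',
}

def to_nanoseconds(value, unit='ns'):
    """Convert an integer count of `unit` into nanoseconds."""
    key = unit.lower()  # units are not case-sensitive
    if key in ALIASES:
        name = ALIASES[key]
    else:
        # Accept a prefix of a full name (like 'sec') if it is unambiguous.
        matches = [full for full in NANOS_PER_UNIT if full.startswith(key)]
        if len(matches) != 1:
            raise ValueError(f'unsupported or ambiguous unit: {unit!r}')
        name = matches[0]
    return value * NANOS_PER_UNIT[name]

print(to_nanoseconds(2, 'sec'))  # 2000000000
print(to_nanoseconds(3, 'MS'))   # 3000000
```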
Notes
The .values attribute is always in nanoseconds with int64 dtype.
- property components¶
- property days¶
- is_registered() numpy.bool_ [source]¶
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- property microseconds¶
- property nanoseconds¶
- register(user_defined_name)[source]¶
Register this Timedelta object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the timedelta is to be registered under, this will be the root name for underlying components
- Returns:
The same Timedelta which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different Timedeltas with the same name.
- Return type:
Timedelta
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the timedelta with the user_defined_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- property seconds¶
- special_objType = 'Timedelta'¶
- std(ddof: arkouda.numpy.dtypes.int_scalars = 0)[source]¶
Returns the standard deviation as a pd.Timedelta object
- supported_opeq¶
- supported_with_datetime¶
- supported_with_pdarray¶
- supported_with_r_datetime¶
- supported_with_r_pdarray¶
- supported_with_r_timedelta¶
- supported_with_timedelta¶
- to_pandas()[source]¶
Convert array to a pandas TimedeltaIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.
See also
to_ndarray
- unregister()[source]¶
Unregister this timedelta object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- arkouda.util.arange(*args, **kwargs) arkouda.numpy.pdarrayclass.pdarray [source]¶
arange([start,] stop[, stride,] dtype=int64)
Create a pdarray of consecutive integers within the interval [start, stop). If only one arg is given then arg is the stop parameter. If two args are given, then the first arg is start and second is stop. If three args are given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
- Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stop (int_scalars) – Stopping value (exclusive)
stride (int_scalars, optional) – The difference between consecutive elements, the default stride is 1, if stride is specified then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
- Returns:
Integers from start (inclusive) to stop (exclusive) by stride
- Return type:
pdarray, dtype
- Raises:
TypeError – Raised if start, stop, or stride is not an int object
ZeroDivisionError – Raised if stride == 0
See also
linspace, zeros, ones, randint
Notes
Negative strides result in decreasing values. Currently, only int64 pdarrays can be created with this method. For float64 arrays, use the linspace method.
Examples
>>> ak.arange(0, 5, 1)
array([0 1 2 3 4])
>>> ak.arange(5, 0, -1)
array([5 4 3 2 1])
>>> ak.arange(0, 10, 2)
array([0 2 4 6 8])
>>> ak.arange(-5, -10, -1)
array([-5 -6 -7 -8 -9])
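The argument handling described above mirrors Python's built-in range, which makes a quick local model possible. This is only a sketch: arange_model is a hypothetical name, it returns a plain list rather than a pdarray, and the explicit ZeroDivisionError is added to match the Raises section (range itself raises ValueError for a zero step).

```python
def arange_model(*args):
    """Pure-Python model of ak.arange([start,] stop[, stride])."""
    if len(args) == 3 and args[2] == 0:
        # Match the documented error type for a zero stride.
        raise ZeroDivisionError('stride cannot be zero')
    return list(range(*args))

print(arange_model(5))            # [0, 1, 2, 3, 4]
print(arange_model(5, 0, -1))     # [5, 4, 3, 2, 1]
print(arange_model(-5, -10, -1))  # [-5, -6, -7, -8, -9]
```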
- arkouda.util.attach_all(names: list)[source]¶
Attach to all objects registered with the names provided
- Parameters:
names (list) – List of names to attach to
- Return type:
dict
- arkouda.util.broadcast(segments: pdarray, values: pdarray | Strings, size: int | np.int64 | np.uint64 = -1, permutation: pdarray | None = None)[source]¶
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
- Parameters:
segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.
values (pdarray, Strings) – The values to broadcast, one per row (or group)
size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.
permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.
- Returns:
The broadcast values, one per nonzero
- Return type:
- Raises:
ValueError –
If segments and values are different sizes
If segments are empty
If number of nonzeros (either user-specified or inferred from permutation) is less than one
Examples
>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
>>> # Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
>>> # If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
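The interplay of segments, size and permutation is easier to follow in a small client-side model. The following sketch uses plain lists and is not how the server computes the result; broadcast_model is a hypothetical name, and the error checks simply mirror the Raises section above.

```python
def broadcast_model(segments, values, size=-1, permutation=None):
    """Pure-Python model of ak.broadcast on lists."""
    if len(segments) != len(values):
        raise ValueError('segments and values must be the same size')
    if len(segments) == 0:
        raise ValueError('segments cannot be empty')
    if permutation is not None:
        size = len(permutation)  # size is inferred from the permutation
    if size < 1:
        raise ValueError('the number of nonzeros must be at least one')

    # Each segment ends where the next begins; the last one ends at size.
    ends = list(segments[1:]) + [size]
    grouped = []
    for start, end, value in zip(segments, ends, values):
        grouped.extend([value] * (end - start))
    if permutation is None:
        return grouped

    # Invert the grouping permutation: grouped position i holds the
    # element that sat at original position permutation[i].
    original = [None] * size
    for i, p in enumerate(permutation):
        original[p] = grouped[i]
    return original

print(broadcast_model([0, 2, 5], [0, 1, 2], 7))
# [0, 0, 1, 1, 1, 2, 2]
print(broadcast_model([0, 2, 5], [0, 1, 2], permutation=[6, 5, 4, 3, 2, 1, 0]))
# [2, 2, 1, 1, 1, 0, 0]
```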
- arkouda.util.broadcast_dims(sa: Sequence[int], sb: Sequence[int]) Tuple[int, Ellipsis] [source]¶
Determine the shape of the broadcast pdarray given two array shapes, following the Array API broadcasting algorithm.
See: https://data-apis.org/array-api/latest/API_specification/broadcasting.html#algorithm
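For reference, the algorithm from the linked Array API specification can be written in a few lines of Python. This is a sketch of that published algorithm rather than a copy of Arkouda's source; the name broadcast_dims_model and the choice of ValueError for incompatible shapes are assumptions.

```python
def broadcast_dims_model(sa, sb):
    """Compute the broadcast shape of two shapes, aligning from the right."""
    n = max(len(sa), len(sb))
    shape = [0] * n
    for i in range(n - 1, -1, -1):
        ia = len(sa) - n + i
        ib = len(sb) - n + i
        # Missing leading dimensions are treated as size 1.
        da = sa[ia] if ia >= 0 else 1
        db = sb[ib] if ib >= 0 else 1
        if da == 1:
            shape[i] = db
        elif db == 1 or da == db:
            shape[i] = da
        else:
            raise ValueError(f'shapes {tuple(sa)} and {tuple(sb)} are not broadcastable')
    return tuple(shape)

print(broadcast_dims_model((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
print(broadcast_dims_model((5, 4), (1,)))             # (5, 4)
```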
- arkouda.util.coargsort(arrays: Sequence[arkouda.numpy.strings.Strings | arkouda.numpy.pdarrayclass.pdarray | arkouda.categorical.Categorical], algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD) arkouda.numpy.pdarrayclass.pdarray [source]¶
Return the permutation that groups the rows (left-to-right), if the input arrays are treated as columns. The permutation sorts numeric columns, but not strings/Categoricals – strings/Categoricals are grouped, but not ordered.
- Parameters:
arrays (Sequence of Strings, pdarray, or Categorical) – The columns (int64, uint64, float64, Strings, or Categorical) to sort by row
algorithm (SortingAlgorithm, default=SortingAlgorithm.RadixSortLSD) – The algorithm to be used for sorting the arrays.
- Returns:
The indices that permute the rows to grouped order
- Return type:
pdarray of int64
- Raises:
ValueError – Raised if the pdarrays are not of the same size or if the parameter is not an Iterable containing pdarrays, Strings, or Categoricals
See also
argsort
Notes
Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive. Starts with the last array and moves forward. This sort operates directly on numeric types, but for Strings, it operates on a hash. Thus, while grouping of equivalent strings is guaranteed, lexicographic ordering of the groups is not. For Categoricals, coargsort sorts based on Categorical.codes which guarantees grouping of equivalent categories but not lexicographic ordering of those groups.
Examples
>>> a = ak.array([0, 1, 0, 1])
>>> b = ak.array([1, 1, 0, 0])
>>> perm = ak.coargsort([a, b])
>>> perm
array([2 0 3 1])
>>> a[perm]
array([0 0 1 1])
>>> b[perm]
array([0 1 0 1])
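For numeric columns, the permutation returned by coargsort is the one a stable row-wise sort would produce. The sketch below models that with Python's sorted on tuples. coargsort_model is a hypothetical name, and the model orders every column lexicographically, which is stronger than what coargsort guarantees for Strings and Categoricals (grouping only).

```python
def coargsort_model(arrays):
    """Pure-Python model of ak.coargsort for numeric columns."""
    size = len(arrays[0])
    if any(len(col) != size for col in arrays):
        raise ValueError('all columns must have the same size')
    # Sort row indices by the tuple of column values, leftmost column first.
    return sorted(range(size), key=lambda i: tuple(col[i] for col in arrays))

a = [0, 1, 0, 1]
b = [1, 1, 0, 0]
perm = coargsort_model([a, b])
print(perm)                   # [2, 0, 3, 1]
print([a[i] for i in perm])   # [0, 0, 1, 1]
print([b[i] for i in perm])   # [0, 1, 0, 1]
```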
- arkouda.util.convert_bytes(nbytes: arkouda.numpy.dtypes.int_scalars, unit: Literal['B', 'KB', 'MB', 'GB'] = 'B') arkouda.numpy.dtypes.numeric_scalars [source]¶
Convert the number of bytes to KB, MB, or GB.
- Parameters:
nbytes (int_scalars) – The number of bytes to convert
unit ({"B", "KB", "MB", "GB"}, default="B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.
- Return type:
int
- arkouda.util.convert_if_categorical(values)[source]¶
Convert a Categorical array to Strings for display
- arkouda.util.create_pdarray(repMsg: str, max_bits=None) pdarray [source]¶
Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.
- Parameters:
repMsg (str) – space-delimited string containing the pdarray name, datatype, size, dimension, shape, and itemsize
- Returns:
A pdarray whose attributes are taken from repMsg and whose data resides on the arkouda server
- Return type:
pdarray
- Raises:
ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance
RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance
- arkouda.util.dtype(dtype)[source]¶
Create a data type object.
- Parameters:
dtype (object) – Object to be converted to a data type object.
- Return type:
type
- arkouda.util.expand(size: int | numpy.int64 | numpy.uint64, segs: arkouda.numpy.pdarrayclass.pdarray, vals: arkouda.numpy.pdarrayclass.pdarray) arkouda.numpy.pdarrayclass.pdarray [source]¶
Expand an array with values placed into the indicated segments.
- Parameters:
- Returns:
The expanded array.
- Return type:
pdarray
Notes
This function (with different order of arguments) is now in arkouda proper as ak.broadcast. It is retained here for backwards compatibility.
- arkouda.util.float_scalars¶
- arkouda.util.int_scalars¶
- arkouda.util.invert_permutation(perm: arkouda.numpy.pdarrayclass.pdarray) arkouda.numpy.pdarrayclass.pdarray [source]¶
Find the inverse of a permutation array.
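As a reminder of what the inverse means here: if perm sends position i to perm[i], the inverse sends perm[i] back to i. A minimal list-based sketch follows (invert_permutation_model is a hypothetical name, not the Arkouda implementation):

```python
def invert_permutation_model(perm):
    """Return inv such that inv[perm[i]] == i for every position i."""
    inverse = [0] * len(perm)
    for position, target in enumerate(perm):
        inverse[target] = position
    return inverse

perm = [2, 0, 3, 1]
inv = invert_permutation_model(perm)
print(inv)                      # [1, 3, 0, 2]
print([perm[i] for i in inv])   # [0, 1, 2, 3]
```

Indexing the permutation by its inverse yields the identity, which is a convenient check.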
- arkouda.util.is_float(arry: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical | arkouda.series.Series | arkouda.index.Index) bool [source]¶
Check if the dtype of the given array is float.
- Parameters:
arry (pdarray, Strings, Categorical, Series, or Index) – The input pdarray, Strings, Categorical, Series, or Index object.
- Returns:
True if the dtype of arry is of type float, False otherwise.
- Return type:
bool
Examples
>>> data = ak.array([1.0, 2, 3, 4, np.nan])
>>> ak.util.is_float(data)
True
>>> data2 = ak.arange(5)
>>> ak.util.is_float(data2)
False
- arkouda.util.is_int(arry: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical | arkouda.series.Series | arkouda.index.Index) bool [source]¶
Check if the dtype of the given array is int.
- Parameters:
arry (pdarray, Strings, Categorical, Series, or Index) – The input pdarray, Strings, Categorical, Series, or Index object.
- Returns:
True if the dtype of arry is of type int, False otherwise.
- Return type:
bool
Examples
>>> data = ak.array([1.0, 2, 3, 4, np.nan])
>>> ak.util.is_int(data)
False
>>> data2 = ak.arange(5)
>>> ak.util.is_int(data2)
True
- arkouda.util.is_numeric(arry: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical | arkouda.series.Series | arkouda.index.Index) bool [source]¶
Check if the dtype of the given array is numeric.
- Parameters:
arry (pdarray, Strings, Categorical, Series, or Index) – The input pdarray, Strings, Categorical, Series, or Index object.
- Returns:
True if the dtype of arry is numeric, False otherwise.
- Return type:
bool
Examples
>>> data = ak.array([1, 2, 3, 4, 5])
>>> ak.util.is_numeric(data)
True
>>> strings = ak.array(["a", "b", "c"])
>>> ak.util.is_numeric(strings)
False
- arkouda.util.is_registered(name: str, as_component: bool = False) bool [source]¶
Determine if the name provided is associated with a registered Object
- Parameters:
name (str) – The name to check for in the registry
as_component (bool, default=False) – When True, the name will be checked to determine if it is registered as a component of a registered object
- Return type:
bool
- arkouda.util.list_registry(detailed: bool = False)[source]¶
Return the names of all registered objects
- Parameters:
detailed (bool, default=False) – If True, return details of registry objects. Currently includes the object type for each object.
- Returns:
Dict containing keys “Components” and “Objects”.
- Return type:
dict
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- arkouda.util.map(values: arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings | arkouda.categorical.Categorical, mapping: dict | arkouda.Series) arkouda.numpy.pdarrayclass.pdarray | arkouda.numpy.strings.Strings [source]¶
Map values of an array according to an input mapping.
- Parameters:
values (pdarray, Strings, or Categorical) – The values to be mapped.
mapping (dict or Series) – The mapping correspondence.
- Returns:
A new array with the values mapped by the mapping correspondence. When the input Series has Categorical values, the return Series will have Strings values. Otherwise, the return type will match the input type.
- Return type:
arkouda.numpy.pdarrayclass.pdarray or arkouda.numpy.strings.Strings
- Raises:
TypeError – Raised if mapping is not of type dict or arkouda.Series. Raised if values is not of type pdarray, Categorical, or Strings.
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.numpy.util import map
>>> a = ak.array([2, 3, 2, 3, 4])
>>> a
array([2 3 2 3 4])
>>> ak.util.map(a, {4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})
array([30.00000000000000000 5.00000000000000000 30.00000000000000000 5.00000000000000000 25.00000000000000000])
>>> s = ak.Series(ak.array(["a","b","c","d"]), index = ak.array([4,2,1,3]))
>>> ak.util.map(a, s)
array(['b', 'd', 'b', 'd', 'a'])
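Conceptually, map performs an element-wise lookup of each value in the mapping. The sketch below models that with a dict and plain lists; map_model is a hypothetical name, and a Series mapping is represented here by a dict from index to value. Values missing from the mapping raise KeyError in this model, which is not a statement about how Arkouda treats them.

```python
def map_model(values, mapping):
    """Pure-Python model of ak.util.map using a dict lookup per element."""
    return [mapping[v] for v in values]

a = [2, 3, 2, 3, 4]
print(map_model(a, {4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0}))
# [30.0, 5.0, 30.0, 5.0, 25.0]

# A Series with values a..d and index 4, 2, 1, 3, written as index -> value.
series_like = {4: 'a', 2: 'b', 1: 'c', 3: 'd'}
print(map_model(a, series_like))
# ['b', 'd', 'b', 'd', 'a']
```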
- arkouda.util.numeric_scalars¶
- class arkouda.util.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]¶
The basic arkouda array class. This class contains only the attributes of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.
- name¶
The server-side identifier for the array
- Type:
str
- dtype¶
The element type of the array
- Type:
dtype
- size¶
The number of elements in the array
- Type:
int_scalars
- ndim¶
The rank of the array (currently only rank 1 arrays supported)
- Type:
int_scalars
- shape¶
A list or tuple containing the sizes of each dimension of the array
- Type:
Sequence[int]
- itemsize¶
The size in bytes of each element
- Type:
int_scalars
- BinOps¶
- OpEqOps¶
- all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray [source]¶
Return True iff all elements of the array evaluate to True.
- any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray [source]¶
Return True iff any element of the array evaluates to True.
- argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray [source]¶
Return the index of the first occurrence of the array max value.
- argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray [source]¶
Finds the indices corresponding to the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
Indices corresponding to the maximum k values, sorted
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
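As a rough client-side picture of what argmaxk returns, the sketch below works on a plain Python list. The helper name argmaxk_sketch is hypothetical, and the ordering of the result (ascending by value) is an assumption; the text above only says the indices are sorted.

```python
# Pure-Python sketch, not the server-side implementation.
def argmaxk_sketch(values, k):
    """Return the indices of the k largest values, ordered by ascending value."""
    # Assumption: ties and ordering follow a stable ascending sort by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    start = max(len(order) - k, 0)
    return order[start:]


A = [10, 5, 1, 3, 7, 2, 9, 0]
top3 = argmaxk_sketch(A, 3)
print(top3)                    # indices of the values 7, 9 and 10
print([A[i] for i in top3])
```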
- argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray [source]¶
Return the index of the first occurrence of the array min value.
- argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray [source]¶
Find the indices corresponding to the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
Indices corresponding to the minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- astype(dtype) pdarray [source]¶
Cast values of pdarray to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- static attach(user_defined_name: str) pdarray [source]¶
Static method that returns a pdarray attached to the registered name in the arkouda server, which was registered using register().
- Parameters:
user_defined_name (str) – user defined name which array was registered under
- Returns:
pdarray which is bound to the corresponding server side component which was registered with user_defined_name
- Return type:
pdarray
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
register, unregister, is_registered, unregister_pdarray_by_name, list_registry
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = ak.zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- bigint_to_uint_arrays() List[pdarray] [source]¶
Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Returns:
A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.
- Return type:
List[pdarray]
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
pdarraycreation.bigint_from_uint_arrays
Examples
>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618" "18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
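The split into 64-bit pieces can be reproduced with ordinary Python integers. The sketch below is only an illustration of the arithmetic; the helper name to_uint64_limbs and its num_limbs argument are hypothetical and not part of arkouda.

```python
# Pure-Python sketch of splitting big integers into 64-bit limbs,
# highest limb first, matching the ordering described above.
def to_uint64_limbs(values, num_limbs):
    """Return num_limbs lists; list 0 holds the highest 64 bits of each value."""
    mask = (1 << 64) - 1
    limbs = []
    for j in reversed(range(num_limbs)):
        limbs.append([(v >> (64 * j)) & mask for v in values])
    return limbs


a = [2**64 + i for i in range(5)]
high, low = to_uint64_limbs(a, 2)
print(high)
print(low)
```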
- corr(y: pdarray) numpy.float64 [source]¶
Compute the correlation between self and y using the Pearson correlation coefficient.
- Parameters:
y (pdarray) – Other pdarray used to calculate correlation
- Returns:
The scalar correlation of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
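For reference, the Pearson correlation coefficient named in corr can be written out on plain Python lists as follows. This is a textbook formulation, not arkouda's server code; the helper name pearson is hypothetical, and inputs with zero variance are not handled.

```python
import math


# Textbook Pearson correlation on two equal-length sequences.
def pearson(x, y):
    """Return the Pearson correlation coefficient of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Unnormalised standard deviations; the 1/n factors cancel in the ratio.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3], [3, 2, 1])
print(r_up, r_down)
```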
- cov(y: pdarray) numpy.float64 [source]¶
Compute the covariance between self and y.
- Parameters:
y (pdarray) – Other pdarray used to calculate covariance
- Returns:
The scalar covariance of the two arrays
- Return type:
np.float64
- Raises:
TypeError – Raised if y is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- dtype¶
- equals(other) arkouda.numpy.dtypes.bool_scalars [source]¶
Whether pdarrays are the same size and all entries are equal.
- Parameters:
other (object) – object to compare.
- Returns:
True if the pdarrays are the same, otherwise False.
- Return type:
bool
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5])
>>> a.equals(a2)
False
- fill(value: arkouda.numpy.dtypes.numeric_scalars) None [source]¶
Fill the array (in place) with a constant value.
- Parameters:
value (numeric_scalars)
- Raises:
TypeError – Raised if value is not an int, int64, float, or float64
- flatten()[source]¶
Return a copy of the array collapsed into one dimension.
- Return type:
A copy of the input array, flattened to one dimension.
- format_other(other) str [source]¶
Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.
- Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
- Return type:
string representation of np.dtype corresponding to the other parameter
- Raises:
TypeError – Raised if the other parameter cannot be converted to Numpy dtype
- property inferred_type: str | None¶
Return a string of the type inferred from the values.
- info() str [source]¶
Returns a JSON formatted string containing information about all components of self
- Parameters:
None
- Returns:
JSON string containing information about all components of self
- Return type:
str
- is_registered() numpy.bool_ [source]¶
Return True iff the object is contained in the registry
- Parameters:
None
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
Note
This will return True if the object is registered itself or as a component of another object
- is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray [source]¶
Return True iff the array is monotonically non-decreasing.
- Parameters:
None
- Returns:
Indicates if the array is monotonically non-decreasing
- Return type:
bool
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- itemsize¶
- max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray [source]¶
Return the maximum value of the array.
- property max_bits¶
- maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray [source]¶
Compute the maximum “k” values.
- Parameters:
k (int_scalars) – The desired count of maximum values to be returned by the output.
- Returns:
The maximum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray [source]¶
Return the minimum value of the array.
- mink(k: arkouda.numpy.dtypes.int_scalars) pdarray [source]¶
Compute the minimum “k” values.
- Parameters:
k (int_scalars) – The desired count of minimum values to be returned by the output.
- Returns:
The minimum k values from pda
- Return type:
pdarray, int
- Raises:
TypeError – Raised if pda is not a pdarray
- name¶
- property nbytes¶
The size of the pdarray in bytes.
- Returns:
The size of the pdarray in bytes.
- Return type:
int
- ndim¶
- objType = 'pdarray'¶
- popcount() pdarray [source]¶
Find the population (number of bits set) in each element. See ak.popcount.
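Population count is simply the number of 1 bits in each element. A minimal client-side sketch on non-negative Python integers is shown below; the helper name popcount_sketch is hypothetical, and negative inputs are deliberately left out.

```python
# Pure-Python sketch: count the set bits of each non-negative integer.
def popcount_sketch(values):
    """Return the number of 1 bits in each element of values."""
    return [bin(v).count("1") for v in values]


counts = popcount_sketch([0, 1, 2, 3, 255])
print(counts)
```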
- pretty_print_info() None [source]¶
Prints information about all components of self in a human readable format
- Parameters:
None
- Return type:
None
- prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray [source]¶
Return the product of all elements in the array. Return value is always a np.float64 or np.int64.
- register(user_defined_name: str) pdarray [source]¶
Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach(). This is an in-place operation; registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.
- Parameters:
user_defined_name (str) – user defined name array is to be registered under
- Returns:
The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different pdarrays with the same name.
- Return type:
pdarray
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – Raised if the server was unable to register the pdarray with the user_defined_name. If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.
See also
attach, unregister, is_registered, list_registry, unregister_pdarray_by_name
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = ak.zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- registered_name: str | None = None¶
- save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str [source]¶
DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string
See also
save_all, load, read, to_parquet, to_hdf
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales - 1. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- property shape¶
Return the shape of an array.
- Returns:
The elements of the shape tuple give the lengths of the corresponding array dimensions.
- Return type:
tuple of int
- size¶
- slice_bits(low, high) pdarray [source]¶
Returns a pdarray containing only bits from low to high of self.
This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)
- Parameters:
low (int) – The lowest bit included in the slice (inclusive). Bits are zero indexed, so the first bit is 0.
high (int) – The highest bit included in the slice (inclusive)
- Returns:
A new pdarray containing the bits of self from low to high
- Return type:
pdarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
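The inclusive, zero-indexed slicing rule is equivalent to a shift followed by a mask. The sketch below applies it to a single Python integer; the helper name slice_bits_sketch is hypothetical and it is not the server-side implementation.

```python
# Pure-Python sketch of extracting bits low..high (both inclusive, zero indexed).
def slice_bits_sketch(value, low, high):
    """Return the integer formed by bits low through high of value."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


p = 2**65 + (2**64 - 1)
top_two = slice_bits_sketch(p, 64, 65)    # bit 65 is set, bit 64 is clear
bottom_64 = slice_bits_sketch(p, 0, 63)   # the low 64 bits are all ones
print(bin(top_two))
print(bottom_64 == 2**64 - 1)
```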
- std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64 [source]¶
Compute the standard deviation. See arkouda.std for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
- Returns:
The scalar standard deviation of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray [source]¶
Return the sum of all elements in the array.
- to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]¶
Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the pdarray under. Defaults to “array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (`\n`) at this time.
- to_cuda()[source]¶
Convert the array to a Numba DeviceNDArray, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
- Return type:
numba.DeviceNDArray
- Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
array
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
- to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str [source]¶
Save the pdarray to HDF5. The object can be saved to a collection of files or a single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales - 1 for file_type=’distribute’. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array in to single hdf5 file on the root node. ``cwd/path/name_prefix.hdf5``
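To make the distributed naming pattern concrete, the sketch below builds the expected file names on the client. It is an illustration under assumptions: the helper name locale_filenames is hypothetical, the four-digit zero padding is inferred from the `####` placeholder, and the placement of the suffix before the extension is inferred from the examples above.

```python
import os


# Pure-Python sketch of the <prefix>_LOCALE#### naming pattern.
def locale_filenames(prefix_path, num_locales):
    """Return one expected file name per locale for a distributed write."""
    # Keep any extension the caller supplied and insert the suffix before it.
    root, ext = os.path.splitext(prefix_path)
    # Assumption: the locale number is zero padded to four digits.
    return [f"{root}_LOCALE{i:04d}{ext}" for i in range(num_locales)]


with_ext = locale_filenames("path/prefix.h5", 2)
without_ext = locale_filenames("path/prefix", 2)
print(with_ext)
print(without_ext)
```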
- to_list() List [source]¶
Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A list with the same data as the pdarray
- Return type:
list
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
- to_ndarray() numpy.ndarray [source]¶
Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same attributes and data as the pdarray
- Return type:
np.ndarray
- Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.
See also
array, to_list
Examples
>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
- to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str [source]¶
Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales - 1 for file_type=’distribute’.
‘append’ write mode is supported, but is not efficient.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
- transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]¶
Sends a pdarray to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, so it will use ports in the range {port..(port+numLocales-1)} (e.g., when running an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
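When planning firewall rules for transfer, it can help to list the ports that will be used. The sketch below follows the 4-node example in the port description (1234 through 1237), which implies one port per locale starting at port; the helper name transfer_ports is hypothetical.

```python
# Pure-Python sketch: one port per locale, starting at the requested port.
def transfer_ports(port, num_locales):
    """Return the ports a transfer is expected to use, per the example above."""
    return list(range(port, port + num_locales))


ports = transfer_ports(1234, 4)
print(ports)
```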
- unregister() None [source]¶
Unregister a pdarray in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
See also
register, unregister, is_registered, unregister_pdarray_by_name, list_registry
Notes
Registered names/pdarrays in the server are immune to deletion until they are unregistered.
Examples
>>> a = ak.zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
- update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]¶
Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist, it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
- value_counts()[source]¶
Count the occurrences of the unique values of self.
- Returns:
unique_values (pdarray) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
Examples
>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
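The same pair of outputs (sorted unique values plus their counts) can be reproduced on the client with the standard library. This is only a sketch of the semantics; the helper name value_counts_sketch is hypothetical.

```python
from collections import Counter


# Pure-Python sketch: unique values in ascending order with their counts.
def value_counts_sketch(values):
    """Return (unique_values, counts) with unique_values sorted ascending."""
    tally = Counter(values)
    unique_values = sorted(tally)
    counts = [tally[u] for u in unique_values]
    return unique_values, counts


unique_values, counts = value_counts_sketch([2, 0, 2, 4, 0, 0])
print(unique_values, counts)
```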
- var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64 [source]¶
Compute the variance. See arkouda.var for details.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
- Returns:
The scalar variance of the array
- Return type:
np.float64
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
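The role of ddof is easiest to see in the formula itself: the sum of squared deviations is divided by N - ddof. The sketch below uses that standard definition on a plain list and mirrors the ValueError condition listed above; the helper name var_sketch is hypothetical and this is not the server-side code.

```python
# Pure-Python sketch of variance with "Delta Degrees of Freedom".
def var_sketch(values, ddof=0):
    """Return sum((x - mean)**2) / (N - ddof)."""
    n = len(values)
    if ddof >= n:
        # Mirrors the documented condition: ddof must be smaller than the size.
        raise ValueError("ddof must be less than the number of elements")
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


population_var = var_sketch([1, 2, 3, 4])        # divides by N
sample_var = var_sketch([1, 2, 3, 4], ddof=1)    # divides by N - 1
print(population_var, sample_var)
```

The standard deviation under the same convention is the square root of this value.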
- arkouda.util.register(obj, name)[source]¶
Register an arkouda object with a user-specified name. Backwards compatible with earlier arkouda versions.
- arkouda.util.register_all(data: dict)[source]¶
Register all objects in the provided dictionary
- Parameters:
data (dict) – Maps the name to register the object under to the object itself. For example, {“MyArray”: ak.array([0, 1, 2])}
- Return type:
None
- arkouda.util.sparse_sum_help(idx1: arkouda.numpy.pdarrayclass.pdarray, idx2: arkouda.numpy.pdarrayclass.pdarray, val1: arkouda.numpy.pdarrayclass.pdarray, val2: arkouda.numpy.pdarrayclass.pdarray, merge: bool = True, percent_transfer_limit: int = 100) Tuple[arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray] [source]¶
Helper for summing two sparse matrices together
Return is equivalent to ak.GroupBy(ak.concatenate([idx1, idx2])).sum(ak.concatenate((val1, val2)))
- Parameters:
idx1 (pdarray) – indices for the first sparse matrix
idx2 (pdarray) – indices for the second sparse matrix
val1 (pdarray) – values for the first sparse matrix
val2 (pdarray) – values for the second sparse matrix
merge (bool, default=True) – If true the indices are combined using a merge based workflow, otherwise they are combined using a sort based workflow.
percent_transfer_limit (int, default=100) – Only used when merge is true. This is the maximum percentage of the data allowed to be moved between locales during the merge workflow. If we would exceed this percentage, we fall back to using the sort based workflow.
- Returns:
indices and values for the summed sparse matrix
- Return type:
Tuple[pdarray, pdarray]
Examples
>>> idx1 = ak.array([0, 1, 3, 4, 7, 9])
>>> idx2 = ak.array([0, 1, 3, 6, 9])
>>> vals1 = idx1
>>> vals2 = ak.array([10, 11, 13, 16, 19])
>>> ak.util.sparse_sum_help(idx1, idx2, vals1, vals2)
(array([0 1 3 4 6 7 9]), array([10 12 16 4 16 7 28]))
>>> ak.GroupBy(ak.concatenate([idx1, idx2])).sum(ak.concatenate((vals1, vals2)))
(array([0 1 3 4 6 7 9]), array([10 12 16 4 16 7 28]))
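The result can be checked by hand with a dictionary that accumulates values per index. The sketch below is a client-side illustration of the semantics only; it uses neither the merge nor the sort based workflow, and the helper name sparse_sum_sketch is hypothetical.

```python
# Pure-Python sketch: sum the values of two sparse vectors index by index.
def sparse_sum_sketch(idx1, idx2, val1, val2):
    """Return (sorted indices, summed values) for the combined inputs."""
    totals = {}
    for i, v in zip(list(idx1) + list(idx2), list(val1) + list(val2)):
        totals[i] = totals.get(i, 0) + v
    indices = sorted(totals)
    return indices, [totals[i] for i in indices]


idx1 = [0, 1, 3, 4, 7, 9]
idx2 = [0, 1, 3, 6, 9]
vals1 = idx1
vals2 = [10, 11, 13, 16, 19]
indices, sums = sparse_sum_sketch(idx1, idx2, vals1, vals2)
print(indices, sums)
```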
- arkouda.util.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, pdarray, pdarray, int] [source]¶
Find the unique elements of an array.
Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.
- Parameters:
pda ((list of) pdarray, Strings, or Categorical) – Input array.
return_groups (bool, optional) – If True, also return grouping information for the array.
assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step
return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups
- Returns:
unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.
permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)
segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)
- Raises:
TypeError – Raised if pda is not a pdarray or Strings object
RuntimeError – Raised if the pdarray or Strings dtype is unsupported
Notes
For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.
Examples
>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
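The optional grouping outputs (permutation and segments) are easier to read with a worked case. The sketch below derives them with a stable sort on a plain list; the helper name unique_with_groups is hypothetical, and the exact permutation arkouda returns may differ from this one while still grouping equal values together.

```python
# Pure-Python sketch of unique values plus grouping information.
def unique_with_groups(values):
    """Return (unique, permutation, segments) for a list of sortable values."""
    # Stable sort of positions by value: equal values end up adjacent.
    permutation = sorted(range(len(values)), key=lambda i: values[i])
    unique, segments = [], []
    for pos, i in enumerate(permutation):
        if not unique or values[i] != unique[-1]:
            unique.append(values[i])
            segments.append(pos)  # offset where this group starts
    return unique, permutation, segments


A = [3, 2, 1, 1, 2, 3]
unique, permutation, segments = unique_with_groups(A)
print(unique)
print(permutation)
print(segments)
```

Applying permutation to A yields [1, 1, 2, 2, 3, 3], and each entry of segments marks where a new group begins in that permuted array.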