SegArrays in Arkouda¶
In NumPy, arrays containing variable-length sub-arrays are supported as an array containing a single column. Each column contains another ndarray
of some length. Depending on the chosen approach in NumPy, this can result in a loss of functioanlity. In Arkouda, the array containing variable-length sub-arrays is its own class: SegArray
In order to efficiently store arrays with varying row and column dimensions, Arkouda uses a “segmented array” data strucuture:
segments
: Anint64
array containing the start index of each sub-array within the flattened values arrayvalues
: The flattened values of all sub-arrays
Performance¶
SegArray
objects are currently processed entire on the Arkouda client side. The data structure is reflective of the data structure that will be used for Arkouda server side processing.
Iteration¶
Because SegArray
is currently processing entirely on the Arkouda client side, iteration is natively supported. Thus, for row in segarr
with iterate over each sub-array. Each of these sub-arrays is currently returned as a numpy.ndarray
.
Similar to Strings
, SegArrays
will be moved to process server side. This will remove the ability to natively iterate to discourage transferring all of the objects data to the client. In order to support this moving forward, SegArray
includes a to_ndarray()
function. It is recommended that this function be used for iteration over SegArray
objects, to prevent issues associated with moving processing server side. For more information on the usage of to_ndarray
with SegArray
- arkouda.SegArray.to_ndarray(self)¶
Convert the array into a numpy.ndarray containing sub-arrays
- Returns:
A numpy ndarray with the same sub-arrays (also numpy.ndarray) as this array
- Return type:
np.ndarray
See also
array
,to_list
Examples
>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12)) >>> segarr.to_ndarray() array([array([1, 2, 3, 4]), array([5, 6, 7]), array([8, 9, 10, 11, 12])]) >>> type(segarr.to_ndarray()) numpy.ndarray
Operation¶
Arkouda SegArray
objects support the following operations:
Indexing with integer, slice, integer
pdarray
, and booleanpdarray
(see Indexing and Assignment)Comparison (==) Provides an Arkouda
pdarray
containingbool
values indicating the equality of each sub-array in theSegArray
.Array Set Operations, e.g.
unique
Concatenation with other
SegArrays
. Horizontal and vertical axis supported.
SegArray Specific Methods¶
Prefix & Suffix¶
- arkouda.SegArray.get_prefixes(self, n, return_origins=True, proper=True)¶
Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a prefix.
- Returns:
prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.
- arkouda.SegArray.get_suffixes(self, n, return_origins=True, proper=True)¶
Return the n-long suffix of each sub-array, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a suffix.
- Returns:
suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.
NGrams¶
- arkouda.SegArray.get_ngrams(self, n, return_origins=True)¶
Return all n-grams from all sub-arrays.
- Parameters:
n (int) – Length of n-gram
return_origins (bool) – If True, return an int64 array indicating which sub-array each returned n-gram came from.
- Returns:
ngrams (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-gram.
origin_indices (pdarray, int) – The index of the sub-array from which the corresponding n-gram originated
Sub-array of Size¶
- arkouda.SegArray.get_length_n(self, n, return_origins=True)¶
Return all sub-arrays of length n, as a list of columns.
- Parameters:
n (int) – Length of sub-arrays to select
return_origins (bool) – Return a logical index indicating which sub-arrays are length n
- Returns:
columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long sub-arrays from the SegArray. The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where sub-array has length n.
Access/Set Specific Elements in Sub-Array¶
- arkouda.SegArray.get_jth(self, j, return_origins=True, compressed=False, default=0)¶
Select the j-th element of each sub-array, where possible.
- Parameters:
j (int) – The index of the value to get from each sub-array. If j is negative, it counts backwards from the end of each sub-array.
return_origins (bool) – If True, return a logical index indicating where j is in bounds
compressed (bool) – If False, return array is same size as self, with default value where j is out of bounds. If True, the return array only contains values where j is in bounds.
default (scalar) – When compressed=False, the value to return when j is out of bounds for the sub-array
- Returns:
val (pdarray) – compressed=False: The j-th value of each sub-array where j is in bounds and the default value where j is out of bounds. compressed=True: The j-th values of only the sub-arrays where j is in bounds
origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.
Notes
If values are Strings, only the compressed format is supported.
- arkouda.SegArray.set_jth(self, i, j, v)¶
Set the j-th element of each sub-array in a subset.
- Parameters:
- Raises:
ValueError – If j is out of bounds in any of the sub-arrays specified by i.
Append & Prepend¶
- arkouda.SegArray.append(self, other, axis=0)¶
Append other to self, either vertically (axis=0, length of resulting SegArray increases), or horizontally (axis=1, each sub-array of other appends to the corresponding sub-array of self).
- Parameters:
other (SegArray) – Array of sub-arrays to append
axis (0 or 1) – Whether to append vertically (0) or horizontally (1). If axis=1, other must be same size as self.
- Returns:
axis=0: New SegArray containing all sub-arrays axis=1: New SegArray of same length, with pairs of sub-arrays concatenated
- Return type:
- arkouda.SegArray.append_single(self, x, prepend=False)¶
Append a single value to each sub-array.
- arkouda.SegArray.prepend_single(self, x)¶
Deduplication¶
- arkouda.SegArray.remove_repeats(self, return_multiplicity=False)¶
Condense sequences of repeated values within a sub-array to a single value.
- Parameters:
return_multiplicity (bool) – If True, also return the number of times each value was repeated.
- Returns:
norepeats (SegArray) – Sub-arrays with runs of repeated values replaced with single value
multiplicity (SegArray) – If return_multiplicity=True, this array contains the number of times each value in the returned SegArray was repeated in the original SegArray.
SegArray SetOps¶
Union¶
- arkouda.SegArray.union(self, other)¶
Computes the union of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d union of the segments of self and other
- Return type:
See also
pdarraysetops.union1d
Examples
>>> a = [1, 2, 3, 1, 4] >>> b = [3, 1, 4, 5] >>> c = [1, 3, 3, 5] >>> d = [2, 2, 4] >>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b)) >>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d)) >>> seg_a.union(seg_b) SegArray([ [1, 2, 3, 4, 5], [1, 2, 3, 4, 5] ])
Intersect¶
- arkouda.SegArray.intersect(self, other)¶
Computes the intersection of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d intersections of the segments of self and other
- Return type:
See also
pdarraysetops.intersect1d
Examples
>>> a = [1, 2, 3, 1, 4] >>> b = [3, 1, 4, 5] >>> c = [1, 3, 3, 5] >>> d = [2, 2, 4] >>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b)) >>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d)) >>> seg_a.intersect(seg_b) SegArray([ [1, 3], [4] ])
Set Difference¶
- arkouda.SegArray.setdiff(self, other)¶
Computes the set difference of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d set difference of the segments of self and other
- Return type:
See also
pdarraysetops.setdiff1d
Examples
>>> a = [1, 2, 3, 1, 4] >>> b = [3, 1, 4, 5] >>> c = [1, 3, 3, 5] >>> d = [2, 2, 4] >>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b)) >>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d)) >>> seg_a.setdiff(seg_b) SegArray([ [2, 4], [1, 3, 5] ])
Symmetric Difference¶
- arkouda.SegArray.setxor(self, other)¶
Computes the symmetric difference of 2 SegArrays.
- Parameters:
other (SegArray) – SegArray to compute against
- Returns:
Segments are the 1d symmetric difference of the segments of self and other
- Return type:
See also
pdarraysetops.setxor1d
Examples
>>> a = [1, 2, 3, 1, 4] >>> b = [3, 1, 4, 5] >>> c = [1, 3, 3, 5] >>> d = [2, 2, 4] >>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b)) >>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d)) >>> seg_a.setxor(seg_b) SegArray([ [2, 4, 5], [1, 3, 5, 2] ])