arkouda.segarray

Attributes

Classes

Functions

segarray(segments, values[, lengths, grouping])

Alias for the from_parts function. Prevents user from needing to call ak.SegArray constructor

Module Contents

arkouda.segarray.LEN_SUFFIX = '_lengths'
arkouda.segarray.SEG_SUFFIX = '_segments'
class arkouda.segarray.SegArray(segments, values, lengths=None, grouping=None)[source]
AND(x=None)[source]
OR(x=None)[source]
XOR(x=None)[source]
aggregate(op, x=None)[source]
all(x=None)[source]
any(x=None)[source]
append(other, axis=0)[source]

Append other to self, either vertically (axis=0, length of resulting SegArray increases), or horizontally (axis=1, each sub-array of other appends to the corresponding sub-array of self).

Parameters:
  • other (SegArray) – Array of sub-arrays to append

  • axis (0 or 1) – Whether to append vertically (0) or horizontally (1). If axis=1, other must be same size as self.

Returns:

  • axis=0: New SegArray containing all sub-arrays

  • axis=1: New SegArray of same length, with pairs of sub-arrays concatenated

Return type:

SegArray
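
The two axes can be modeled on plain Python lists of lists. This is a rough sketch of the semantics described above, not Arkouda code, and it needs no server; `append_rows` is a hypothetical helper name.

```python
def append_rows(a, b, axis=0):
    """Model of append on lists of lists (hypothetical helper, not Arkouda)."""
    if axis == 0:
        # Vertical: the result has len(a) + len(b) sub-arrays.
        return [list(r) for r in a] + [list(r) for r in b]
    if axis == 1:
        # Horizontal: pair up sub-arrays, so both inputs need the same size.
        if len(a) != len(b):
            raise ValueError("axis=1 requires the same number of sub-arrays")
        return [list(x) + list(y) for x, y in zip(a, b)]
    raise ValueError("axis must be 0 or 1")


a = [[1, 2], [], [3]]
b = [[4], [5, 6], []]
vertical = append_rows(a, b, axis=0)
horizontal = append_rows(a, b, axis=1)
print(vertical)    # [[1, 2], [], [3], [4], [5, 6], []]
print(horizontal)  # [[1, 2, 4], [5, 6], [3]]
```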

append_single(x, prepend=False)[source]

Append a single value to each sub-array.

Parameters:

x (pdarray or scalar) – Single value to append to each sub-array

Returns:

Copy of original SegArray with values from x appended to each sub-array

Return type:

SegArray

argmax(x=None)[source]
argmin(x=None)[source]
classmethod attach(user_defined_name)[source]

Using the defined name, attach to a SegArray that has been registered to the Symbol Table

Parameters:

user_defined_name (str) – user defined name which the SegArray object was registered under

Returns:

The resulting SegArray

Return type:

SegArray

Raises:

RuntimeError – Raised if the server could not attach to the SegArray object

classmethod concat(x, axis=0, ordered=True)[source]

Concatenate a sequence of SegArrays

Parameters:
  • x (sequence of SegArray) – The SegArrays to concatenate

  • axis (0 or 1) – Select vertical (0) or horizontal (1) concatenation. If axis=1, all SegArrays must have same size.

  • ordered (bool) – Must be True. This option is present for compatibility only, because unordered concatenation is not yet supported.

Returns:

The input arrays joined into one SegArray

Return type:

SegArray

copy()[source]

Return a deep copy.

dtype
filter(filter, discard_empty: bool = False)[source]

Filter values out of the SegArray object

Parameters:
  • filter (pdarray, list, or value) – The value/s to be filtered out of the SegArray

  • discard_empty (bool) – Defaults to False. When True, empty segments are removed from the return SegArray

Return type:

SegArray
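
A plain-Python model of the filtering behavior may help when reading the parameters above. It is only a sketch: `filter_rows` is a hypothetical helper, and it assumes that `discard_empty=True` drops every empty sub-array, including ones that were empty before filtering.

```python
def filter_rows(rows, unwanted, discard_empty=False):
    """Remove unwanted value(s) from every sub-array (model, not Arkouda)."""
    # Accept either a single value or a collection of values.
    if isinstance(unwanted, (list, tuple, set, frozenset)):
        drop = set(unwanted)
    else:
        drop = {unwanted}
    kept = [[v for v in row if v not in drop] for row in rows]
    if discard_empty:
        # Assumption: all empty sub-arrays are removed, whatever their origin.
        kept = [row for row in kept if row]
    return kept


rows = [[1, 2, 3], [2, 2], [4]]
kept_all = filter_rows(rows, 2)
kept_nonempty = filter_rows(rows, [2, 4], discard_empty=True)
print(kept_all)       # [[1, 3], [], [4]]
print(kept_nonempty)  # [[1, 3]]
```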

classmethod from_multi_array(m)[source]

Construct a SegArray from a list of columns. This essentially transposes the input, resulting in an array of rows.

Parameters:

m (list of pdarray or Strings) – List of columns, the rows of which will form the sub-arrays of the output

Returns:

Array of rows of input

Return type:

SegArray
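
The transposition can be pictured with ordinary lists. The following is a minimal sketch of the idea, with a hypothetical helper name and the assumption that all columns have equal length; it does not call Arkouda.

```python
def rows_from_columns(columns):
    """Turn a list of equal-length columns into a list of rows (model only)."""
    if len({len(c) for c in columns}) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*columns) walks the columns in lockstep, yielding one row at a time.
    return [list(row) for row in zip(*columns)]


cols = [[1, 2, 3], [10, 20, 30]]
rows = rows_from_columns(cols)
print(rows)  # [[1, 10], [2, 20], [3, 30]]
```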

classmethod from_parts(segments, values, lengths=None, grouping=None) SegArray[source]

DEPRECATED Construct a SegArray object from its parts

Parameters:
  • segments (pdarray, int64) – Start index of each sub-array in the flattened values array

  • values (pdarray) – The flattened values of all sub-arrays

  • lengths (pdarray) – The length of each segment

  • grouping (GroupBy) – grouping of segments

Returns:

Data structure representing an array whose elements are variable-length arrays.

Return type:

SegArray

Notes

Keyword args ‘lengths’ and ‘grouping’ are not user-facing. They are used by the attach method.
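
The relationship between segments, values, and lengths can be modeled in plain Python. The sketch below is not Arkouda code and needs no server; `split_segments` and `segment_lengths` are hypothetical helpers that assume segments holds ascending start offsets into values.

```python
def split_segments(segments, values):
    """Rebuild sub-arrays from start offsets and a flat values list."""
    # Each sub-array ends where the next one starts; the last ends at len(values).
    ends = list(segments[1:]) + [len(values)]
    return [list(values[start:end]) for start, end in zip(segments, ends)]


def segment_lengths(segments, values):
    """Length of each sub-array, derived from consecutive start offsets."""
    ends = list(segments[1:]) + [len(values)]
    return [end - start for start, end in zip(segments, ends)]


rows = split_segments([0, 4, 7], list(range(12)))
lengths = segment_lengths([0, 4, 7], list(range(12)))
print(rows)     # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]
print(lengths)  # [4, 3, 5]
```

Repeated offsets yield empty sub-arrays in this model, for example `split_segments([0, 0, 2], [1, 2])` gives `[[], [1, 2], []]`.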

classmethod from_return_msg(rep_msg) SegArray[source]
get_jth(j, return_origins=True, compressed=False, default=0)[source]

Select the j-th element of each sub-array, where possible.

Parameters:
  • j (int) – The index of the value to get from each sub-array. If j is negative, it counts backwards from the end of each sub-array.

  • return_origins (bool) – If True, return a logical index indicating where j is in bounds

  • compressed (bool) – If False, return array is same size as self, with default value where j is out of bounds. If True, the return array only contains values where j is in bounds.

  • default (scalar) – When compressed=False, the value to return when j is out of bounds for the sub-array

Returns:

  • val (pdarray) – compressed=False: The j-th value of each sub-array where j is in bounds and the default value where j is out of bounds. compressed=True: The j-th values of only the sub-arrays where j is in bounds

  • origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.

Notes

If values are Strings, only the compressed format is supported.
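
The interaction between `compressed` and `default` is easier to see in a small model. This sketch works on Python lists of lists rather than a SegArray, and the function is an illustrative stand-in, not the library implementation.

```python
def get_jth(rows, j, compressed=False, default=0):
    """Model of get_jth: returns (values, in-bounds mask)."""
    vals, origins = [], []
    for row in rows:
        # Negative j counts from the end, like Python's own indexing.
        in_bounds = -len(row) <= j < len(row)
        origins.append(in_bounds)
        if in_bounds:
            vals.append(row[j])
        elif not compressed:
            # Uncompressed output keeps one slot per sub-array.
            vals.append(default)
    return vals, origins


rows = [[1, 2, 3], [4], []]
padded, mask = get_jth(rows, 1, default=-1)
packed, _ = get_jth(rows, -1, compressed=True)
print(padded, mask)  # [2, -1, -1] [True, False, False]
print(packed)        # [3, 4]
```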

get_length_n(n, return_origins=True)[source]

Return all sub-arrays of length n, as a list of columns.

Parameters:
  • n (int) – Length of sub-arrays to select

  • return_origins (bool) – Return a logical index indicating which sub-arrays are length n

Returns:

  • columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long sub-arrays from the SegArray. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where sub-array has length n.

get_ngrams(n, return_origins=True)[source]

Return all n-grams from all sub-arrays.

Parameters:
  • n (int) – Length of n-gram

  • return_origins (bool) – If True, return an int64 array indicating which sub-array each returned n-gram came from.

Returns:

  • ngrams (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-gram.

  • origin_indices (pdarray, int) – The index of the sub-array from which the corresponding n-gram originated
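
The column-oriented return value can be unintuitive, so here is a small pure-Python model of it. The helper below is a sketch under the assumption that n-grams are taken from consecutive elements within one sub-array and never span two sub-arrays; it is not Arkouda code.

```python
def get_ngrams(rows, n):
    """Model of get_ngrams: returns (n columns, origin index per n-gram)."""
    columns = [[] for _ in range(n)]
    origins = []
    for idx, row in enumerate(rows):
        # A sub-array of length L holds L - n + 1 n-grams (none if L < n).
        for start in range(len(row) - n + 1):
            for k in range(n):
                columns[k].append(row[start + k])
            origins.append(idx)
    return columns, origins


rows = [[1, 2, 3], [4], [5, 6]]
columns, origins = get_ngrams(rows, 2)
print(columns)  # [[1, 2, 5], [2, 3, 6]]
print(origins)  # [0, 0, 2]
```

Reading across the columns at position i gives the i-th n-gram: here (1, 2), (2, 3), and (5, 6).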

get_prefixes(n, return_origins=True, proper=True)[source]

Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)

Parameters:
  • n (int) – Length of prefix

  • return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a prefix.

Returns:

  • prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-prefix, False otherwise.

get_suffixes(n, return_origins=True, proper=True)[source]

Return the n-long suffix of each sub-array, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a suffix.

Returns:

  • suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.
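
The effect of `proper` on both get_prefixes and get_suffixes can be sketched with one plain-Python helper. `get_affixes` and its `suffix` flag are hypothetical names introduced for illustration only; the model assumes the length thresholds stated above (n+1 when proper, n otherwise).

```python
def get_affixes(rows, n, proper=True, suffix=False):
    """Model of get_prefixes/get_suffixes: returns (n columns, mask)."""
    # Proper affixes must be strictly shorter than the sub-array itself.
    min_len = n + 1 if proper else n
    mask = [len(row) >= min_len for row in rows]
    columns = [[] for _ in range(n)]
    for row, ok in zip(rows, mask):
        if not ok:
            continue
        part = row[len(row) - n:] if suffix else row[:n]
        for k in range(n):
            columns[k].append(part[k])
    return columns, mask


rows = [[1, 2, 3], [4, 5], [6]]
prefixes, pmask = get_affixes(rows, 2)
suffixes, smask = get_affixes(rows, 2, proper=False, suffix=True)
print(prefixes, pmask)  # [[1], [2]] [True, False, False]
print(suffixes, smask)  # [[2, 4], [3, 5]] [True, True, False]
```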

property grouping
hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each segment.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

intersect(other)[source]

Computes the intersection of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d intersections of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.intersect1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.intersect(seg_b)
SegArray([
[1, 3],
[4]
])
is_registered() bool[source]

Checks if the name of the SegArray object is registered in the Symbol Table

Returns:

True if SegArray is registered, False if not

Return type:

bool

classmethod load(prefix_path, dataset='segarray', segment_name='segments', value_name='values')[source]
logger
max(x=None)[source]
mean(x=None)[source]
min(x=None)[source]
property nbytes

The size of the segarray in bytes.

Returns:

The size of the segarray in bytes.

Return type:

int

property non_empty
nunique(x=None)[source]
objType = 'SegArray'
prepend_single(x)[source]
prod(x=None)[source]
classmethod read_hdf(prefix_path, dataset='segarray')[source]

Load a saved SegArray from HDF5. All arguments must match what was supplied to SegArray.save()

Parameters:
  • prefix_path (str) – Directory and filename prefix

  • dataset (str) – Name prefix for saved data within the HDF5 files

Return type:

SegArray

register(user_defined_name)[source]

Register this SegArray object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name which this SegArray object will be registered under

Returns:

The same SegArray, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different SegArrays with the same name.

Return type:

SegArray

Raises:

RegistrationError – Raised if the server could not register the SegArray object

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
remove_repeats(return_multiplicity=False)[source]

Condense sequences of repeated values within a sub-array to a single value.

Parameters:

return_multiplicity (bool) – If True, also return the number of times each value was repeated.

Returns:

  • norepeats (SegArray) – Sub-arrays with runs of repeated values replaced with single value

  • multiplicity (SegArray) – If return_multiplicity=True, this array contains the number of times each value in the returned SegArray was repeated in the original SegArray.
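
Run condensing is essentially per-sub-array run-length encoding. The sketch below models it with `itertools.groupby` on Python lists; it illustrates the described result and is not how Arkouda computes it.

```python
from itertools import groupby


def remove_repeats(rows):
    """Model of remove_repeats: returns (condensed rows, run lengths)."""
    norepeats, multiplicity = [], []
    for row in rows:
        # groupby yields one group per run of consecutive equal values.
        runs = [(value, len(list(group))) for value, group in groupby(row)]
        norepeats.append([value for value, _ in runs])
        multiplicity.append([count for _, count in runs])
    return norepeats, multiplicity


rows = [[1, 1, 2, 2, 2, 1], [], [3]]
condensed, counts = remove_repeats(rows)
print(condensed)  # [[1, 2, 1], [], [3]]
print(counts)     # [[2, 3, 1], [], [1]]
```

Only adjacent repeats are merged, so the trailing 1 in the first sub-array survives as its own run.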

save(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]

DEPRECATED Save the SegArray to HDF5. The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

See also

to_hdf, load

segments
set_jth(i, j, v)[source]

Set the j-th element of each sub-array in a subset.

Parameters:
  • i (pdarray, int) – Indices of sub-arrays to set j-th element

  • j (int) – Index of value to set in each sub-array. If j is negative, it counts backwards from the end of the sub-array.

  • v (pdarray or scalar) – The value(s) to set. If v is a pdarray, it must have same length as i.

Raises:

ValueError – If j is out of bounds in any of the sub-arrays specified by i.
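
The roles of `i`, `j`, and `v` can be illustrated on ordinary nested lists. This is a sketch only: it assumes the update happens in place and that bounds are checked for every selected sub-array before anything is written, neither of which is confirmed by the text above.

```python
def set_jth(rows, i, j, v):
    """Model of set_jth: set element j of each sub-array listed in i."""
    # A scalar v is broadcast to every selected sub-array.
    values = list(v) if isinstance(v, (list, tuple)) else [v] * len(i)
    if len(values) != len(i):
        raise ValueError("v must have the same length as i")
    # Assumption: validate all targets first so a failure changes nothing.
    for idx in i:
        if not -len(rows[idx]) <= j < len(rows[idx]):
            raise ValueError(f"j={j} is out of bounds for sub-array {idx}")
    for idx, value in zip(i, values):
        rows[idx][j] = value


rows = [[1, 2, 3], [4, 5], [6]]
set_jth(rows, [0, 1], -1, [30, 50])
print(rows)  # [[1, 2, 30], [4, 50], [6]]
```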

setdiff(other)[source]

Computes the set difference of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d set difference of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.setdiff1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setdiff(seg_b)
SegArray([
[2, 4],
[1, 3, 5]
])
setxor(other)[source]

Computes the symmetric difference of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d symmetric difference of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.setxor1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setxor(seg_b)
SegArray([
[2, 4, 5],
[1, 3, 5, 2]
])
size
sum(x=None)[source]
to_hdf(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]

Save the SegArray to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Return type:

None

See also

load

to_list()[source]

Convert the segarray into a list containing sub-arrays

Returns:

A list with the same sub-arrays (also list) as this segarray

Return type:

list

See also

to_ndarray

Examples

>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_list()
[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]
>>> type(segarr.to_list())
list
to_ndarray()[source]

Convert the array into a numpy.ndarray containing sub-arrays

Returns:

A numpy ndarray with the same sub-arrays (also numpy.ndarray) as this array

Return type:

np.ndarray

See also

array, to_list

Examples

>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_ndarray()
array([array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9, 10, 11])])
>>> type(segarr.to_ndarray())
numpy.ndarray
to_parquet(prefix_path, dataset='segarray', mode: str = 'truncate', compression: str | None = None)[source]

Save the SegArray object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the object to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files (must not already exist)

  • mode (str) – Deprecated. Parameter kept to maintain functionality of other calls. Only Truncate supported. By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – If write mode is not Truncate.

Notes

  • Append mode for Parquet has been deprecated. It was not implemented for SegArray.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Segmented Array to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Segmented Array is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). Arkouda opens numLocales ports in succession, starting at port, so it uses ports port through port+numLocales-1 (e.g., for an Arkouda server running on 4 nodes with 1234 passed as port, Arkouda uses ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
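
When choosing a value for port, it can help to list which ports a transfer would occupy. The helper below is hypothetical and simply follows the worked example in the port description (one port per locale, starting at port); it does not query an Arkouda server.

```python
def transfer_ports(port, num_locales):
    """Ports a transfer would use, one per locale (model of the example)."""
    return list(range(port, port + num_locales))


ports = transfer_ports(1234, 4)
print(ports)  # [1234, 1235, 1236, 1237]
```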

union(other)[source]

Computes the union of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d union of the segments of self and other

Return type:

SegArray

See also

pdarraysetops.union1d

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.union(seg_b)
SegArray([
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5]
])
unique(x=None)[source]

Return sub-arrays of unique values.

Parameters:

x (pdarray) – The values to unique, per group. By default, the values of this SegArray’s sub-arrays.

Returns:

Same number of sub-arrays as original SegArray, but elements in sub-array are unique and in sorted order.

Return type:

SegArray
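
On plain Python lists the described result amounts to a sorted de-duplication of each sub-array. The helper name below is illustrative, and the sketch only covers the default case where x is not supplied.

```python
def unique_rows(rows):
    """Model of unique: sorted distinct values of every sub-array."""
    return [sorted(set(row)) for row in rows]


rows = [[3, 1, 3, 2], [], [5, 5]]
uniq = unique_rows(rows)
print(uniq)  # [[1, 2, 3], [], [5]]
```

The number of sub-arrays is unchanged; empty sub-arrays stay empty.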

unregister()[source]

Unregister this SegArray object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table

Notes

Objects registered with the server are immune to deletion until they are unregistered.

static unregister_segarray_by_name(user_defined_name)[source]

Using the defined name, remove the registered SegArray object from the Symbol Table

Parameters:

user_defined_name (str) – user defined name which the SegArray object was registered under

Return type:

None

Raises:

RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table

update_hdf(prefix_path: str, dataset: str = 'segarray', repack: bool = True)[source]

Overwrite the dataset with the name provided with this SegArray object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the SegArray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

valsize
values
arkouda.segarray.VAL_SUFFIX = '_values'
arkouda.segarray.segarray(segments: arkouda.pdarrayclass.pdarray, values: arkouda.pdarrayclass.pdarray, lengths=None, grouping=None)[source]

DEPRECATED Alias for the from_parts function. Prevents user from needing to call ak.SegArray constructor.