GroupBy

The groupby-aggregate pattern is the workhorse operation in many data science applications, such as feature extraction and graph construction. It relies on argsort() to group an array of keys and then perform aggregations on other arrays of values.
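
To make the argsort-based pattern concrete, here is a minimal pure-Python sketch (no Arkouda server needed) of how a sort permutation yields group segments that can then drive an aggregation. The helper names `group_by_argsort` and `segmented_sum` are hypothetical and not part of Arkouda; Arkouda performs comparable steps on distributed arrays, and its internals may differ.

```python
def group_by_argsort(keys):
    """Return (permutation, unique_keys, segments) for a list of sortable keys."""
    # argsort: indices that would sort the keys (stable, so ties keep input order)
    permutation = sorted(range(len(keys)), key=lambda i: keys[i])
    unique_keys, segments = [], []
    for pos, idx in enumerate(permutation):
        # A new group starts wherever the sorted key changes
        if pos == 0 or keys[idx] != keys[permutation[pos - 1]]:
            unique_keys.append(keys[idx])
            segments.append(pos)
    return permutation, unique_keys, segments


def segmented_sum(values, permutation, segments):
    """Sum the values of each group, given the permutation and segment starts."""
    grouped = [values[i] for i in permutation]  # values in grouped order
    ends = segments[1:] + [len(grouped)]        # a segment ends where the next begins
    return [sum(grouped[s:e]) for s, e in zip(segments, ends)]


keys = [3, 1, 3, 2, 1]
vals = [10, 20, 30, 40, 50]
perm, uniq, segs = group_by_argsort(keys)
sums = segmented_sum(vals, perm, segs)
print(perm, uniq, segs, sums)  # [1, 4, 3, 0, 2] [1, 2, 3] [0, 2, 3] [70, 40, 40]
```

The three intermediate results correspond to the permutation, unique_keys, and segments attributes described in this reference.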

For example, imagine a dataset with two columns, userID and dayOfWeek. The following groupby-aggregate operation would show how many user IDs were active on each day of the week:

# Note: The GroupBy arg should be the values of the dayOfWeek column
#       and must be an Arkouda-compatible data structure, i.e. a `pdarray`
byDayOfWeek = ak.GroupBy(data['dayOfWeek'])
day, numIDs = byDayOfWeek.aggregate(data['userID'], 'nunique')
class arkouda.GroupBy(keys=None, assume_sorted=False, dropna=True, **kwargs)[source]

Group an array or list of arrays by value.

Usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int_scalars

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True and the groupby keys contain NaN values, the NaN values and their corresponding rows are dropped. Otherwise, the rows corresponding to NaN values are kept. The default is True.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.

AND(values)[source]

Bitwise AND of values in each segment.

Group another array of values and perform a bitwise AND reduction on each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Return type:

Tuple[Union[pdarray, List[Union[pdarray, Strings]]], pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if AND is not supported for the values dtype

OR(values)[source]

Bitwise OR of values in each segment.

Group another array of values and perform a bitwise OR reduction on each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Return type:

Tuple[Union[pdarray, List[Union[pdarray, Strings]]], pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if OR is not supported for the values dtype

XOR(values)[source]

Bitwise XOR of values in each segment.

Group another array of values and perform a bitwise XOR reduction on each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Return type:

Tuple[Union[pdarray, List[Union[pdarray, Strings]]], pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if XOR is not supported for the values dtype
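
The three bitwise reductions (AND, OR, XOR) differ only in the operator folded over each group. The following pure-Python sketch shows that shared shape; `grouped_bitwise` is a hypothetical helper written for illustration, not an Arkouda function, and it returns keys in sorted order as an assumption consistent with the examples in this reference.

```python
from functools import reduce
from operator import and_, or_, xor


def grouped_bitwise(keys, values, op):
    """Fold the integer values of each key with a binary bitwise operator."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)
    return unique_keys, [reduce(op, groups[k]) for k in unique_keys]


keys = [0, 1, 0, 1, 0]
values = [0b1100, 0b1010, 0b0110, 0b0011, 0b0101]
and_result = grouped_bitwise(keys, values, and_)
or_result = grouped_bitwise(keys, values, or_)
xor_result = grouped_bitwise(keys, values, xor)
print(and_result)  # ([0, 1], [4, 2])
print(or_result)   # ([0, 1], [15, 11])
print(xor_result)  # ([0, 1], [15, 9])
```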

aggregate(values, operator, skipna=True, ddof=1)[source]

Group another array of values and apply a reduction to each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – If True, NaN values are skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:
  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Return type:

Tuple[groupable, groupable]

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype

Examples

>>> import arkouda as ak
>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
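
As a rough illustration of how a reduction can be selected by operator name, the pure-Python sketch below mirrors the userID/dayOfWeek scenario from the introduction. The `REDUCTIONS` table, this `aggregate` function, and the sample data are hypothetical stand-ins written for this note; they are not Arkouda's implementation, and only a handful of operators are modelled.

```python
# Hypothetical operator table; Arkouda keeps its own list in GroupBy.Reductions.
REDUCTIONS = {
    "sum": sum,
    "min": min,
    "max": max,
    "nunique": lambda group: len(set(group)),
}


def aggregate(keys, values, operator):
    """Apply the named reduction to the values of each key."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same size")
    if operator not in REDUCTIONS:
        raise ValueError(f"unsupported operator: {operator}")
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    unique_keys = sorted(groups)
    return unique_keys, [REDUCTIONS[operator](groups[k]) for k in unique_keys]


day_of_week = [1, 1, 2, 2, 2, 3]
user_id = [10, 10, 11, 12, 11, 10]
day, num_ids = aggregate(day_of_week, user_id, "nunique")
print(day, num_ids)  # [1, 2, 3] [1, 2, 1]
```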
all(values)[source]

Group another array of values and perform an “and” reduction on each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_all (pdarray, bool) – One bool per unique key in the GroupBy instance

Return type:

Tuple[Union[pdarray, List[Union[pdarray, Strings]]], pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values)[source]

Group another array of values and perform an “or” reduction on each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Return type:

Tuple[Union[pdarray, List[Union[pdarray, Strings]]], pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values)[source]

Group another array of values and return the location of the first maximum of each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
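
The note about indices referring to the original array can be checked with a small pure-Python sketch that reuses the data from the example above. `grouped_argmax` is a hypothetical helper written for illustration, not Arkouda's implementation.

```python
def grouped_argmax(keys, values):
    """Return (unique_keys, indices); each index points into the original values."""
    best = {}  # key -> index of the first maximum seen so far
    for i, (k, v) in enumerate(zip(keys, values)):
        # Strict '>' keeps the earliest index when a maximum is repeated
        if k not in best or v > values[best[k]]:
            best[k] = i
    unique_keys = sorted(best)
    return unique_keys, [best[k] for k in unique_keys]


a = [3, 3, 4, 3, 3, 2, 3, 2, 4, 2]
b = [3, 3, 3, 4, 1, 1, 3, 3, 3, 4]
keys_out, idx = grouped_argmax(a, b)
print(keys_out, idx)  # [2, 3, 4] [9, 3, 2]
```

Indexing `b` with the returned positions recovers the per-group maxima (4, 4, 3), which is why the indices are expressed in the original ordering.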
argmin(values)[source]

Group another array of values and return the location of the first minimum of each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmin is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
broadcast(values, permute=True)[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if values is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.

Examples

>>> import arkouda as ak
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
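
The effect of permute can also be sketched in plain Python. The `broadcast` function below is a hypothetical stand-in written for this note, not Arkouda's implementation; it assumes one value per group, supplied in sorted key order.

```python
def broadcast(keys, group_values, permute=True):
    """Replicate one value per group onto every row of that group."""
    unique_keys = sorted(set(keys))
    if len(group_values) != len(unique_keys):
        raise ValueError("need exactly one value per group")
    value_of = dict(zip(unique_keys, group_values))
    if permute:
        # Original row order: look up each row's group value in place
        return [value_of[k] for k in keys]
    # Grouped order: rows sorted by key, so each group's segment is contiguous
    return [value_of[k] for k in sorted(keys)]


a = [0, 1, 0, 1, 0]
original_order = broadcast(a, [3, 5])
grouped_order = broadcast(a, [3, 5], permute=False)
print(original_order)  # [3, 5, 3, 5, 3]
print(grouped_order)   # [3, 3, 3, 5, 5]
```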
static build_from_components(user_defined_name=None, **kwargs)[source]

Build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str, optional) – Passing a name will init the new GroupBy and assign it the given name

  • kwargs (dict) – Dictionary of components required for rebuilding the GroupBy. Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values)[source]

Count the number of elements in each group.

NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be counted by group (excluding NaN values).

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).

Return type:

List[pdarray|Strings], pdarray|int64

Examples

>>> import arkouda as ak
>>> import numpy as np
>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
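
The difference between count (NaN values excluded) and size (every row counted) can be seen side by side in a pure-Python sketch that reuses the data from the example above. `grouped_size_and_count` is a hypothetical helper, not an Arkouda function.

```python
import math


def grouped_size_and_count(keys, values):
    """Return (unique_keys, sizes, counts); counts ignore NaN values."""
    sizes, counts = {}, {}
    for k, v in zip(keys, values):
        sizes[k] = sizes.get(k, 0) + 1
        is_nan = isinstance(v, float) and math.isnan(v)
        counts[k] = counts.get(k, 0) + (0 if is_nan else 1)
    unique_keys = sorted(sizes)
    return (
        unique_keys,
        [sizes[k] for k in unique_keys],
        [counts[k] for k in unique_keys],
    )


a = [1, 0, -1, 1, 0, -1]
b = [1.0, math.nan, -1.0, math.nan, math.nan, -1.0]
keys_out, sizes, counts = grouped_size_and_count(a, b)
print(keys_out, sizes, counts)  # [-1, 0, 1] [2, 2, 2] [2, 0, 1]
```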
first(values)[source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:
  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

Return type:

Tuple[groupable, groupable_element_type]

head(values, n=5, return_indices=True)[source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default True) – If True, return the indices of the selected values. Otherwise, return the selected values.

Returns:
  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result contains indices; otherwise, it contains values.

Return type:

Tuple[groupable, groupable_element_type]

Examples

>>> import arkouda as ak
>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = ak.GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
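
The relationship between the returned indices and values can be sketched in plain Python with the same data as the example above. `grouped_head` is a hypothetical helper written for this note, not Arkouda's implementation.

```python
def grouped_head(keys, n=5):
    """Return (unique_keys, indices) of the first n rows of each group."""
    rows = {}
    for i, k in enumerate(keys):
        rows.setdefault(k, []).append(i)
    unique_keys = sorted(rows)
    # Slicing past the end of a short group simply returns the whole group
    indices = [i for k in unique_keys for i in rows[k][:n]]
    return unique_keys, indices


a = [i % 3 for i in range(10)]
v2 = [-2 * i for i in range(10)]
unique_keys, idx = grouped_head(a, n=2)
values2 = [v2[i] for i in idx]  # values are the indexed rows of the input
print(unique_keys, idx, values2)
# [0, 1, 2] [0, 3, 1, 4, 2, 5] [0, -6, -2, -8, -4, -10]
```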
is_registered()[source]

Return True if the object is contained in the registry.

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

See also

register, attach, unregister, unregister_groupby_by_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values, skipna=True)[source]

Group another array of values and return the maximum of each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – If True, NaN values are skipped

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values, skipna=True)[source]

Group another array of values and compute the mean of each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – If True, NaN values are skipped

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values, skipna=True)[source]

Group another array of values and compute the median of each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – If True, NaN values are skipped

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values, skipna=True)[source]

Group another array of values and return the minimum of each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – If True, NaN values are skipped

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values)[source]

Return the most common value in each group.

If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:
  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group

Return type:

Tuple[groupable, groupable]

nunique(values)[source]

Group another array of values and return the number of unique values in each group.

Group using the permutation stored in the GroupBy instance.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:
  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> import arkouda as ak
>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
(array([1, 2, 3, 4]), array([2, 2, 3, 1]))
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
prod(values, skipna=True)[source]

Group another array of values and compute the product of each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – If True, NaN values are skipped

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name)[source]

Register this GroupBy object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – User-defined name the GroupBy is to be registered under; this will be the root name for underlying components

Returns:

The same GroupBy, which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

See also

unregister, attach, unregister_groupby_by_name, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group.

You can specify either the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility.

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the groupby keys and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – If True, permute the samples according to group. Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray
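
To illustrate the constraint that n must not exceed the smallest group when sampling without replacement, and the role of a seed in reproducibility, here is a hedged pure-Python sketch. `grouped_sample_indices` is a hypothetical helper; it returns a dict of per-group indices for readability, whereas the method documented here returns a single pdarray, and it does not model frac, weights, or replace.

```python
import random


def grouped_sample_indices(keys, n=1, seed=None):
    """Pick n distinct row indices from every group (no replacement)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    rows = {}
    for i, k in enumerate(keys):
        rows.setdefault(k, []).append(i)
    if n > min(len(r) for r in rows.values()):
        raise ValueError("n exceeds the smallest group size")
    return {k: rng.sample(rows[k], n) for k in sorted(rows)}


keys = [3, 1, 4, 4, 4, 1, 3, 3, 2, 2]
picked = grouped_sample_indices(keys, n=2, seed=42)
picked_again = grouped_sample_indices(keys, n=2, seed=42)
print(picked == picked_again)  # True: same seed, same sample
```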

size()[source]

Count the number of elements in each group, i.e. the number of times each key appears.

This counts the total number of rows (including NaN values).

Parameters:

none

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

Return type:

List[pdarray|Strings], pdarray|int64

See also

count

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values, skipna=True, ddof=1)[source]

Group another array of values and compute the standard deviation of each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – If True, NaN values are skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
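
The role of ddof can be checked with a few lines of plain Python, using the standard library's statistics module as a reference. `std_with_ddof` is a hypothetical helper written for this note, not an Arkouda function. The input [1, 3, 4] is chosen to match the values that fall into group 2 in the example for this method.

```python
import math
import statistics


def std_with_ddof(x, ddof=1):
    """Square root of the sum of squared deviations divided by (N - ddof)."""
    n = len(x)
    mean = sum(x) / n
    return math.sqrt(sum((v - mean) ** 2 for v in x) / (n - ddof))


x = [1, 3, 4]
sample_std = std_with_ddof(x, ddof=1)      # divisor N - 1, approximately 1.5275
population_std = std_with_ddof(x, ddof=0)  # divisor N, approximately 1.2472

# The standard library agrees: stdev uses N - 1, pstdev uses N
print(math.isclose(sample_std, statistics.stdev(x)))
print(math.isclose(population_std, statistics.pstdev(x)))
```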

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
sum(values, skipna=True)[source]

Group another array of values and sum each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – If True, NaN values are skipped

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values, n=5, return_indices=True)[source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default True) – If True, return the indices of the selected values. Otherwise, return the selected values.

Returns:
  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result contains indices; otherwise, it contains values.

Return type:

Tuple[groupable, groupable_element_type]

Examples

>>> import arkouda as ak
>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = ak.GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5.

The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute”. When set to “single”, the dataset is written to a single file. When set to “distribute”, the dataset is written to one file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.

Notes

GroupBy is not currently supported by Parquet.

unique(values)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values from which to find the unique values of each group

Returns:
  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Return type:

(list of) pdarray-like, (list of) SegArray

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object.

Unregister this GroupBy object in the arkouda server, which was previously registered using register() and/or attached to using attach().

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

See also

register, attach, unregister_groupby_by_name, is_registered

Notes

Objects registered with the server are immune to deletion until they are unregistered.

var(values, skipna=True, ddof=1)[source]

Group another array of values and compute the variance of each group’s values.

Group using the permutation stored in the GroupBy instance.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – If True, NaN values are skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:
  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Return type:

Tuple[groupable, pdarray]

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> import arkouda as ak
>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))