Summarizing Data

Descriptive Statistics

Simple descriptive statistics are available as reduction methods on pdarray objects.

>>> A = ak.randint(-10, 11, 1000)
>>> A.min()
-10
>>> A.max()
10
>>> A.sum()
13
>>> A.mean()
0.013
>>> A.var()
36.934176000000015
>>> A.std()
6.07734942223993

The list of reductions supported on pdarray objects is:

pdarray.any(axis=None, keepdims=False)

Return True iff any element of the array along the given axis evaluates to True.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

boolean if axis is omitted, else pdarray if axis is supplied

Return type:

boolean or pdarray

Examples

>>> import arkouda as ak
>>> ak.any(ak.array([True,False,False]))
np.True_
>>> ak.any(ak.array([[True,True,False],[False,True,True]]),axis=0)
array([True True True])
>>> ak.any(ak.array([[True,True,True],[False,False,False]]),axis=0,keepdims=True)
array([array([True True True])])
>>> ak.any(ak.array([[True,True,True],[False,False,False]]),axis=1,keepdims=True)
array([array([True]) array([False])])
>>> ak.array([True,False,False]).any()
np.True_

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Notes

Works as a method of a pdarray (e.g. a.any()) or a standalone function (e.g. ak.any(a))
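
The effect of axis and keepdims on a 2-D input can be modelled without a running Arkouda server. The following is a minimal pure-Python sketch on nested lists; any_along_axis is a hypothetical helper written for illustration and is not part of Arkouda.

```python
# Pure-Python model of an axis-wise `any` reduction on a 2-D nested list.
# `any_along_axis` is a hypothetical helper, not an Arkouda function.

def any_along_axis(rows, axis, keepdims=False):
    if axis == 0:
        # Collapse the rows: one result per column.
        result = [any(col) for col in zip(*rows)]
        return [result] if keepdims else result
    if axis == 1:
        # Collapse the columns: one result per row.
        result = [any(row) for row in rows]
        return [[r] for r in result] if keepdims else result
    raise ValueError("this sketch only handles axis 0 or 1")

data = [[True, True, True], [False, False, False]]
print(any_along_axis(data, axis=0))                 # [True, True, True]
print(any_along_axis(data, axis=0, keepdims=True))  # [[True, True, True]]
print(any_along_axis(data, axis=1, keepdims=True))  # [[True], [False]]
```

With keepdims=True the reduced axis is kept with length 1, which is why the results above mirror the nested shapes shown in the Examples.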

pdarray.all(axis=None, keepdims=False)

Return True iff all elements of the array along the given axis evaluate to True.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

boolean if axis is omitted, pdarray if axis is supplied

Return type:

boolean or pdarray

Examples

>>> import arkouda as ak
>>> ak.all(ak.array([True,False,False]))
np.False_
>>> ak.all(ak.array([[True,True,False],[False,True,True]]),axis=0)
array([False True False])
>>> ak.all(ak.array([[True,True,True],[False,False,False]]),axis=0,keepdims=True)
array([array([False False False])])
>>> ak.all(ak.array([[True,True,True],[False,False,False]]),axis=1,keepdims=True)
array([array([True]) array([False])])
>>> ak.array([True,False,False]).all()
np.False_

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Notes

Works as a method of a pdarray (e.g. a.all()) or a standalone function (e.g. ak.all(a))

pdarray.is_sorted(axis=None, keepdims=False)

Return True iff the array (or given axis of the array) is monotonically non-decreasing.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

boolean if axis is omitted, else pdarray if axis is supplied

Return type:

boolean or pdarray

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Examples

>>> import arkouda as ak
>>> ak.is_sorted(ak.array([1,2,3,4,5]))
np.True_
>>> ak.is_sorted(ak.array([5,4,3,2,1]))
np.False_
>>> ak.array([[1,2,3],[5,4,3]]).is_sorted(axis=1)
array([True False])

Notes

Works as a method of a pdarray (e.g. a.is_sorted()) or a standalone function (e.g. ak.is_sorted(a))
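
"Monotonically non-decreasing" means each element is less than or equal to its successor, so repeated values still count as sorted. A minimal pure-Python sketch of that check, using a hypothetical helper is_non_decreasing that is not part of Arkouda:

```python
# Pure-Python model of the "monotonically non-decreasing" test.
# `is_non_decreasing` is a hypothetical helper, not an Arkouda function.

def is_non_decreasing(values):
    # True when every element is <= the element that follows it.
    return all(a <= b for a, b in zip(values, values[1:]))

print(is_non_decreasing([1, 2, 3, 4, 5]))  # True
print(is_non_decreasing([5, 4, 3, 2, 1]))  # False
print(is_non_decreasing([1, 2, 2, 3]))     # True, equal neighbours are allowed

# Row-wise check, analogous to axis=1 on a 2-D array.
print([is_non_decreasing(row) for row in [[1, 2, 3], [5, 4, 3]]])  # [True, False]
```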

pdarray.sum(axis=None, keepdims=False)

Return sum of array elements along the given axis.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

numeric_scalars if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis

Return type:

numeric_scalars or pdarray

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Examples

>>> import arkouda as ak
>>> ak.sum(ak.array([1,2,3,4,5]))
np.int64(15)
>>> ak.sum(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(17.5)
>>> ak.array([[1,2,3],[5,4,3]]).sum(axis=1)
array([6 12])

Notes

Works as a method of a pdarray (e.g. a.sum()) or a standalone function (e.g. ak.sum(a))

pdarray.prod(axis=None, keepdims=False)

Return the product of array elements along the given axis.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

numeric_scalars if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis

Return type:

numeric_scalars or pdarray

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Examples

>>> import arkouda as ak
>>> ak.prod(ak.array([1,2,3,4,5]))
np.int64(120)
>>> ak.prod(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(324.84375)
>>> ak.array([[1,2,3],[5,4,3]]).prod(axis=1)
array([6 60])

Notes

Works as a method of a pdarray (e.g. a.prod()) or a standalone function (e.g. ak.prod(a))

pdarray.min(axis=None, keepdims=False)

Return min of array elements along the given axis.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

numeric_scalar if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis

Return type:

numeric_scalar or pdarray

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Examples

>>> import arkouda as ak
>>> ak.min(ak.array([1,2,3,4,5]))
np.int64(1)
>>> ak.min(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(1.5)
>>> ak.array([[1,2,3],[5,4,3]]).min(axis=1)
array([1 3])

Notes

Works as a method of a pdarray (e.g. a.min()) or a standalone function (e.g. ak.min(a))

pdarray.max(axis=None, keepdims=False)

Return max of array elements along the given axis.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

numeric_scalar if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis

Return type:

numeric_scalar or pdarray

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Examples

>>> import arkouda as ak
>>> ak.max(ak.array([1,2,3,4,5]))
np.int64(5)
>>> ak.max(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(5.5)
>>> ak.array([[1,2,3],[5,4,3]]).max(axis=1)
array([3 5])

Notes

Works as a method of a pdarray (e.g. a.max()) or a standalone function (e.g. ak.max(a))

pdarray.argmin(axis=None, keepdims=False)

Return index of the first occurrence of the minimum along the given axis.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

int64 or uint64 if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis

Return type:

int64, uint64 or pdarray

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Examples

>>> import arkouda as ak
>>> ak.argmin(ak.array([1,2,3,4,5]))
np.int64(0)
>>> ak.argmin(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.int64(4)
>>> ak.array([[1,2,3],[5,4,3]]).argmin(axis=1)
array([0 2])

Notes

Works as a method of a pdarray (e.g. a.argmin()) or a standalone function (e.g. ak.argmin(a))

pdarray.argmax(axis=None, keepdims=False)

Return index of the first occurrence of the maximum along the given axis.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

int64 or uint64 if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis

Return type:

int64, uint64 or pdarray

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

Examples

>>> import arkouda as ak
>>> ak.argmax(ak.array([1,2,3,4,5]))
np.int64(4)
>>> ak.argmax(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.int64(0)
>>> ak.array([[1,2,3],[5,4,3]]).argmax(axis=1)
array([2 0])

Notes

Works as a method of a pdarray (e.g. a.argmax()) or a standalone function (e.g. ak.argmax(a))
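
Both argmin and argmax are documented as returning the index of the first occurrence, which matters when the extreme value is repeated. The pure-Python sketch below illustrates that tie-breaking rule; first_argmin and first_argmax are hypothetical helpers, not Arkouda functions.

```python
# Pure-Python model of "index of the first occurrence" tie-breaking.
# `first_argmin` and `first_argmax` are hypothetical helpers.

def first_argmin(values):
    # min() returns the first minimal item it encounters,
    # so a repeated minimum resolves to the lowest index.
    return min(range(len(values)), key=values.__getitem__)

def first_argmax(values):
    # max() likewise returns the first maximal item it encounters.
    return max(range(len(values)), key=values.__getitem__)

data = [3, 1, 4, 1, 5, 5]
print(first_argmin(data))  # 1 (the value 1 also appears at index 3)
print(first_argmax(data))  # 4 (the value 5 also appears at index 5)
```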

pdarray.mean(axis=None, keepdims=False)

Return the mean of the array.

Parameters:

axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

The mean calculated from the pda sum and size, along the axis/axes if those are given.

Return type:

Union[np.float64, pdarray]

Examples

>>> import arkouda as ak
>>> a = ak.arange(10)
>>> ak.mean(a)
np.float64(4.5)
>>> a.mean()
np.float64(4.5)
>>> a = ak.arange(10).reshape(2,5)
>>> a.mean(axis=0)
array([2.5 3.5 4.5 5.5 6.5])
>>> ak.mean(a,axis=0)
array([2.5 3.5 4.5 5.5 6.5])
>>> a.mean(axis=1)
array([2.00000000000000000 7.00000000000000000])
>>> ak.mean(a,axis=1)
array([2.00000000000000000 7.00000000000000000])

Raises:

TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown

pdarray.var(ddof=0, axis=None, keepdims=False)

Return the variance of values in the array.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

The scalar variance of the array, or the variance along the axis/axes if supplied

Return type:

Union[np.float64, pdarray]

Examples

>>> import arkouda as ak
>>> a = ak.arange(10)
>>> ak.var(a)
np.float64(8.25)
>>> a.var()
np.float64(8.25)
>>> a = ak.arange(10).reshape(2,5)
>>> a.var(axis=0)
array([6.25 6.25 6.25 6.25 6.25])
>>> ak.var(a,axis=0)
array([6.25 6.25 6.25 6.25 6.25])
>>> a.var(axis=1)
array([2.00000000000000000 2.00000000000000000])
>>> ak.var(a,axis=1)
array([2.00000000000000000 2.00000000000000000])

Raises:

TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown

See also

mean, std

Notes

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean of the squared deviations is normally calculated as sum((x - x.mean())**2) / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
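
The role of ddof as an adjustment to the divisor can be sketched in plain Python. The variance function below is a hypothetical stand-in written for illustration, not Arkouda's implementation; it assumes a flat list of numbers.

```python
# Pure-Python model of variance with a "delta degrees of freedom" divisor.
# `variance` is a hypothetical helper, not the Arkouda implementation.

def variance(values, ddof=0):
    n = len(values)
    if ddof >= n:
        raise ValueError("ddof must be smaller than the number of values")
    mean = sum(values) / n
    # Sum of squared deviations from the mean, divided by N - ddof.
    return sum((x - mean) ** 2 for x in values) / (n - ddof)

data = list(range(10))
print(variance(data))          # 8.25, the same value as ak.var(ak.arange(10))
print(variance(data, ddof=1))  # 82.5 / 9, roughly 9.17
```

The only difference between the two calls is the divisor: 10 for ddof=0 and 9 for ddof=1.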

pdarray.std(ddof=0, axis=None, keepdims=False)

Return the standard deviation of values in the array. The standard deviation is implemented as the square root of the variance.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.

Returns:

The scalar standard deviation of the array, or the standard deviation along the axis/axes if supplied

Return type:

Union[np.float64, pdarray]

Examples

>>> import arkouda as ak
>>> a = ak.arange(10)
>>> ak.std(a)
np.float64(2.8722813232690143)
>>> a.std()
np.float64(2.8722813232690143)
>>> a = ak.arange(10).reshape(2,5)
>>> a.std(axis=0)
array([2.5 2.5 2.5 2.5 2.5])
>>> ak.std(a,axis=0)
array([2.5 2.5 2.5 2.5 2.5])
>>> a.std(axis=1)
array([1.4142135623730951 1.4142135623730951])
>>> ak.std(a,axis=1)
array([1.4142135623730951 1.4142135623730951])

Raises:

TypeError – Raised if pda is not a pdarray instance or ddof is not an integer
ValueError – Raised if ddof is an integer < 0
RuntimeError – Raised if there’s a server-side error thrown

See also

mean, var

Notes

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as sum((x - x.mean())**2) / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
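
For a quick cross-check on small data, Python's standard statistics module offers the same two conventions: pstdev divides by N (the analogue of ddof=0) and stdev divides by N - 1 (the analogue of ddof=1). This is only a local comparison sketch and does not involve Arkouda.

```python
import math
import statistics

data = list(range(10))

# Population standard deviation (divisor N), the analogue of ddof=0.
pop_std = statistics.pstdev(data)
# Sample standard deviation (divisor N - 1), the analogue of ddof=1.
sample_std = statistics.stdev(data)

print(pop_std)     # about 2.872, in line with the ak.std(a) example
print(sample_std)  # about 3.028

# Each is the square root of the corresponding variance.
print(math.isclose(pop_std, math.sqrt(statistics.pvariance(data))))    # True
print(math.isclose(sample_std, math.sqrt(statistics.variance(data))))  # True
```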

pdarray.mink(k)

Compute the minimum “k” values. See arkouda.mink for details.

Return type:

pdarray

pdarray.maxk(k)

Compute the maximum “k” values. See arkouda.maxk for details.

Return type:

pdarray

pdarray.argmink(k)

Finds the indices corresponding to the k minimum values of an array. See arkouda.argmink for details.

Return type:

pdarray

pdarray.argmaxk(k)

Finds the indices corresponding to the k maximum values of an array. See arkouda.argmaxk for details.

Return type:

pdarray

Histogram

Arkouda can compute simple histograms on pdarray data. Currently, this function can only create histograms over evenly spaced bins between the min and max of the data. In the future, we plan to support using a pdarray to define custom bin edges.

arkouda.histogram(pda, bins=10, range=None)

Compute a histogram of evenly spaced bins over the range of an array.

Parameters:

pda (pdarray) – The values to histogram
bins (int_scalars, default=10) – The number of equal-size bins to use
range ((minVal, maxVal), optional) – The range of the values to count. Values outside of this range are dropped. By default, all values are counted.

Returns:

The number of values present in each bin and the bin edges

Return type:

(pdarray, Union[pdarray, int64 or float64])

Raises:

TypeError – Raised if the parameter is not a pdarray or if bins is not an int.
ValueError – Raised if bins < 1
NotImplementedError – Raised if pdarray dtype is bool or uint8

See also

value_counts, histogram2d

Notes

The bins are evenly spaced in the interval [pda.min(), pda.max()]. If range parameter is provided, the interval is [range[0], range[1]].

Examples

>>> import arkouda as ak
>>> import matplotlib.pyplot as plt
>>> A = ak.arange(0, 10, 1)
>>> nbins = 3
>>> h, b = ak.histogram(A, bins=nbins)
>>> h
array([3 3 4])
>>> b
array([0.00000000... 3.00000000... 6.00000000... 9.00000000...])

To plot, export the left edges and the histogram to NumPy:

>>> b_np = b.to_ndarray()
>>> import numpy as np
>>> b_widths = np.diff(b_np)
>>> plt.bar(b_np[:-1], h.to_ndarray(), width=b_widths, align='edge', edgecolor='black')
<BarContainer object of 3 artists>
>>> plt.show() # doctest: +SKIP
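
The even binning over [min, max] can be reproduced by hand for small inputs. The sketch below is a pure-Python model, not Arkouda's implementation; even_histogram is a hypothetical helper, and placing the maximum value in the last bin is an assumption chosen to agree with the counts shown above.

```python
# Pure-Python model of evenly spaced binning between min and max.
# `even_histogram` is a hypothetical helper, not an Arkouda function.

def even_histogram(values, bins=10):
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("this sketch needs at least two distinct values")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Assumption: the maximum value is counted in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    # bins + 1 edges, from lo to hi inclusive.
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

counts, edges = even_histogram(list(range(10)), bins=3)
print(counts)  # [3, 3, 4]
print(edges)   # [0.0, 3.0, 6.0, 9.0]
```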

Value Counts

For int64 pdarray objects, it is often useful to count only the unique values that appear. This function finds all unique values and their counts.

arkouda.value_counts(pda)

Count the occurrences of the unique values of an array.

Parameters:

pda (pdarray) – The array of values to count

Return type:

tuple[Union[pdarray, Strings, Categorical, Sequence[Union[pdarray, Strings, Categorical]]], pdarray]

Returns:

unique_values (pdarray, int64 or Strings) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs

Raises:

TypeError – Raised if the parameter is not a pdarray

See also

unique, histogram

Notes

This function differs from histogram() in that it only returns counts for values that are present, leaving out empty “bins”. This function delegates all logic to the unique() method where the return_counts parameter is set to True.

Examples

>>> import arkouda as ak
>>> A = ak.array([2, 0, 2, 4, 0, 0])
>>> ak.value_counts(A)
(array([0 2 4]), array([3 2 1]))
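
For comparison, the same result shape (sorted unique values paired with their counts, with no empty bins) can be modelled locally with collections.Counter. This is a pure-Python sketch for small lists; count_values is a hypothetical helper, not an Arkouda function.

```python
from collections import Counter

# Pure-Python model of counting unique values.
# `count_values` is a hypothetical helper, not an Arkouda function.

def count_values(values):
    tally = Counter(values)
    unique = sorted(tally)  # unique values in ascending order
    return unique, [tally[u] for u in unique]

print(count_values([2, 0, 2, 4, 0, 0]))  # ([0, 2, 4], [3, 2, 1])
```

Values that never occur, such as 1 and 3 here, do not appear in the output, which is the difference from a histogram noted above.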