Summarizing Data
Descriptive Statistics
Simple descriptive statistics are available as reduction methods on pdarray objects.
>>> A = ak.randint(-10, 11, 1000)
>>> A.min()
-10
>>> A.max()
10
>>> A.sum()
13
>>> A.mean()
0.013
>>> A.var()
36.934176000000015
>>> A.std()
6.07734942223993
The list of reductions supported on pdarray objects is:
- pdarray.any(axis=None, keepdims=False)
Return True iff any element of the array along the given axis evaluates to True.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
boolean if axis is omitted, else pdarray if axis is supplied
- Return type:
boolean or pdarray
Examples
>>> import arkouda as ak
>>> ak.any(ak.array([True,False,False]))
np.True_
>>> ak.any(ak.array([[True,True,False],[False,True,True]]),axis=0)
array([True True True])
>>> ak.any(ak.array([[True,True,True],[False,False,False]]),axis=0,keepdims=True)
array([array([True True True])])
>>> ak.any(ak.array([[True,True,True],[False,False,False]]),axis=1,keepdims=True)
array([array([True]) array([False])])
>>> ak.array([True,False,False]).any()
np.True_
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Notes
Works as a method of a pdarray (e.g. a.any()) or a standalone function (e.g. ak.any(a))
- pdarray.all(axis=None, keepdims=False)
Return True iff all elements of the array along the given axis evaluate to True.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
boolean if axis is omitted, pdarray if axis is supplied
- Return type:
boolean or pdarray
Examples
>>> import arkouda as ak
>>> ak.all(ak.array([True,False,False]))
np.False_
>>> ak.all(ak.array([[True,True,False],[False,True,True]]),axis=0)
array([False True False])
>>> ak.all(ak.array([[True,True,True],[False,False,False]]),axis=0,keepdims=True)
array([array([False False False])])
>>> ak.all(ak.array([[True,True,True],[False,False,False]]),axis=1,keepdims=True)
array([array([True]) array([False])])
>>> ak.array([True,False,False]).all()
np.False_
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Notes
Works as a method of a pdarray (e.g. a.all()) or a standalone function (e.g. ak.all(a))
- pdarray.is_sorted(axis=None, keepdims=False)
Return True iff the array (or given axis of the array) is monotonically non-decreasing.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
boolean if axis is omitted, else pdarray if axis is supplied
- Return type:
boolean or pdarray
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Examples
>>> import arkouda as ak
>>> ak.is_sorted(ak.array([1,2,3,4,5]))
np.True_
>>> ak.is_sorted(ak.array([5,4,3,2,1]))
np.False_
>>> ak.array([[1,2,3],[5,4,3]]).is_sorted(axis=1)
array([True False])
Notes
Works as a method of a pdarray (e.g. a.is_sorted()) or a standalone function (e.g. ak.is_sorted(a))
- pdarray.sum(axis=None, keepdims=False)
Return the sum of array elements along the given axis.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
numeric_scalars if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis
- Return type:
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Examples
>>> import arkouda as ak
>>> ak.sum(ak.array([1,2,3,4,5]))
np.int64(15)
>>> ak.sum(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(17.5)
>>> ak.array([[1,2,3],[5,4,3]]).sum(axis=1)
array([6 12])
Notes
Works as a method of a pdarray (e.g. a.sum()) or a standalone function (e.g. ak.sum(a))
- pdarray.prod(axis=None, keepdims=False)
Return the product of array elements along the given axis.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
numeric_scalars if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis
- Return type:
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Examples
>>> import arkouda as ak
>>> ak.prod(ak.array([1,2,3,4,5]))
np.int64(120)
>>> ak.prod(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(324.84375)
>>> ak.array([[1,2,3],[5,4,3]]).prod(axis=1)
array([6 60])
Notes
Works as a method of a pdarray (e.g. a.prod()) or a standalone function (e.g. ak.prod(a))
- pdarray.min(axis=None, keepdims=False)
Return the minimum of array elements along the given axis.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
numeric_scalar if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis
- Return type:
numeric_scalar or pdarray
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Examples
>>> import arkouda as ak
>>> ak.min(ak.array([1,2,3,4,5]))
np.int64(1)
>>> ak.min(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(1.5)
>>> ak.array([[1,2,3],[5,4,3]]).min(axis=1)
array([1 3])
Notes
Works as a method of a pdarray (e.g. a.min()) or a standalone function (e.g. ak.min(a))
- pdarray.max(axis=None, keepdims=False)
Return the maximum of array elements along the given axis.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
numeric_scalar if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis
- Return type:
numeric_scalar or pdarray
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Examples
>>> import arkouda as ak
>>> ak.max(ak.array([1,2,3,4,5]))
np.int64(5)
>>> ak.max(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.float64(5.5)
>>> ak.array([[1,2,3],[5,4,3]]).max(axis=1)
array([3 5])
Notes
Works as a method of a pdarray (e.g. a.max()) or a standalone function (e.g. ak.max(a))
- pdarray.argmin(axis=None, keepdims=False)
Return the index of the first occurrence of the minimum along the given axis.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
int64 or uint64 if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis
- Return type:
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Examples
>>> import arkouda as ak
>>> ak.argmin(ak.array([1,2,3,4,5]))
np.int64(0)
>>> ak.argmin(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.int64(4)
>>> ak.array([[1,2,3],[5,4,3]]).argmin(axis=1)
array([0 2])
Notes
Works as a method of a pdarray (e.g. a.argmin()) or a standalone function (e.g. ak.argmin(a))
- pdarray.argmax(axis=None, keepdims=False)
Return the index of the first occurrence of the maximum along the given axis.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
int64 or uint64 if axis is omitted, in which case the operation is done over the entire array; pdarray if axis is supplied, in which case the operation is done along that axis
- Return type:
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
Examples
>>> import arkouda as ak
>>> ak.argmax(ak.array([1,2,3,4,5]))
np.int64(4)
>>> ak.argmax(ak.array([5.5,4.5,3.5,2.5,1.5]))
np.int64(0)
>>> ak.array([[1,2,3],[5,4,3]]).argmax(axis=1)
array([2 0])
Notes
Works as a method of a pdarray (e.g. a.argmax()) or a standalone function (e.g. ak.argmax(a))
- pdarray.mean(axis=None, keepdims=False)
Return the mean of the array.
- Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
The mean calculated from the pda sum and size, along the axis/axes if those are given.
- Return type:
Examples
>>> import arkouda as ak
>>> a = ak.arange(10)
>>> ak.mean(a)
np.float64(4.5)
>>> a.mean()
np.float64(4.5)
>>> a = ak.arange(10).reshape(2,5)
>>> a.mean(axis=0)
array([2.5 3.5 4.5 5.5 6.5])
>>> ak.mean(a,axis=0)
array([2.5 3.5 4.5 5.5 6.5])
>>> a.mean(axis=1)
array([2.00000000000000000 7.00000000000000000])
>>> ak.mean(a,axis=1)
array([2.00000000000000000 7.00000000000000000])
- Raises:
TypeError – Raised if pda is not a pdarray instance
RuntimeError – Raised if there’s a server-side error thrown
- pdarray.var(ddof=0, axis=None, keepdims=False)
Return the variance of values in the array.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
The scalar variance of the array, or the variance along the axis/axes if supplied
- Return type:
Examples
>>> import arkouda as ak
>>> a = ak.arange(10)
>>> ak.var(a)
np.float64(8.25)
>>> a.var()
np.float64(8.25)
>>> a = ak.arange(10).reshape(2,5)
>>> a.var(axis=0)
array([6.25 6.25 6.25 6.25 6.25])
>>> ak.var(a,axis=0)
array([6.25 6.25 6.25 6.25 6.25])
>>> a.var(axis=1)
array([2.00000000000000000 2.00000000000000000])
>>> ak.var(a,axis=1)
array([2.00000000000000000 2.00000000000000000])
- Raises:
TypeError – Raised if pda is not a pdarray instance
ValueError – Raised if the ddof >= pdarray size
RuntimeError – Raised if there’s a server-side error thrown
Notes
The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2). The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
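The effect of ddof on the divisor can be checked by hand. The following is a minimal pure-Python sketch rather than Arkouda code: the helper name variance is hypothetical, and it runs on a local list instead of a server-side pdarray.

```python
import statistics


def variance(values, ddof=0):
    """Average squared deviation from the mean, using the divisor N - ddof.

    Hypothetical helper for illustration; it is not part of Arkouda.
    """
    n = len(values)
    if ddof >= n:
        raise ValueError("ddof must be smaller than the number of values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


data = list(range(10))  # same values as ak.arange(10) in the examples

population_var = variance(data)      # divisor N
sample_var = variance(data, ddof=1)  # divisor N - 1

print(population_var)  # 8.25, matching the ak.var(a) example
print(sample_var)

# Cross-check against the standard library's two estimators.
print(statistics.pvariance(data), statistics.variance(data))
```

With ten values the two results differ by the factor 10/9; the gap shrinks as N grows.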
- pdarray.std(ddof=0, axis=None, keepdims=False)
Return the standard deviation of values in the array. The standard deviation is implemented as the square root of the variance.
- Parameters:
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation. If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
- Returns:
The scalar standard deviation of the array, or the standard deviation along the axis/axes if supplied
- Return type:
Examples
>>> import arkouda as ak
>>> a = ak.arange(10)
>>> ak.std(a)
np.float64(2.8722813232690143)
>>> a.std()
np.float64(2.8722813232690143)
>>> a = ak.arange(10).reshape(2,5)
>>> a.std(axis=0)
array([2.5 2.5 2.5 2.5 2.5])
>>> ak.std(a,axis=0)
array([2.5 2.5 2.5 2.5 2.5])
>>> a.std(axis=1)
array([1.4142135623730951 1.4142135623730951])
>>> ak.std(a,axis=1)
array([1.4142135623730951 1.4142135623730951])
- Raises:
TypeError – Raised if pda is not a pdarray instance or ddof is not an integer
ValueError – Raised if ddof is an integer < 0
RuntimeError – Raised if there’s a server-side error thrown
Notes
The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)). The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.
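The axis and keepdims parameters shared by the reductions listed above can be pictured on a plain nested list. The following is a minimal pure-Python sketch, not Arkouda code: sum_axis is a hypothetical helper, it handles only 2-D input, and the result shape for axis=None with keepdims=True follows NumPy's convention as an assumption.

```python
def sum_axis(rows, axis=None, keepdims=False):
    """Sum a rectangular list of lists the way an axis-aware reduction would.

    Hypothetical helper for illustration; it is not part of Arkouda.
    """
    if axis is None:
        # Reduce over every element; keepdims keeps a 1 x 1 shape (assumed).
        total = sum(sum(row) for row in rows)
        return [[total]] if keepdims else total
    if axis == 0:
        # Collapse the rows: one result per column.
        result = [sum(col) for col in zip(*rows)]
        return [result] if keepdims else result
    if axis == 1:
        # Collapse the columns: one result per row.
        result = [sum(row) for row in rows]
        return [[r] for r in result] if keepdims else result
    raise ValueError("this sketch only handles 2-D input")


a = [[1, 2, 3], [5, 4, 3]]

print(sum_axis(a))                         # 18
print(sum_axis(a, axis=0))                 # [6, 6, 6]
print(sum_axis(a, axis=1))                 # [6, 12]
print(sum_axis(a, axis=1, keepdims=True))  # [[6], [12]]
```

The axis=1 result matches the a.sum(axis=1) example for the same values; keepdims only changes the nesting of the result, not its values.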
Histogram
Arkouda can compute simple histograms on pdarray data. Currently, this function can only create histograms over evenly spaced bins, either between the min and max of the data or over an explicit range when one is supplied. In the future, we plan to support using a pdarray to define custom bin edges.
- arkouda.histogram(pda, bins=10, range=None)
Compute a histogram of evenly spaced bins over the range of an array.
- Parameters:
pda (pdarray) – The values to histogram
bins (int_scalars, default=10) – The number of equal-size bins to use
range ((minVal, maxVal), optional) – The range of the values to count. Values outside of this range are dropped. By default, all values are counted.
- Returns:
The number of values present in each bin and the bin edges
- Return type:
- Raises:
TypeError – Raised if the parameter is not a pdarray or if bins is not an int.
ValueError – Raised if bins < 1
NotImplementedError – Raised if pdarray dtype is bool or uint8
See also
value_counts, histogram2d
Notes
The bins are evenly spaced in the interval [pda.min(), pda.max()]. If the range parameter is provided, the interval is [range[0], range[1]].
Examples
>>> import arkouda as ak
>>> import matplotlib.pyplot as plt
>>> A = ak.arange(0, 10, 1)
>>> nbins = 3
>>> h, b = ak.histogram(A, bins=nbins)
>>> h
array([3 3 4])
>>> b
array([0.00000000... 3.00000000... 6.00000000... 9.00000000...])
To plot, export the left edges and the histogram to NumPy:
>>> b_np = b.to_ndarray()
>>> import numpy as np
>>> b_widths = np.diff(b_np)
>>> plt.bar(b_np[:-1], h.to_ndarray(), width=b_widths, align='edge', edgecolor='black')
<BarContainer object of 3 artists>
>>> plt.show() # doctest: +SKIP
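The evenly spaced binning described in the Notes can be reproduced locally to see where each value lands. This is a minimal pure-Python sketch under stated assumptions, not Arkouda's implementation: the helper name histogram_sketch and its value_range parameter are hypothetical, and the last bin is assumed to be closed on the right, which is consistent with the counts shown in the example above.

```python
def histogram_sketch(values, bins=10, value_range=None):
    """Count values in `bins` equal-width bins and return (counts, edges).

    Hypothetical helper for illustration; it is not part of Arkouda.
    """
    if bins < 1:
        raise ValueError("bins must be at least 1")
    if value_range is None:
        lo, hi = min(values), max(values)
    else:
        lo, hi = value_range
    if hi <= lo:
        raise ValueError("this sketch needs hi > lo")
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        if v < lo or v > hi:
            continue  # values outside the range are dropped
        # Clamp so that v == hi falls into the last bin (closed on the right).
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


counts, edges = histogram_sketch(list(range(10)), bins=3)
print(counts)  # [3, 3, 4]
print(edges)   # [0.0, 3.0, 6.0, 9.0]
```

The index computation uses floating-point division, so values that sit extremely close to a bin edge may land on either side; a production implementation would need to treat those cases more carefully.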
Value Counts
For int64 pdarray objects, it is often useful to count only the unique values that appear. This function finds all unique values and their counts.
- arkouda.value_counts(pda)
Count the occurrences of the unique values of an array.
- Parameters:
pda (pdarray) – The array of values to count
- Return type:
tuple[Union[pdarray, Strings, Categorical, Sequence[Union[pdarray, Strings, Categorical]]], pdarray]
- Returns:
unique_values (pdarray, int64 or Strings) – The unique values, sorted in ascending order
counts (pdarray, int64) – The number of times the corresponding unique value occurs
- Raises:
TypeError – Raised if the parameter is not a pdarray
Notes
This function differs from histogram() in that it only returns counts for values that are present, leaving out empty “bins”. This function delegates all logic to the unique() method where the return_counts parameter is set to True.
Examples
>>> import arkouda as ak
>>> A = ak.array([2, 0, 2, 4, 0, 0])
>>> ak.value_counts(A)
(array([0 2 4]), array([3 2 1]))
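The difference from a histogram noted above, that empty “bins” are left out, can be seen in a small local comparison. This is a pure-Python sketch using collections.Counter, not Arkouda code; value_counts_sketch is a hypothetical helper name.

```python
from collections import Counter


def value_counts_sketch(values):
    """Return (unique values in ascending order, matching counts).

    Hypothetical helper for illustration; it is not part of Arkouda.
    """
    tally = Counter(values)
    unique = sorted(tally)
    return unique, [tally[u] for u in unique]


data = [2, 0, 2, 4, 0, 0]  # same values as the example above

unique, counts = value_counts_sketch(data)
print(unique)  # [0, 2, 4]
print(counts)  # [3, 2, 1]

# Counting every integer slot between min and max keeps the empty slots,
# which is what a one-bin-per-integer histogram would report.
dense = [data.count(v) for v in range(min(data), max(data) + 1)]
print(dense)   # [3, 0, 2, 0, 1]
```

For sparse data the dense form grows with the value range, while the unique/count form grows only with the number of distinct values.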