Categoricals

Categorical arrays are a concept from Pandas that speeds up many operations on strings, especially when an array of strings contains many repeated values. A Categorical object stores the unique strings as category labels and represents the values of the original array as integer indices into this category array.
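
The encoding described above can be illustrated with a minimal plain-Python sketch. This is not Arkouda code and does not reflect how the server implements it. The helper name encode and the choice to sort the labels are assumptions made only for this illustration.

```python
def encode(values):
    """Return (categories, codes) for a list of strings."""
    # Unique labels; sorted here only so the result is deterministic.
    categories = sorted(set(values))
    # Map each label to its position in the category list.
    index = {label: i for i, label in enumerate(categories)}
    # One small integer per original value.
    codes = [index[v] for v in values]
    return categories, codes


values = ["red", "blue", "red", "green", "blue", "red"]
categories, codes = encode(values)

print(categories)  # ['blue', 'green', 'red']
print(codes)       # [2, 0, 2, 1, 0, 2]

# The original values can be recovered by indexing the labels with the codes.
decoded = [categories[c] for c in codes]
print(decoded == values)  # True
```

With many repeated values, the label list stays short while the codes array carries one integer per element, which is where the memory and speed benefits come from.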

Construction

The typical way to construct a Categorical is from a Strings object:

class arkouda.Categorical(values, **kwargs)

Represents an array of values belonging to named categories.

Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.

Parameters:

values (Strings, Categorical, pd.Categorical) – Values to convert to categories

NAvalue (str scalar) – The value to use to represent missing/null data

categories

The set of category labels (determined automatically)

Type:

Strings

codes

The category indices of the values or -1 for N/A

Type:

pdarray, int64

permutation

The permutation that groups the values in the same order as categories

Type:

pdarray, int64

segments

When values are grouped, the starting offset of each group

Type:

Union[pdarray, None]

size

The number of items in the array

Type:

int_scalars

nlevels

The number of distinct categories

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The size of each dimension of the array

Type:

tuple

However, if one already has pre-computed unique categories and integer indices, the following constructor is useful:

classmethod Categorical.from_codes(cls, codes, categories, permutation=None, segments=None, **kwargs)

Make a Categorical from codes and categories arrays.

If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.

Parameters:

codes (pdarray, int64) – Category indices of each value

categories (Strings) – Unique category labels

permutation (pdarray, int64) – The permutation that groups the values in the same order as categories

segments (pdarray, int64) – When values are grouped, the starting offset of each group

Returns:

The Categorical object created from the input parameters

Return type:

Categorical

Raises:

TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
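
To make the four arguments concrete, the following plain-Python sketch models what codes, categories, permutation, and segments could look like for a small input. It is only a model of the descriptions above, not Arkouda code. The stable ordering within each group is an assumption of this sketch.

```python
# Plain-Python model (not Arkouda code) of the from_codes arguments.
categories = ["blue", "green", "red"]   # unique labels
codes = [2, 0, 2, 1, 0, 2]              # index into categories for each value

# permutation: indices that reorder the values so equal codes are adjacent,
# in category order. Python's sort is stable, so original order is kept
# within each group (an assumption of this sketch).
permutation = sorted(range(len(codes)), key=lambda i: codes[i])

# segments: starting offset of each category's group in the permuted order.
counts = [0] * len(categories)
for c in codes:
    counts[c] += 1
segments = []
offset = 0
for n in counts:
    segments.append(offset)
    offset += n

# Applying the permutation groups the values by category.
grouped = [categories[codes[i]] for i in permutation]

print(permutation)  # [1, 4, 3, 0, 2, 5]
print(segments)     # [0, 2, 3]
print(grouped)      # ['blue', 'blue', 'green', 'red', 'red', 'red']
```

Here group k occupies positions segments[k] up to (but not including) the next offset in the permuted order, which is why passing both arrays can spare later grouping work.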

Operations

Arkouda Categorical objects support all operations that Strings support, and they will almost always execute faster:

Indexing with integer, slice, integer pdarray, and boolean pdarray (see Indexing and Assignment)
Comparison (== and !=) with string literal or other Categorical object of same size
Substring search
Categorical.contains(substr, regex=False)
Check whether each element contains the given substring.
Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for

regex (bool) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray

Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars

ValueError – Raised if substr is not a valid regex

RuntimeError – Raised if there is a server-side error thrown
See also

Categorical.startswith, Categorical.endswith

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
Categorical.startswith(substr, regex=False)
Check whether each element starts with the given substring.
Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for

regex (bool) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray

Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars

ValueError – Raised if substr is not a valid regex

RuntimeError – Raised if there is a server-side error thrown
See also

Categorical.contains, Categorical.endswith

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
Categorical.endswith(substr, regex=False)
Check whether each element ends with the given substring.
Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for

regex (bool) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported).
Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray

Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars

ValueError – Raised if substr is not a valid regex

RuntimeError – Raised if there is a server-side error thrown
See also

Categorical.startswith, Categorical.contains

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
Array Set Operations, e.g. unique and in1d
Sorting, via argsort and coargsort
GroupBy, both alone and in conjunction with numeric arrays
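
The notes on contains, startswith, and endswith attribute the speedup to searching the unique category labels rather than the full array. A minimal plain-Python sketch of that idea follows. It is not Arkouda's implementation. The function name contains_via_labels is hypothetical, and mapping the N/A code (-1) to False is an assumption of this sketch.

```python
def contains_via_labels(categories, codes, substr):
    """Substring test done once per unique label, then spread over codes."""
    # One string search per category label, not per array element.
    label_hits = [substr in label for label in categories]
    # Look up each element's answer through its integer code.
    # A code of -1 (N/A) is mapped to False here; this is an assumption.
    return [c >= 0 and label_hits[c] for c in codes]


categories = ["apple pie", "banana", "apple tart"]
codes = [0, 1, 0, 2, 1, 1, 0]

result = contains_via_labels(categories, codes, "apple")
print(result)  # [True, False, True, True, False, False, True]
```

In this example only three string searches are performed for seven elements. The gap widens as the number of repeated values grows.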

Iteration

Iterating directly over a Categorical with for x in categorical is not supported. This discourages transferring all of the Categorical object's data from the Arkouda server to the Python client, since there is almost always a more array-oriented way to express an iterator-based computation. To force the transfer, use the to_ndarray function, which returns the Categorical as a numpy.ndarray. The transfer raises an error if it exceeds the byte limit defined in ak.client.maxTransferBytes.

arkouda.Categorical.to_ndarray(self)

Convert the array to a np.ndarray.

Convert the array to a np.ndarray, transferring array data from the Arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray of strings corresponding to the values in this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.