Categoricals
Categorical arrays are a concept from Pandas that speeds up many operations on strings, especially when an array of strings contains many repeated values. A Categorical object stores the unique strings as category labels and represents the values of the original array as integer indices into this category array.
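The encoding described above can be illustrated without a running Arkouda server. The following is a minimal sketch in plain NumPy, not Arkouda's implementation; the variable names `values`, `categories`, and `codes` are chosen here for illustration only.

```python
import numpy as np

# An array of strings with many repeated values.
values = np.array(["red", "blue", "red", "green", "blue", "red"])

# np.unique returns the sorted unique labels; with return_inverse=True it
# also returns, for each element, the index of its label in that array.
categories, codes = np.unique(values, return_inverse=True)

print(categories)  # ['blue' 'green' 'red']
print(codes)       # [2 0 2 1 0 2]

# Indexing the labels with the integer codes recovers the original array.
reconstructed = categories[codes]
```

Each repeated string is stored once as a label, and the array itself holds only small integers, which is why comparisons and searches can work on the labels instead of on every element.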
Construction
The typical way to construct a Categorical is from a Strings object:
- class arkouda.Categorical(values, **kwargs)
Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.
- Parameters:
values (Strings, Categorical, pd.Categorical) – Values to convert to categories
NAvalue (str scalar) – The value to use to represent missing/null data
- permutation
The permutation that groups the values in the same order as categories
- shape
The sizes of each dimension of the array
- Type:
tuple
However, if one already has pre-computed unique categories and integer indices, the following constructor is useful:
- classmethod Categorical.from_codes(cls, codes, categories, permutation=None, segments=None, **kwargs)
Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.
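- Parameters:
codes (pdarray, int64) – Category indices of each value
categories (Strings) – Unique category labels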
- Returns:
The Categorical object created from the input parameters
- Return type:
Categorical
- Raises:
TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
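A hedged sketch of both construction routes follows. It assumes arkouda is installed and a server is reachable at `localhost:5555`; both the address and the helper name `build_categoricals` are assumptions made for this example. The import is placed inside the function so that defining it does not require a server.

```python
def build_categoricals(server="localhost", port=5555):
    """Build the same Categorical two ways and return both as NumPy arrays.

    Hypothetical helper for illustration; requires a running arkouda server.
    """
    import arkouda as ak  # imported lazily: needs arkouda and a live server

    ak.connect(server=server, port=port)
    try:
        # Route 1: the normal constructor, starting from a Strings object.
        strings = ak.array(["red", "blue", "red", "green"])
        cat = ak.Categorical(strings)

        # Route 2: pre-computed labels (Strings) and int64 codes (pdarray).
        categories = ak.array(["blue", "green", "red"])
        codes = ak.array([2, 0, 2, 1])
        cat_from_codes = ak.Categorical.from_codes(codes, categories)

        # Small arrays, so transferring them to the client is safe here.
        return cat.to_ndarray(), cat_from_codes.to_ndarray()
    finally:
        ak.disconnect()
```

The second route skips the work of finding unique values, so it only pays off when the codes and categories already exist, for example as the output of an earlier computation.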
Operations
Arkouda Categorical objects support all operations that Strings support, and they will almost always execute faster:
Indexing with integer, slice, integer pdarray, and boolean pdarray (see Indexing and Assignment)
Comparison (== and !=) with string literal or other Categorical object of same size
Substring search
- Categorical.contains(substr, regex=False)
Check whether each element contains the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- Categorical.startswith(substr, regex=False)
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- Categorical.endswith(substr, regex=False)
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
Array Set Operations, e.g. unique and in1d
Sorting, via argsort and coargsort
GroupBy, both alone and in conjunction with numeric arrays
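The Notes for contains, startswith, and endswith attribute their speed to searching the unique category labels rather than the full array. The idea can be sketched in plain NumPy as follows; this is an illustration of the technique under that description, not Arkouda's server-side code, and the variable names are chosen for this example.

```python
import numpy as np

values = np.array(["apple", "banana", "apple", "cherry", "banana", "apple"])

# Encode once: unique labels plus an integer code per element.
categories, codes = np.unique(values, return_inverse=True)

# Search each unique label exactly once (3 searches instead of 6).
label_hits = np.char.find(categories, "an") >= 0

# Broadcast the per-label answer back to every element through its code.
hits = label_hits[codes]

# The result matches a direct search over the full array.
direct = np.char.find(values, "an") >= 0

print(hits)  # [False  True False False  True False]
```

The string search cost scales with the number of distinct labels, and the final step is a cheap integer gather, so the benefit grows with the amount of repetition in the data.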
Iteration
Iterating directly over a Categorical with for x in categorical is not supported. This discourages transferring all of the Categorical object's data from the arkouda server to the Python client, since there is almost always a more array-oriented way to express an iterator-based computation. To force this transfer, use the to_ndarray function to return the categorical as a numpy.ndarray. The transfer raises an error if it exceeds the byte limit defined in ak.client.maxTransferBytes.
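To make the contrast between iterator-based and array-oriented code concrete, the sketch below counts occurrences of each label both ways. It uses plain NumPy on the client as a stand-in; in Arkouda the array-oriented version would typically be expressed with GroupBy on the server. The variable names are illustrative.

```python
import numpy as np

values = np.array(["x", "y", "x", "z", "x", "y"])
categories, codes = np.unique(values, return_inverse=True)

# Iterator-based: touches every element from Python, one at a time.
loop_counts = {}
for v in values:
    key = str(v)
    loop_counts[key] = loop_counts.get(key, 0) + 1

# Array-oriented: a single vectorized pass over the integer codes.
counts = np.bincount(codes, minlength=len(categories))
array_counts = dict(zip(categories.tolist(), counts.tolist()))

print(array_counts)  # {'x': 3, 'y': 2, 'z': 1}
```

Both versions give the same answer, but only the loop requires every value to be present in client memory.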
- arkouda.Categorical.to_ndarray(self)
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray of strings corresponding to the values in this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
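The size check described in these Notes can be pictured with a small stand-alone sketch. This is not Arkouda's implementation: `transfer_to_client` and `MAX_TRANSFER_BYTES` are hypothetical names, and the limit is set deliberately tiny so the failure case is easy to trigger.

```python
import numpy as np

# Hypothetical stand-in for ak.client.maxTransferBytes, kept tiny for the demo.
MAX_TRANSFER_BYTES = 64


def transfer_to_client(values, max_transfer_bytes=MAX_TRANSFER_BYTES):
    """Return a copy of `values`, refusing arrays larger than the limit."""
    nbytes = values.nbytes
    if nbytes > max_transfer_bytes:
        raise RuntimeError(
            f"Array of {nbytes} bytes exceeds the transfer limit "
            f"of {max_transfer_bytes} bytes"
        )
    return values.copy()


small = transfer_to_client(np.array(["a", "b", "a"]))  # 12 bytes: allowed

try:
    transfer_to_client(np.array(["a"] * 100))  # 400 bytes: refused
    too_large_rejected = False
except RuntimeError:
    too_large_rejected = True
```

Raising the limit trades this protection for convenience, which is why the Notes advise caution before increasing ak.client.maxTransferBytes.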