Categoricals

Categorical arrays are a concept from pandas that speeds up many operations on strings, especially when an array of strings contains many repeated values. A Categorical object stores the unique strings as category labels and represents the values of the original array as integer indices into this category array.
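The encoding described above can be illustrated with a minimal NumPy sketch. This is not Arkouda code and says nothing about how the server builds a Categorical; `np.unique` with `return_inverse=True` only stands in for the split into labels and codes, and the sample data is made up.

```python
import numpy as np

# Made-up sample data with many repeated values.
values = np.array(["red", "blue", "red", "green", "blue", "red"])

# categories: the sorted unique labels.
# codes: for each value, its integer index into categories.
categories, codes = np.unique(values, return_inverse=True)
codes = codes.astype(np.int64)

# Indexing the labels by the codes reproduces the original array.
decoded = categories[codes]
```

Each distinct string is stored once, and every element costs a single integer, which is where the memory savings for highly repetitive data come from.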

Construction

The typical way to construct a Categorical is from a Strings object:

class arkouda.Categorical(values, **kwargs)

Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.

Parameters:
  • values (Strings) – String values to convert to categories

  • NAvalue (str scalar) – The value to use to represent missing/null data

categories

The set of category labels (determined automatically)

Type:

Strings

codes

The category indices of the values or -1 for N/A

Type:

pdarray, int64

permutation

The permutation that groups the values in the same order as categories

Type:

pdarray, int64

segments

When values are grouped, the starting offset of each group

Type:

pdarray, int64

size

The number of items in the array

Type:

Union[int,np.int64]

nlevels

The number of distinct categories

Type:

Union[int,np.int64]

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

Union[int,np.int64]

shape

The sizes of each dimension of the array

Type:

tuple
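The relationship between codes, permutation, and segments can be sketched in plain NumPy. This is an illustration of the attribute descriptions above, not the Arkouda implementation; the codes are made up and assumed to be non-negative (no N/A entries).

```python
import numpy as np

# Made-up codes for an array with three categories (0, 1, 2).
codes = np.array([2, 0, 2, 1, 0, 2], dtype=np.int64)
nlevels = 3

# A stable argsort of the codes groups equal values together,
# in the same order as the categories.
permutation = np.argsort(codes, kind="stable")
grouped = codes[permutation]

# segments[i] is the offset in the grouped array where category i starts.
counts = np.bincount(codes, minlength=nlevels)
segments = np.concatenate(([0], np.cumsum(counts)[:-1])).astype(np.int64)
```

Here `grouped` is `[0, 0, 1, 2, 2, 2]` and `segments` is `[0, 2, 3]`, so category `i` occupies the slice of `grouped` that begins at `segments[i]`.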

If the unique categories and integer indices have already been computed, the following constructor avoids repeating that work:

classmethod Categorical.from_codes(cls, codes, categories, permutation=None, segments=None, **kwargs)

Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.

Parameters:
  • codes (pdarray, int64) – Category indices of each value

  • categories (Strings) – Unique category labels

  • permutation (pdarray, int64) – The permutation that groups the values in the same order as categories

  • segments (pdarray, int64) – When values are grouped, the starting offset of each group

Returns:

The Categorical object created from the input parameters

Return type:

Categorical

Raises:

TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
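As a rough illustration of what building from precomputed parts involves, the NumPy sketch below validates its inputs and decodes the values. The helper `decode_from_codes` and its `na_label` parameter are hypothetical names introduced here, and mapping a code of -1 to an "N/A" label is an assumption based on the description of codes above; none of this is the Arkouda implementation.

```python
import numpy as np

def decode_from_codes(codes, categories, na_label="N/A"):
    """Hypothetical helper: rebuild string values from precomputed codes and labels."""
    if not (isinstance(codes, np.ndarray) and codes.dtype == np.int64):
        raise TypeError("codes must be an int64 array")
    if not (isinstance(categories, np.ndarray) and categories.dtype.kind == "U"):
        raise TypeError("categories must be an array of strings")
    # Append the N/A label so that a code of -1 can be redirected to it.
    labels = np.append(categories, na_label)
    safe_codes = np.where(codes < 0, len(categories), codes)
    return labels[safe_codes]

categories = np.array(["blue", "green", "red"])
codes = np.array([2, 0, -1, 1], dtype=np.int64)
values = decode_from_codes(codes, categories)
```

No uniqueness computation happens here: the labels and codes are taken as given, which is why precomputed inputs save time.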

Operations

Arkouda Categorical objects support all operations that Strings objects support, and these operations almost always execute faster on a Categorical:

  • Indexing with integer, slice, integer pdarray, and boolean pdarray (see Indexing and Assignment)

  • Comparison (== and !=) with string literal or other Categorical object of same size

  • Substring search

    Categorical.contains(substr, regex=False)

    Check whether each element contains the given substring.

    Parameters:
    • substr (Union[bytes, str_scalars]) – The substring to search for

    • regex (bool) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported)

    Returns:

    True for elements that contain substr, False otherwise

    Return type:

    pdarray, bool

    Raises:
    • TypeError – Raised if the substr parameter is not bytes or str_scalars

    • ValueError – Raised if substr is not a valid regex

    • RuntimeError – Raised if a server-side error is thrown

    Notes

    This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

    Categorical.startswith(substr, regex=False)

    Check whether each element starts with the given substring.

    Parameters:
    • substr (Union[bytes, str_scalars]) – The substring to search for

    • regex (bool) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported)

    Returns:

    True for elements that start with substr, False otherwise

    Return type:

    pdarray, bool

    Raises:
    • TypeError – Raised if the substr parameter is not bytes or str_scalars

    • ValueError – Raised if substr is not a valid regex

    • RuntimeError – Raised if a server-side error is thrown

    Notes

    This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

    Categorical.endswith(substr, regex=False)

    Check whether each element ends with the given substring.

    Parameters:
    • substr (Union[bytes, str_scalars]) – The substring to search for

    • regex (bool) – Indicates whether substr is a regular expression. Note: only regular expressions supported by re2 are handled (lookaheads and lookbehinds are not supported)

    Returns:

    True for elements that end with substr, False otherwise

    Return type:

    pdarray, bool

    Raises:
    • TypeError – Raised if the substr parameter is not bytes or str_scalars

    • ValueError – Raised if substr is not a valid regex

    • RuntimeError – Raised if a server-side error is thrown

    Notes

    This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

  • Array Set Operations, e.g. unique and in1d

  • Sorting, via argsort and coargsort

  • GroupBy, both alone and in conjunction with numeric arrays

Iteration

Iterating directly over a Categorical with for x in categorical is not supported. This discourages transferring all of the Categorical object’s data from the arkouda server to the Python client, since there is almost always a more array-oriented way to express an iterator-based computation. To force this transfer, use the to_ndarray function to return the Categorical as a numpy.ndarray. The transfer raises an error if it exceeds the byte limit defined in ak.client.maxTransferBytes.

arkouda.Categorical.to_ndarray(self)

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray of strings corresponding to the values in this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes; otherwise, a RuntimeError is raised. This protects the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but should proceed with caution.
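The size guard described in these notes can be sketched in plain NumPy. The helper `to_ndarray_sketch` and the constant `MAX_TRANSFER_BYTES` are hypothetical stand-ins for illustration, and the way the byte count is estimated here is an assumption, not how Arkouda measures the transfer.

```python
import numpy as np

MAX_TRANSFER_BYTES = 64  # hypothetical stand-in for ak.client.maxTransferBytes

def to_ndarray_sketch(codes, categories, max_bytes=MAX_TRANSFER_BYTES):
    """Hypothetical helper: decode to plain strings, refusing oversized results."""
    # Bytes the decoded string array would occupy on the client (assumed estimate).
    needed = codes.size * categories.dtype.itemsize
    if needed > max_bytes:
        raise RuntimeError(f"array of {needed} bytes exceeds limit of {max_bytes} bytes")
    # Decoding drops the category structure and yields ordinary strings.
    return categories[codes]

categories = np.array(["no", "yes"])
codes = np.array([1, 0, 1, 1], dtype=np.int64)
small = to_ndarray_sketch(codes, categories)
```

Checking the size before decoding means an oversized request fails fast, without allocating the result on the client.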