arkouda.categorical¶
Classes¶
Categorical – Represents an array of values belonging to named categories.
Module Contents¶
- class arkouda.categorical.Categorical(values, **kwargs)[source]¶
Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.
- Parameters:
values (Strings, Categorical, pd.Categorical) – Values to convert to categories
NAvalue (str scalar) – The value to use to represent missing/null data
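The core idea behind Categorical can be illustrated in pure Python (this is a sketch of the concept, not arkouda's server-side implementation): store each distinct label once in a categories array, and represent the data as integer codes into it, so repeated values cost one integer each instead of one string each.

```python
# Pure-Python sketch (not arkouda's implementation) of factorizing strings
# into (codes, categories), the representation a Categorical is built on.
def factorize(values):
    """Return (codes, categories) such that categories[codes[i]] == values[i]."""
    categories = []
    index = {}          # label -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(categories)
            categories.append(v)
        codes.append(index[v])
    return codes, categories

codes, categories = factorize(["lo", "hi", "lo", "lo", "hi"])
print(codes)       # [0, 1, 0, 0, 1]
print(categories)  # ['lo', 'hi']
```

With many repeated values, the categories list stays small while the codes array carries the bulk of the data, which is where the memory savings come from.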
- permutation¶
The permutation that groups the values in the same order as categories
- shape¶
The sizes of each dimension of the array
- Type:
tuple
- BinOps¶
- RegisterablePieces¶
- RequiredPieces¶
- static attach(user_defined_name: str) Categorical [source]¶
DEPRECATED Function to return a Categorical object attached to the registered name in the arkouda server which was registered using register()
- Parameters:
user_defined_name (str) – user defined name which Categorical object was registered under
- Returns:
The Categorical object created by re-attaching to the corresponding server components
- Return type:
- Raises:
TypeError – if user_defined_name is not a string
- concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical [source]¶
Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.
- Parameters:
others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one
ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.
- Returns:
The merged Categorical object
- Return type:
- Raises:
TypeError – Raised if any others array objects are not Categorical objects
Notes
This operation can be expensive – slower than concatenating Strings.
- contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray [source]¶
Check whether each element contains the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
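The speedup described above comes from running the substring test once per unique category rather than once per element. A pure-Python sketch of that idea (an illustration, not arkouda's implementation):

```python
# Sketch: run the (possibly expensive) substring test once per unique
# category, then fan the per-category result out to elements via codes.
def contains_by_category(codes, categories, substr):
    hits = [substr in cat for cat in categories]   # one test per category
    return [hits[c] for c in codes]                # one lookup per element

codes = [0, 1, 0, 2]
categories = ["apple", "banana", "cherry"]
print(contains_by_category(codes, categories, "an"))  # [False, True, False, False]
```

If there are k unique categories and n elements with k much smaller than n, the string work drops from O(n) tests to O(k).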
- dtype¶
- endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray [source]¶
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- equals(other) arkouda.numpy.dtypes.bool_scalars [source]¶
Whether Categoricals are the same size and all entries are equal.
- Parameters:
other (object) – object to compare.
- Returns:
True if the Categoricals are the same, o.w. False.
- Return type:
bool
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> c = Categorical(ak.array(["a", "b", "c"]))
>>> c_cpy = Categorical(ak.array(["a", "b", "c"]))
>>> c.equals(c_cpy)
True
>>> c2 = Categorical(ak.array(["a", "x", "c"]))
>>> c.equals(c2)
False
- classmethod from_codes(codes: arkouda.pdarrayclass.pdarray, categories: arkouda.strings.Strings, permutation=None, segments=None, **kwargs) Categorical [source]¶
Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.
- Parameters:
- Returns:
The Categorical object created from the input parameters
- Return type:
- Raises:
TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object
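The invariant from_codes relies on can be shown with a short pure-Python illustration (hypothetical values; not arkouda code): element i of the logical array is categories[codes[i]].

```python
# Illustration of the codes/categories contract assumed by from_codes:
# decoding is a simple indexed lookup.
codes = [2, 0, 2, 1]
categories = ["red", "green", "blue"]
values = [categories[c] for c in codes]
print(values)  # ['blue', 'red', 'blue', 'green']
```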
- classmethod from_return_msg(rep_msg) Categorical [source]¶
Create a Categorical from a server return message.
Notes
This is currently only used when reading a Categorical from HDF5 files.
- group() arkouda.pdarrayclass.pdarray [source]¶
Return the permutation that groups the array, placing equivalent categories together. All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
See also
GroupBy, unique
Notes
This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.
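The note above can be made concrete with a pure-Python sketch (not the arkouda implementation): a grouping permutation over a Categorical is essentially an argsort of small integer codes, which is cheaper than sorting 128-bit hashes of full strings.

```python
# Sketch: a grouping permutation is an argsort of the integer codes.
# A stable sort puts equal codes into one contiguous block.
codes = [2, 0, 2, 1, 0]
perm = sorted(range(len(codes)), key=lambda i: codes[i])
print(perm)                        # [1, 4, 3, 0, 2]
print([codes[i] for i in perm])    # [0, 0, 1, 2, 2] -- equal codes contiguous
```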
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray] [source]¶
Compute a 128-bit hash of each element of the Categorical.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
- in1d(test: arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray [source]¶
Test whether each element of the Categorical object is also present in the test Strings or Categorical object.
Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.
- Parameters:
test (Union[Strings, Categorical]) – The values against which to test each value of self.
- Returns:
The values self[in1d] are in the test Strings or Categorical object.
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if test is not a Strings or Categorical object
See also
unique, intersect1d, union1d
Notes
in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences. in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.
Examples
>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True True True True True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False False False False False])
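For a Categorical, membership only needs to be decided once per unique category and then broadcast to every element through its code. A pure-Python sketch of that strategy (an illustration, not arkouda's implementation):

```python
# Sketch of category-level in1d: test membership per unique category,
# then fan the per-category result out to elements via their codes.
def in1d_sketch(codes, categories, test_values):
    test = set(test_values)
    member = [cat in test for cat in categories]
    return [member[c] for c in codes]

codes = [0, 1, 1, 2]
categories = ["a", "b", "c"]
print(in1d_sketch(codes, categories, ["b", "z"]))  # [False, True, True, False]
```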
- property inferred_type: str¶
Return a string of the type inferred from the values.
- info() str [source]¶
Returns a JSON formatted string containing information about all components of self
- Parameters:
None
- Returns:
JSON string containing information about all components of self
- Return type:
str
- is_registered() numpy.bool_ [source]¶
Return True iff the object is contained in the registry or is a component of a registered object.
- Returns:
Indicates if the object is contained in the registry
- Return type:
numpy.bool_
- Raises:
RegistrationError – Raised if there’s a server-side error or a mis-match of registered components
See also
register, attach, unregister, unregister_categorical_by_name
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- logger¶
- property nbytes¶
The size of the Categorical in bytes.
- Returns:
The size of the Categorical in bytes.
- Return type:
int
- ndim¶
- nlevels¶
- objType = 'Categorical'¶
- static parse_hdf_categoricals(d: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings]) Tuple[List[str], Dict[str, Categorical]] [source]¶
This function should be used in conjunction with the load_all function which reads hdf5 files and reconstitutes Categorical objects. Categorical objects use a naming convention and HDF5 structure so they can be identified and constructed for the user.
In general, you should not call this method directly.
- Parameters:
d (Dictionary of String to either Pdarray or Strings object)
- Returns:
2-tuple of a list of key names that should be removed and a dictionary mapping base name to Categorical object
See also
Categorical.save, load_all
- permutation: arkouda.pdarrayclass.pdarray | None = None¶
- pretty_print_info() None [source]¶
Prints information about all components of self in a human readable format
- Parameters:
None
- Return type:
None
- register(user_defined_name: str) Categorical [source]¶
Register this Categorical object and underlying components with the Arkouda server
- Parameters:
user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components
- Returns:
The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Categorical with the user_defined_name
See also
unregister, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- registered_name: str | None = None¶
- reset_categories() Categorical [source]¶
Recompute the category labels, discarding any unused labels. This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.
- Returns:
A Categorical object generated from the current instance
- Return type:
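The recomputation described above can be sketched in pure Python (an illustration of the idea, not arkouda's implementation): keep only the labels whose codes still occur, and renumber the codes densely.

```python
# Sketch of reset_categories: drop unused labels and renumber codes.
def reset_categories_sketch(codes, categories):
    used = sorted(set(codes))                        # codes still present
    remap = {old: new for new, old in enumerate(used)}
    new_categories = [categories[old] for old in used]
    new_codes = [remap[c] for c in codes]
    return new_codes, new_categories

# A slice that only kept categories 0 and 2 of the original three:
print(reset_categories_sketch([2, 0, 2], ["a", "b", "c"]))  # ([1, 0, 1], ['a', 'c'])
```

Subsequent per-category work (like contains or group) then touches only the categories that actually remain.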
- save(prefix_path: str, dataset: str = 'categorical_array', file_format: str = 'HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) str [source]¶
DEPRECATED Save the Categorical object to HDF5 or Parquet. The result is a collection of HDF5/Parquet files, one file per locale of the arkouda server, where each filename starts with prefix_path and dataset. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)
file_format (str {'HDF5' | 'Parquet'}) – The format to save the file to
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If 'append', create a new Categorical dataset within existing files.
file_type (str ("single" | "distribute")) – Default: "distribute". When set to single, dataset is written to a single file. When distribute, dataset is written to a file per locale. This is only supported by HDF5 files and will have no impact on Parquet files.
compression (str (Optional)) – {None | 'snappy' | 'gzip' | 'brotli' | 'zstd' | 'lz4'} The compression type to use when writing. This is only supported for Parquet files and will not be used with HDF5.
- Return type:
String message indicating result of save operation
- Raises:
ValueError – Raised if the lengths of columns and values differ, or the mode is neither ‘truncate’ nor ‘append’
TypeError – Raised if prefix_path, dataset, or mode is not a str
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter.
See also
- segments = None¶
- set_categories(new_categories, NAvalue=None)[source]¶
Set categories to user-defined values.
- Parameters:
new_categories (Strings) – The array of new categories to use. Must be unique.
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.
- Return type:
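The remapping rule stated above can be sketched in pure Python (an illustration, not arkouda code; the "N/A" default is an assumption for the example): values present in the new category list are kept, anything else becomes the NA value.

```python
# Sketch of set_categories: keep values found in the new category list,
# map everything else to the NA value ("N/A" here, an assumed default).
def set_categories_sketch(values, new_categories, na_value="N/A"):
    allowed = set(new_categories)
    return [v if v in allowed else na_value for v in values]

print(set_categories_sketch(["a", "b", "x"], ["a", "b"]))  # ['a', 'b', 'N/A']
```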
- shape¶
- classmethod standardize_categories(arrays, NAvalue='N/A')[source]¶
Standardize an array of Categoricals so that they share the same categories.
- Parameters:
arrays (sequence of Categoricals) – The Categoricals to standardize
NAvalue (str scalar) – The value to use to represent missing/null data
- Returns:
A list of the original Categoricals remapped to the shared categories.
- Return type:
List of Categoricals
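Standardizing can be pictured as building one shared label list and re-encoding every array against it, so that equal codes mean equal labels across arrays. A pure-Python sketch of that idea (not arkouda's implementation, and ignoring NA handling):

```python
# Sketch of standardize_categories: build a shared category list, then
# re-encode each array so codes are comparable across arrays.
def standardize_sketch(arrays):
    shared = sorted({v for arr in arrays for v in arr})
    code_of = {v: i for i, v in enumerate(shared)}
    return shared, [[code_of[v] for v in arr] for arr in arrays]

shared, coded = standardize_sketch([["a", "b"], ["b", "c"]])
print(shared)  # ['a', 'b', 'c']
print(coded)   # [[0, 1], [1, 2]]
```

Once arrays share categories, operations like concatenate or comparisons can work directly on the integer codes.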
- startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray [source]¶
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The substring to search for
regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Notes
This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.
- to_hdf(prefix_path, dataset='categorical_array', mode='truncate', file_type='distribute')[source]¶
Save the Categorical to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale.
- Return type:
None
See also
load
- to_list() List [source]¶
Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list of strings corresponding to the values in this Categorical
- Return type:
list
Notes
The number of bytes in the Categorical cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- to_ndarray() numpy.ndarray [source]¶
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray of strings corresponding to the values in this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
- to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str [source]¶
This functionality is currently not supported and will raise a RuntimeError; support is in development. Save the Categorical to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Categorical dataset within existing files.
compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised due to compatibility issues of Categorical with Parquet.
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type='distribute'.
'append' write mode is supported, but is not efficient.
If any of the output files already exist and the mode is 'truncate', they will be overwritten. If the mode is 'append' and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
See also
- to_strings() List [source]¶
Convert the Categorical to Strings.
- Returns:
A Strings object corresponding to the values in this Categorical.
- Return type:
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array(["a","b","c"])
>>> a
array(['a', 'b', 'c'])
>>> c = ak.Categorical(a)
>>> c.to_strings()
array(['a', 'b', 'c'])
>>> isinstance(c.to_strings(), ak.Strings) True
- transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]¶
Sends a Categorical object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports in succession, using ports in the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, if port 1234 is passed, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
- unique() Categorical [source]¶
- unregister() None [source]¶
Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Raises:
RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister
See also
register, attach, unregister_categorical_by_name, is_registered
Notes
Objects registered with the server are immune to deletion until they are unregistered.
- static unregister_categorical_by_name(user_defined_name: str) None [source]¶
Function to unregister a Categorical object by name which was registered with the arkouda server via register()
- Parameters:
user_defined_name (str) – Name under which the Categorical object was registered
- Raises:
TypeError – if user_defined_name is not a string
RegistrationError – if there is an issue attempting to unregister any underlying components
See also
- update_hdf(prefix_path, dataset='categorical_array', repack=True)[source]¶
Overwrite the dataset with the name provided with this Categorical object. If the dataset does not exist it is added.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.
- Return type:
None
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Categorical
Notes
If the file does not contain a File_Format attribute indicating how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.