arkouda.strings¶
Classes¶
Strings – Represents an array of strings whose data resides on the arkouda server.
Module Contents¶
- class arkouda.strings.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]¶
Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.
- entry¶
Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of
offsets array: starting indices for each string
bytes array: raw bytes of all strings joined by nulls
- Type:
- size¶
The number of strings in the array
- Type:
- nbytes¶
The total number of bytes in all strings
- Type:
- ndim¶
The rank of the array (currently only rank 1 arrays supported)
- Type:
- shape¶
The sizes of each dimension of the array
- Type:
tuple
- dtype¶
The dtype is ak.str
- Type:
dtype
- logger¶
Used for all logging operations
- Type:
ArkoudaLogger
Notes
Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.
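The two-array layout described in these notes can be illustrated in plain Python. This is only a client-side sketch of the data layout, not arkouda's server-side implementation; the helper names `to_segments` and `segment_lengths` are hypothetical.

```python
from typing import List, Tuple


def to_segments(strings: List[str]) -> Tuple[List[int], List[int]]:
    """Pack strings into (offsets, bytes): UTF-8 bytes, each string null-terminated."""
    offsets: List[int] = []
    raw: List[int] = []
    for s in strings:
        offsets.append(len(raw))       # starting index of this string
        raw.extend(s.encode("utf-8"))  # raw bytes of the string
        raw.append(0)                  # null delimiter
    return offsets, raw


def segment_lengths(offsets: List[int], nbytes: int) -> List[int]:
    """Recover each string's byte length (excluding the null) from the offsets."""
    ends = offsets[1:] + [nbytes]
    return [end - start - 1 for start, end in zip(offsets, ends)]


offsets, raw = to_segments(["one", "two", "three"])
print(offsets)  # [0, 4, 8]
print(raw)      # [111, 110, 101, 0, 116, 119, 111, 0, 116, 104, 114, 101, 101, 0]
print(segment_lengths(offsets, len(raw)))  # [3, 3, 5]
```

The printed values agree with the get_offsets() and get_bytes() examples documented for this class.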
- BinOps¶
- astype(dtype) arkouda.pdarrayclass.pdarray [source]¶
Cast values of Strings object to provided dtype
- Parameters:
dtype (np.dtype or str) – Dtype to cast to
- Returns:
An arkouda pdarray with values converted to the specified data type
- Return type:
ak.pdarray
Notes
This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.
- static attach(user_defined_name: str) Strings [source]¶
Static method that returns the Strings object attached to a registered name in the arkouda server, i.e. a name previously registered using register()
- Parameters:
user_defined_name (str) – user defined name which the Strings object was registered under
- Returns:
the Strings object registered with user_defined_name in the arkouda server
- Return type:
Strings object
- Raises:
TypeError – Raised if user_defined_name is not a str
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- cached_regex_patterns() List [source]¶
Returns the regex patterns for which Match objects have been cached
- capitalize() Strings [source]¶
Returns a new Strings in which each string has its first letter capitalized and the remaining letters lowercase.
- Returns:
Strings from the original replaced with the capitalized equivalent.
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown.
See also
Strings.lower, Strings.upper, Strings.title
Examples
>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',
... 'StrINgS aRe Here 4'])
>>> strings.capitalize()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3',
... 'Strings are here 4'])
- contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray [source]¶
Check whether each element contains the given substring.
- Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains(r'string \d', regex=True)
array([True, True, True, True, True])
- decode(fromEncoding, toEncoding='UTF-8')[source]¶
Return a new Strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
fromEncoding (str) – The current encoding of the strings object
toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- dtype¶
- encode(toEncoding: str, fromEncoding: str = 'UTF-8')[source]¶
Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding
- Parameters:
toEncoding (str) – The encoding that the strings will be converted to
fromEncoding (str) – The current encoding of the strings object, default to UTF-8
- Returns:
A new Strings object in toEncoding
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray [source]¶
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The suffix to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith(r'ing \d', regex=True)
array([True, True, True, True, True])
- equals(other) bool [source]¶
Whether Strings are the same size and all entries are equal.
- Parameters:
other (object) – object to compare.
- Returns:
True if the Strings are the same, otherwise False.
- Return type:
bool
Examples
>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
- find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray] [source]¶
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches
- Parameters:
pattern (str_scalars) – The regex pattern used to find matches
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positions of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations(r'\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
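The three return values can be mimicked with Python's standard `re` module. This is a sketch only: `find_locations_sketch` is a hypothetical helper, Python's `re` is not re2, and positions here are counted in characters, which coincide with byte positions only for ASCII input.

```python
import re
from typing import List, Tuple


def find_locations_sketch(
    strings: List[str], pattern: str
) -> Tuple[List[int], List[int], List[int]]:
    """Return (matches per string, flat match starts, flat match lengths)."""
    num_matches: List[int] = []
    starts: List[int] = []
    lens: List[int] = []
    for s in strings:
        count = 0
        for m in re.finditer(pattern, s):  # non-overlapping matches, left to right
            starts.append(m.start())
            lens.append(m.end() - m.start())
            count += 1
        num_matches.append(count)
    return num_matches, starts, lens


strings = [f"{i} string {i}" for i in range(1, 6)]
num_matches, starts, lens = find_locations_sketch(strings, r"\d")
print(num_matches)  # [2, 2, 2, 2, 2]
print(starts)       # [0, 9, 0, 9, 0, 9, 0, 9, 0, 9]
print(lens)         # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

The starts and lengths are flat arrays; num_matches tells how many consecutive entries belong to each original string.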
- findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple [source]¶
Return a new Strings containing all non-overlapping matches of pattern
- Parameters:
pattern (str_scalars) – Regex used to find matches
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
- flatten(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple [source]¶
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring in return array.
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
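To show how the segments array relates the flat output back to the original strings, here is a minimal pure-Python sketch. The helper `flatten_sketch` is hypothetical and uses Python's `re` rather than re2 when `regex=True`.

```python
import re
from typing import List, Tuple


def flatten_sketch(
    strings: List[str], delimiter: str, regex: bool = False
) -> Tuple[List[str], List[int]]:
    """Split every string and record where each string's pieces begin."""
    flat: List[str] = []
    segments: List[int] = []
    for s in strings:
        segments.append(len(flat))  # index of this string's first substring
        parts = re.split(delimiter, s) if regex else s.split(delimiter)
        flat.extend(parts)
    return flat, segments


orig = ["one|two", "three|four|five", "six"]
flat, segs = flatten_sketch(orig, "|")
print(flat)  # ['one', 'two', 'three', 'four', 'five', 'six']
print(segs)  # [0, 2, 5]

# The segments let us regroup the flat array by original string.
bounds = segs + [len(flat)]
groups = [flat[a:b] for a, b in zip(bounds, bounds[1:])]
print(groups)  # [['one', 'two'], ['three', 'four', 'five'], ['six']]
```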
- static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings [source]¶
Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.
- Parameters:
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble them into a composite entity.
- static from_return_msg(rep_msg: str) Strings [source]¶
Factory method for creating a Strings object from an Arkouda server response message
- Parameters:
rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234
- Returns:
object representing a segmented strings array on the server
- Return type:
- Raises:
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.
- fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match [source]¶
Returns a match object where elements match only if the whole string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
- get_bytes()[source]¶
Getter for the bytes component (uint8 pdarray) of this Strings.
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
- get_lengths() arkouda.pdarrayclass.pdarray [source]¶
Return the length of each string in the array.
- Returns:
The length of each string
- Return type:
pdarray, int
- Raises:
RuntimeError – Raised if there is a server-side error thrown
- get_offsets()[source]¶
Getter for the offsets component (int64 pdarray) of this Strings.
Example
>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
- get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray] [source]¶
Return the n-long prefix of each string, where possible
- Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.
- Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.
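The interaction between n and proper can be sketched in plain Python. The helper `get_prefixes_sketch` is hypothetical and only mirrors the behavior described above on a Python list; it measures length in characters.

```python
from typing import List, Tuple


def get_prefixes_sketch(
    strings: List[str], n: int, proper: bool = True
) -> Tuple[List[str], List[bool]]:
    """Return the n-long prefixes plus a mask of which strings were long enough."""
    min_len = n + 1 if proper else n  # proper prefixes must leave something behind
    mask = [len(s) >= min_len for s in strings]
    prefixes = [s[:n] for s, keep in zip(strings, mask) if keep]
    return prefixes, mask


words = ["a", "ab", "abc", "abcd"]
print(get_prefixes_sketch(words, 2))
# (['ab', 'ab'], [False, False, True, True])
print(get_prefixes_sketch(words, 2, proper=False))
# (['ab', 'ab', 'ab'], [False, True, True, True])
```

The number of returned prefixes always equals the number of True values in the mask, so the mask can be used to index the original array and line the prefixes up with their sources.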
- get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray] [source]¶
Return the n-long suffix of each string, where possible
- Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.
- Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.
- group() arkouda.pdarrayclass.pdarray [source]¶
Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.
- Returns:
The permutation that groups the array by value
- Return type:
See also
GroupBy, unique
Notes
If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.
- Raises:
RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message
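The difference between a grouping permutation and a sorting permutation can be seen in a small pure-Python sketch. The function name is hypothetical, and a Python dict stands in for the server's hash-based grouping; arkouda's actual block order may differ.

```python
from typing import Dict, List


def group_permutation_sketch(strings: List[str]) -> List[int]:
    """Return a permutation that places equal strings in contiguous blocks."""
    buckets: Dict[str, List[int]] = {}
    for i, s in enumerate(strings):
        buckets.setdefault(s, []).append(i)  # hashing gathers equal strings
    # Blocks appear in first-seen order, which is not sorted order.
    return [i for idxs in buckets.values() for i in idxs]


data = ["b", "a", "b", "c", "a"]
perm = group_permutation_sketch(data)
print(perm)                     # [0, 2, 1, 4, 3]
print([data[i] for i in perm])  # ['b', 'b', 'a', 'a', 'c']

# A full sort also groups, and additionally orders the blocks.
sort_perm = sorted(range(len(data)), key=data.__getitem__)
print([data[i] for i in sort_perm])  # ['a', 'a', 'b', 'b', 'c']
```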
- hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray] [source]¶
Compute a 128-bit hash of each string.
- Returns:
A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.
- Return type:
Notes
The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.
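The claim that collisions are negligible can be checked with the standard birthday (union) bound. The sketch below assumes an idealized hash whose 128-bit outputs are uniformly random, which is an assumption about SipHash128 rather than a measured property; `collision_bound` is a hypothetical helper.

```python
from fractions import Fraction


def collision_bound(n: int, bits: int = 128) -> Fraction:
    """Upper bound on P(any collision): n*(n-1)/2 pairs, each colliding w.p. 2**-bits."""
    return Fraction(n * (n - 1), 2 * 2**bits)


p = collision_bound(10**15)
print(float(p))  # roughly 1.5e-09
```

Under that assumption, even 10**15 distinct strings collide with probability on the order of one in a billion.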
- property inferred_type: str¶
Return a string of the type inferred from the values.
- info() str [source]¶
Returns a JSON formatted string containing information about all components of self
- Parameters:
None
- Returns:
JSON string containing information about all components of self
- Return type:
str
- is_registered() numpy.bool_ [source]¶
Return True iff the object is contained in the registry
- Parameters:
None
- Returns:
Indicates if the object is contained in the registry
- Return type:
bool
- Raises:
RuntimeError – Raised if there’s a server-side error thrown
- isalnum() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.
- Returns:
True for elements that are alphanumeric, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
- isalpha() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.
- Returns:
True for elements that are alphabetic, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.islower, Strings.isupper, Strings.istitle, Strings.isalnum
Examples
>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA', 'StringB', 'StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA', 'StringB', 'StringC'])
>>> strings.isalpha()
array([False False False True True True])
- isdecimal() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.
- Returns:
True for elements that are decimals, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
- isdigit() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.
- Returns:
True for elements that are digits, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
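The isdecimal/isdigit distinction in the special-character examples matches the one made by Python's built-in `str.isdecimal` and `str.isdigit`, so it can be explored locally without a server. This comparison is an illustration only and does not claim that arkouda calls the Python methods.

```python
special = ["3.14", "0", "²", "2³₇", "2³x₇"]

# Superscript and subscript digits count as digits but not as decimal characters.
decimal_flags = [s.isdecimal() for s in special]
digit_flags = [s.isdigit() for s in special]

print(decimal_flags)  # [False, True, False, False, False]
print(digit_flags)    # [False, True, True, True, False]
```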
- isempty() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.
- Returns:
True for elements that are the empty string, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '', '', ''])
>>> strings.isempty()
array([False False False True True True])
- islower() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase
- Returns:
True for elements that are entirely lowercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
- isspace() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ’, ‘\t’, ‘\n’, ‘\v’, ‘\f’, ‘\r’).
- Returns:
True for elements that are whitespace, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Strings.islower, Strings.isupper, Strings.istitle
Examples
>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '\t', '\n', '\v', '\f', '\r', ' \t\n\v\f\r'])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ',
... '\u0009', '\n', '\u000B', '\u000C', '\u000D', ' \u0009\n\u000B\u000C\u000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
- istitle() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase
- Returns:
True for elements that are titlecase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
- isupper() arkouda.pdarrayclass.pdarray [source]¶
Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase
- Returns:
True for elements that are entirely uppercase, False otherwise
- Return type:
pdarray, bool
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
- logger¶
- lower() Strings [source]¶
Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Returns:
Strings with all uppercase characters from the original replaced with their lowercase equivalent
- Return type:
- Raises:
RuntimeError – Raised if there is a server-side error thrown
See also
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
- lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings [source]¶
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
- match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match [source]¶
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- objType = 'Strings'¶
- peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple [source]¶
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. In the examples for this method, the default (False) omits the delimiter from both return arrays.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The field(s) peeled from the beginning of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
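The left-to-right peel behavior shown in the examples can be reproduced with a short pure-Python sketch. The function `peel_sketch` and its snake_case parameter names are hypothetical; it handles literal delimiters only and covers neither fromRight nor regex.

```python
from typing import List, Tuple


def peel_sketch(
    strings: List[str],
    delimiter: str,
    times: int = 1,
    include_delimiter: bool = False,
    keep_partial: bool = False,
) -> Tuple[List[str], List[str]]:
    """Split each string at its times-th delimiter, returning (left, right) lists."""
    if times < 1:
        raise ValueError("times must be >= 1")
    left: List[str] = []
    right: List[str] = []
    for s in strings:
        pos, start = -1, 0
        for _ in range(times):
            pos = s.find(delimiter, start)
            if pos == -1:
                break
            start = pos + len(delimiter)  # continue searching after this delimiter
        if pos == -1:
            # Fewer than `times` delimiters: the whole string goes to one side.
            left.append(s if keep_partial else "")
            right.append("" if keep_partial else s)
        else:
            left.append(s[:pos] + (delimiter if include_delimiter else ""))
            right.append(s[pos + len(delimiter):])
    return left, right


s = ["a.b", "c.d", "e.f.g"]
print(peel_sketch(s, "."))           # (['a', 'c', 'e'], ['b', 'd', 'f.g'])
print(peel_sketch(s, ".", times=2))  # (['', '', 'e.f'], ['a.b', 'c.d', 'g'])
```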
- pretty_print_info() None [source]¶
Prints information about all components of self in a human readable format
- Parameters:
None
- Return type:
None
- register(user_defined_name: str) Strings [source]¶
Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach(). This is an in-place operation; registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.
- Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
- Returns:
The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification; the original is returned to support a fluent programming style. Please note you cannot register two different objects with the same name.
- Return type:
- Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – Raised if the server was unable to register the Strings object with the user_defined_name. If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.
See also
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- registered_name: str | None = None¶
- rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)[source]¶
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str [source]¶
DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 supports single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.
file_type (str ("single" | "distribute")) – Default: "distribute". "distribute" spreads the dataset over one file per locale; "single" saves the dataset to one file.
- Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.
- search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match [source]¶
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Match
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
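The per-element semantics of search can be illustrated without a running server. The following is a minimal client-side sketch using Python's re module, not arkouda itself; first_match_spans is a hypothetical helper name, and the server's regex engine may differ from Python's in unsupported constructs.

```python
import re


def first_match_spans(strings, pattern):
    """Return (matched, span) for each element, using the first match found anywhere in it.

    Hypothetical helper that only mimics the documented behavior of Strings.search.
    """
    compiled = re.compile(pattern)
    results = []
    for s in strings:
        m = compiled.search(s)  # search scans the whole string, not just the start
        results.append((m is not None, m.span() if m else None))
    return results


strings = ['1_2___', '____', '3', '__4___5____6___7', '']
spans = first_match_spans(strings, '_+')
for s, (matched, span) in zip(strings, spans):
    print(repr(s), matched, span)
```

Elements with no underscore ('3' and the empty string) report no match, which lines up with the matched=False entries in the example above.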
- split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple [source]¶
Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur
- Parameters:
pattern (str) – Regex used to split strings into substrings
maxsplit (int) – The max number of pattern match occurrences in each element to split. The default maxsplit=0 splits on all occurrences
return_segments (bool) – If True, return mapping of original strings to first substring in return array.
- Returns:
Strings – Substrings with pattern matches removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
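To see how the flattened substrings and the segments array relate, here is a small pure-Python sketch built on re.split. It is an illustration of the documented result shape, not the server implementation; split_with_segments is a hypothetical name.

```python
import re


def split_with_segments(strings, pattern, maxsplit=0):
    """Split every element on pattern and flatten the pieces into one list.

    segments[i] is the index in the flat list of the first piece that came
    from strings[i] (hypothetical helper mimicking return_segments=True).
    """
    flat = []
    segments = []
    for s in strings:
        segments.append(len(flat))  # where this element's pieces begin
        flat.extend(re.split(pattern, s, maxsplit=maxsplit))
    return flat, segments


strings = ['1_2___', '____', '3', '__4___5____6___7', '']
pieces, segments = split_with_segments(strings, '_+', maxsplit=2)
print(pieces)
print(segments)
```

With the segments in hand, the pieces belonging to the i-th original string are pieces[segments[i]:segments[i + 1]] (or to the end of the list for the last element).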
- startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray [source]¶
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The prefix to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith(r'\d str', regex=True)
array([True, True, True, True, True])
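The difference between the literal and regex modes can be sketched in plain Python. This is an assumption-laden emulation: starts_with is a hypothetical helper, and Python's re accepts constructs (such as lookaheads) that re2 on the server does not.

```python
import re


def starts_with(strings, substr, regex=False):
    """Return one bool per element: does it start with substr?

    With regex=True, substr is treated as a pattern anchored at the start
    of each element (hypothetical client-side emulation).
    """
    if regex:
        compiled = re.compile(substr)
        # re.match only succeeds when the pattern matches at position 0
        return [compiled.match(s) is not None for s in strings]
    return [s.startswith(substr) for s in strings]


literal = starts_with(['string 1', 'string 2', 'a string'], 'string')
pattern = starts_with(['1 string', '2 string', 'a string'], r'\d str', regex=True)
print(literal, pattern)
```

A raw string (r'\d str') keeps the backslash intact so the pattern reaches the regex engine unchanged.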
- stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings [source]¶
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
Strings
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
ValueError – Raised if times is < 1
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
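The effect of toLeft is easiest to see side by side. The sketch below emulates the documented element-wise join with Python lists; stick here is a hypothetical stand-in, and the equal-length check is an assumption rather than documented behavior.

```python
def stick(self_strings, other_strings, delimiter='', to_left=False):
    """Join other onto one end of self, element by element.

    Hypothetical emulation of the documented semantics: by default other goes
    on the right of self; with to_left=True it goes on the left.
    """
    if len(self_strings) != len(other_strings):
        # Assumption: element-wise joining needs arrays of equal length
        raise ValueError("both arrays must have the same number of strings")
    if to_left:
        return [o + delimiter + s for s, o in zip(self_strings, other_strings)]
    return [s + delimiter + o for s, o in zip(self_strings, other_strings)]


s = ['a', 'c', 'e']
t = ['b', 'd', 'f']
right_joined = stick(s, t, delimiter='.')
left_joined = stick(s, t, delimiter='.', to_left=True)
print(right_joined, left_joined)
```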
- strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings [source]¶
Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
- Parameters:
chars – the set of characters to be removed
- Returns:
Strings object with the leading and trailing characters matching the set of characters in the chars argument removed
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['Strings ', ' StringS ', 'StringS '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS ', ' 1StringS 12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
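The point that chars is a set of characters rather than a prefix or suffix matches the behavior of Python's built-in str.strip, so a plain-Python sketch (not an arkouda call) can show it:

```python
# chars acts as a set: any mix of ' ', '1', and '2' is removed from both ends,
# regardless of the order in which the characters appear.
strings = ['Strings 1', '1 StringS ', ' 1StringS 12 ', '21 Strings 12']
stripped = [s.strip(' 12') for s in strings]
print(stripped)

# With no argument, only surrounding whitespace is removed.
whitespace_stripped = [s.strip() for s in ['Strings ', ' StringS ', 'StringS ']]
print(whitespace_stripped)
```

Characters from the set that sit in the interior of a string are left alone, because stripping stops at the first character not in chars.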
- sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings [source]¶
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
- subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple [source]¶
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
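The relationship between sub and subn, and the effect of count, can be reproduced per element with Python's re.subn. This is a client-side sketch under the assumption that the pattern uses only syntax common to Python's re and re2; subn_each is a hypothetical helper.

```python
import re


def subn_each(strings, pattern, repl, count=0):
    """Apply re.subn to every element.

    Returns (replaced_strings, substitution_counts); count=0 means replace
    all non-overlapping matches (hypothetical emulation of Strings.subn).
    """
    pairs = [re.subn(pattern, repl, s, count=count) for s in strings]
    replaced = [new for new, _ in pairs]
    counts = [n for _, n in pairs]
    return replaced, counts


strings = ['1_2___', '____', '3', '__4___5____6___7', '']
replaced, counts = subn_each(strings, '_+', '-', count=2)
print(replaced)
print(counts)
```

The first list corresponds to what sub returns on its own; the second adds the per-element substitution counts, capped at count.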
- title() Strings [source]¶
Returns a new Strings with each string from the original replaced by its titlecase equivalent.
- Returns:
Strings with each string from the original replaced by its titlecase equivalent.
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown.
See also
Strings.lower, Strings.upper
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
- to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)[source]¶
Write Strings to CSV file(s). Each file will contain a single column with the Strings data. All CSV files written by Arkouda include a header denoting the data types of the columns. Unlike other file formats, CSV files store Strings in UTF-8 format instead of storing bytes as uint(8).
- Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.
- Return type:
str response message
- Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
- to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str [source]¶
Save the Strings object to HDF5. The object can be saved to a collection of files or single file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
file_type (str ("single" | "distribute")) – Defaults to "distribute", which distributes the dataset over one file per locale. "single" saves the dataset to one file.
- Return type:
String message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
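The naming rule in the notes above can be sketched as a small helper that predicts which files a save would produce. Everything here is illustrative: expected_output_files is a hypothetical function, the zero-padded four-digit suffix is an assumption taken from the _LOCALE#### form mentioned for to_csv, and locale indices are assumed to start at 0 and stop before numLocales.

```python
def expected_output_files(prefix_path, num_locales, file_type='distribute'):
    """List the file names a save with the given prefix would be expected to create.

    Hypothetical helper; the exact suffix format is an assumption.
    """
    kind = file_type.lower()
    if kind == 'single':
        # A single-file save uses the prefix as the whole file name
        return [prefix_path]
    if kind == 'distribute':
        # One file per locale, indices assumed to run 0 .. num_locales - 1
        return [f"{prefix_path}_LOCALE{i:04d}" for i in range(num_locales)]
    raise ValueError(f"unsupported file_type: {file_type!r}")


distributed = expected_output_files('/tmp/strs', 2)
single = expected_output_files('/tmp/strs', 2, file_type='single')
print(distributed, single)
```

A check like this can be handy before calling with mode='append', since appending fails when fewer files exist than there are locales.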
- to_list() list [source]¶
Convert the Strings object to a list, transferring data from the arkouda server to Python. If the Strings object exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A list with the same strings as this Strings object
- Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
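The size guard described in the notes amounts to a comparison made before any data moves to the client. The sketch below shows that idea only; check_transfer_size and the limit value are hypothetical, and the real default of ak.client.maxTransferBytes is not stated here.

```python
# Hypothetical limit for illustration; not the actual arkouda default.
MAX_TRANSFER_BYTES = 2**30


def check_transfer_size(nbytes, max_transfer_bytes=MAX_TRANSFER_BYTES):
    """Raise RuntimeError if an array of nbytes is too large to pull to the client."""
    if nbytes > max_transfer_bytes:
        raise RuntimeError(
            f"array of {nbytes} bytes exceeds the transfer limit of "
            f"{max_transfer_bytes} bytes"
        )
    return nbytes


ok_size = check_transfer_size(10, max_transfer_bytes=100)
try:
    check_transfer_size(101, max_transfer_bytes=100)
    too_large_rejected = False
except RuntimeError:
    too_large_rejected = True
print(ok_size, too_large_rejected)
```

Comparing the array's nbytes against the limit up front lets a caller choose to work on a slice instead of raising the limit.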
- to_ndarray() numpy.ndarray [source]¶
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
array, to_list
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
- to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str [source]¶
Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode – By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files
- Return type:
string message indicating result of save operation
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’.
‘append’ write mode is supported, but is not efficient.
If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to determine the file format.
- transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]¶
Sends a Strings object to a different Arkouda server
- Parameters:
hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). The transfer opens numLocales ports in succession, so it uses the ports in the range {port..(port+numLocales-1)} (e.g., when running an Arkouda server of 4 nodes with 1234 passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port must match the port passed to the call to ak.receive_array().
- Return type:
A message indicating a complete transfer
- Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype
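Because transfer uses one port per locale starting at port, it can help to list the ports that need to be free before calling it. The helper below is a hypothetical sketch that reproduces the 4-node example from the port description.

```python
def transfer_ports(port, num_locales):
    """Return the consecutive ports a transfer is expected to use.

    Hypothetical helper based on the documented example: one port per
    locale, starting at the port that was passed in.
    """
    if num_locales < 1:
        raise ValueError("num_locales must be at least 1")
    return list(range(port, port + num_locales))


ports = transfer_ports(1234, 4)
print(ports)
```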
- unregister() None [source]¶
Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()
- Return type:
None
- Raises:
RuntimeError – Raised if the server could not find the internal name/symbol to remove
Notes
Registered names/Strings objects in the server are immune to deletion until they are unregistered.
- static unregister_strings_by_name(user_defined_name: str) None [source]¶
Unregister a Strings object in the arkouda server previously registered via register()
- Parameters:
user_defined_name (str) – The registered name of the Strings object
- update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)[source]¶
Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added
- Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True, which instructs the server to save the offsets array to HDF5. If False, the offsets array will not be saved and will be derived from the string values upon load/read.
repack (bool) – Defaults to True. HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting this to False will yield better performance, but will cause file sizes to expand.
- Return type:
str - success message if successful
- Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If the file does not contain a File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added
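The file-name fallback mentioned in the first note can be sketched as a simple pattern check. This is not the library's code: looks_distributed is a hypothetical name, and reading #### as exactly four digits is an assumption.

```python
import re

# Assumption: '####' in the notes stands for four decimal digits.
_LOCALE_SUFFIX = re.compile(r'_LOCALE\d{4}')


def looks_distributed(filename):
    """Guess from the name alone whether a file is one piece of a distributed save."""
    return _LOCALE_SUFFIX.search(filename) is not None


checks = [looks_distributed(name) for name in
          ['strs_LOCALE0000', 'strs_LOCALE0003.h5', 'strs.hdf5']]
print(checks)
```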
- upper() Strings [source]¶
Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Returns:
Strings with all lowercase characters from the original replaced with their uppercase equivalent
- Return type:
Strings
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])