Strings in Arkouda
Like NumPy, Arkouda supports arrays of strings, but whereas in NumPy arrays of strings are still ndarray objects, in Arkouda the array of strings is its own class: Strings.
In order to efficiently store strings with a wide range of lengths, Arkouda uses a “segmented array” data structure, comprising:
- bytes: A uint8 array containing the concatenated bytes of all the strings, separated by null (0) bytes.
- offsets: An int64 array with the start index of each string.
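The layout described above can be sketched in plain Python. This is only an illustrative model, not Arkouda's server-side code: the helper names to_segmented and get_string are hypothetical, and the sketch assumes every string, including the last, is followed by a null byte.

```python
# Pure-Python model of a segmented string array (hypothetical helpers,
# not part of the Arkouda API).

def to_segmented(strings):
    """Pack strings into (bytes, offsets), with a null byte after each string."""
    data = bytearray()
    offsets = []
    for s in strings:
        offsets.append(len(data))       # start index of this string
        data.extend(s.encode("utf-8"))  # the string's bytes
        data.append(0)                  # null separator (assumed after every string)
    return bytes(data), offsets


def get_string(data, offsets, i):
    """Recover the i-th string by slicing between consecutive offsets."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[start:end - 1].decode("utf-8")  # drop the trailing null byte


data, offsets = to_segmented(["hello", "my", "world"])
print(offsets)                       # [0, 6, 9]
print(get_string(data, offsets, 1))  # my
```

Because only the offsets need to be consulted to locate a string, strings of very different lengths can share one contiguous byte buffer without padding.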
Performance
Because strings are a variable-width data type, and because of the way Arkouda represents strings, operations on strings are considerably slower than operations on numeric data. Use numeric data whenever possible. For example, if your raw data contains string data that could be represented numerically, consider setting up a processing pipeline that performs the conversion (and stores the result in HDF5 format) on ingest.
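As one possible illustration of such a conversion, the sketch below turns IPv4 address strings into integers with Python's standard ipaddress module. The choice of IP addresses and the sample data are assumptions made for the example; writing the result to HDF5 is left out.

```python
import ipaddress

# Hypothetical raw string column that has a natural numeric representation.
raw = ["10.0.0.1", "192.168.1.20", "10.0.0.1"]

# Convert once, on ingest, so later processing works on integers.
as_ints = [int(ipaddress.IPv4Address(s)) for s in raw]
print(as_ints)  # [167772161, 3232235796, 167772161]

# The conversion is reversible, so no information is lost.
print(str(ipaddress.IPv4Address(as_ints[1])))  # 192.168.1.20
```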
I/O
Arrays of strings can be transferred between the Arkouda client and server using the arkouda.array and Strings.to_ndarray functions (see Data I/O). The former converts a Python list or NumPy ndarray of strings to an Arkouda Strings object, whereas the latter converts an Arkouda Strings object to a NumPy ndarray. As with numeric arrays, if the size of the data exceeds the threshold set by ak.client.maxTransferBytes, the client will raise an exception.
Arkouda currently only supports the HDF5 file format for disk-based I/O. In order to read an array of strings from an HDF5 file, the strings must be stored in an HDF5 group containing two datasets: segments (an integer array corresponding to offsets above) and values (a uint8 array corresponding to bytes above). See Supported File Formats for more information and guidelines.
Iteration
Iterating directly over a Strings object with for x in string is not supported. This discourages transferring all of the Strings object’s data from the Arkouda server to the Python client, since there is almost always a more array-oriented way to express an iterator-based computation. To force this transfer, use the to_ndarray function to return the Strings as a numpy.ndarray. See I/O for more details about using to_ndarray with Strings.
- arkouda.Strings.to_ndarray(self)
Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.
- Returns:
A numpy ndarray with the same strings as this array
- Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.
See also
array, to_list
Examples
>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
Operations
Arkouda Strings objects support the following operations:
- Indexing with integer, slice, integer pdarray, and boolean pdarray (see Indexing and Assignment)
- Comparison (== and !=) with string literal or other Strings object of same size
- Array Set Operations, e.g. unique and in1d
- Sorting, via argsort and coargsort
- GroupBy, both alone and in conjunction with numeric arrays
- Type Casting to and from numeric arrays
- Concatenation with other Strings
String-Specific Methods
Substring search
- Strings.contains(substr, regex=False)
Check whether each element contains the given substring.
- Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that contain substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains(r'string \d', regex=True)
array([True, True, True, True, True])
- Strings.startswith(substr, regex=False)
Check whether each element starts with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The prefix to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that start with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith(r'\d str', regex=True)
array([True, True, True, True, True])
- Strings.endswith(substr, regex=False)
Check whether each element ends with the given substring.
- Parameters:
substr (Union[bytes, str_scalars]) – The suffix to search for
regex (bool) – Indicates whether substr is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
True for elements that end with substr, False otherwise
- Return type:
pdarray, bool
- Raises:
TypeError – Raised if the substr parameter is not bytes or str_scalars
ValueError – Raised if substr is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings_start = ak.array([f'{i} string' for i in range(1, 6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith(r'ing \d', regex=True)
array([True, True, True, True, True])
Splitting and joining
- Strings.peel(delimiter, times=1, includeDelimiter=False, keepPartial=False, fromRight=False, regex=False)
Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The field(s) peeled from the start of each string (unless fromRight is true)
- right: Strings
The remainder of each string after peeling (unless fromRight is true)
- Return type:
Tuple[Strings, Strings]
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
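To make the interaction of times, includeDelimiter, and keepPartial easier to follow, here is a small pure-Python model of left-to-right peeling for a single string with a literal (non-regex) delimiter. The function peel_one is a hypothetical helper written to reproduce the documented examples; it is not how Arkouda implements peel on the server.

```python
def peel_one(s, delimiter, times=1, include_delimiter=False, keep_partial=False):
    """Model of peel for one string, literal delimiter, peeling from the left."""
    if times < 1:
        raise ValueError("times must be >= 1")
    pos, start = -1, 0
    for _ in range(times):
        pos = s.find(delimiter, start)
        if pos == -1:
            # Fewer than `times` delimiters: the whole string goes to one side.
            return (s, "") if keep_partial else ("", s)
        start = pos + len(delimiter)
    left_end = start if include_delimiter else pos
    return s[:left_end], s[start:]


def peel_all(strings, delimiter, **kwargs):
    """Apply peel_one to every element and return (lefts, rights)."""
    pairs = [peel_one(s, delimiter, **kwargs) for s in strings]
    return [p[0] for p in pairs], [p[1] for p in pairs]


s = ['a.b', 'c.d', 'e.f.g']
print(peel_all(s, '.'))                        # (['a', 'c', 'e'], ['b', 'd', 'f.g'])
print(peel_all(s, '.', times=2))               # (['', '', 'e.f'], ['a.b', 'c.d', 'g'])
print(peel_all(s, '.', times=2, keep_partial=True))  # (['a.b', 'c.d', 'e.f'], ['', '', 'g'])
```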
- Strings.rpeel(delimiter, times=1, includeDelimiter=False, keepPartial=False, regex=False)
Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.
- Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Returns:
- left: Strings
The remainder of the string after peeling
- right: Strings
The field(s) that were peeled from the right of each string
- Return type:
Tuple[Strings, Strings]
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
>>> # Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
- Strings.stick(other, delimiter='', toLeft=False)
Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.
- Returns:
The array of joined strings
- Return type:
Strings
- Raises:
TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance
ValueError – Raised if times is < 1
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
- Strings.lstick(other, delimiter='')
Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.
- Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes, str_scalars]) – String inserted between self and other
- Returns:
The array of joined strings, as other + self
- Return type:
Strings
- Raises:
TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
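The element-wise joining performed by stick and lstick can be modeled with ordinary Python lists. The helper stick_model below is a hypothetical stand-in used only to show the ordering of the operands; it assumes both inputs have the same length.

```python
def stick_model(self_strings, other, delimiter="", to_left=False):
    """Element-wise join of two equal-length lists of strings."""
    if len(self_strings) != len(other):
        raise ValueError("both arrays must have the same length")
    if to_left:
        # other goes on the left of self (what lstick does)
        return [o + delimiter + s for s, o in zip(self_strings, other)]
    # default: other goes on the right of self
    return [s + delimiter + o for s, o in zip(self_strings, other)]


s = ['a', 'c', 'e']
t = ['b', 'd', 'f']
print(stick_model(s, t, delimiter='.'))                # ['a.b', 'c.d', 'e.f']
print(stick_model(s, t, delimiter='.', to_left=True))  # ['b.a', 'd.c', 'f.e']
```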
Flattening
Given an array of strings where each string encodes a variable-length sequence delimited by a common substring, flattening offers a method for unpacking the sequences into a flat array of individual elements. A mapping between original strings and new array elements can be preserved, if desired. This method can be used in pipe
Regular Expressions
Strings implements behavior similar to the re Python library applied to every element. This functionality is based on Chapel’s regex module, which is built on Google’s re2. re2 sacrifices some functionality (notably lookahead/lookbehind) in exchange for guarantees that searches complete in linear time and in a fixed amount of stack space.
- Strings.search(pattern)
Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern.
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match if any part of the string matches the regular expression pattern
- Return type:
Match
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- Strings.match(pattern)
Returns a match object where elements match only if the beginning of the string matches the regular expression pattern.
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the beginning of the string matches the regular expression pattern
- Return type:
Match
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=True, span=(0, 2); matched=False>
- Strings.fullmatch(pattern)
Returns a match object where elements match only if the whole string matches the regular expression pattern.
- Parameters:
pattern (str) – Regex used to find matches
- Returns:
Match object where elements match only if the whole string matches the regular expression pattern
- Return type:
Match
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False; matched=False; matched=False>
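Since search, match, and fullmatch mirror the functions of the same names in Python's re module, their differences can be checked locally on a plain list. The sketch below uses re rather than re2, which is an assumption that holds for this simple pattern but not for every regex; the helper spans is hypothetical.

```python
import re

strings = ['1_2___', '____', '3', '__4___5____6___7', '']
pattern = re.compile(r'_+')


def spans(method):
    """Apply pattern.search / pattern.match / pattern.fullmatch to each element."""
    out = []
    for s in strings:
        m = method(s)
        out.append(m.span() if m is not None else None)
    return out


print(spans(pattern.search))     # [(1, 2), (0, 4), None, (0, 2), None]
print(spans(pattern.match))      # [None, (0, 4), None, (0, 2), None]
print(spans(pattern.fullmatch))  # [None, (0, 4), None, None, None]
```

Elements shown as None correspond to matched=False in the Match objects printed in the examples above.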
- Strings.split(delimiter, return_segments=False, regex=False)
Unpack delimiter-joined substrings into a flat array.
- Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring in return array.
regex (bool) – Indicates whether delimiter is a regular expression. Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)
- Return type:
Union[Strings, Tuple]
- Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array
Examples
>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.split('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.split('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.split('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
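The relationship between the flattened array and the returned segments can be illustrated with a small list-based model. The helper split_flat is hypothetical and uses Python's re for the regex case, which is assumed to agree with re2 for simple patterns like the one shown.

```python
import re


def split_flat(strings, delimiter, regex=False):
    """Flatten delimiter-joined substrings; also return each original's start index."""
    flat, segments = [], []
    for s in strings:
        segments.append(len(flat))  # index of this string's first substring
        parts = re.split(delimiter, s) if regex else s.split(delimiter)
        flat.extend(parts)
    return flat, segments


orig = ['one|two', 'three|four|five', 'six']
flat, segments = split_flat(orig, '|')
print(flat)      # ['one', 'two', 'three', 'four', 'five', 'six']
print(segments)  # [0, 2, 5]

# The segments make it possible to recover the substrings of one original string.
print(flat[segments[1]:segments[2]])  # ['three', 'four', 'five']
```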
- Strings.findall(pattern, return_match_origins=False)
Return a new Strings containing all non-overlapping matches of pattern.
- Parameters:
pattern (str_scalars) – Regex used to find matches
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Return type:
Union[Strings, Tuple]
- Returns:
Strings – Strings object containing only pattern matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
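The meaning of the match origins can be seen in a list-based sketch: each match is recorded together with the index of the string it came from. The helper findall_flat is hypothetical and relies on Python's re instead of re2.

```python
import re


def findall_flat(strings, pattern):
    """Collect all non-overlapping matches plus the index of their source string."""
    regex = re.compile(pattern)
    matches, origins = [], []
    for i, s in enumerate(strings):
        for m in regex.finditer(s):
            matches.append(m.group(0))
            origins.append(i)
    return matches, origins


strings = ['1_2___', '____', '3', '__4___5____6___7', '']
matches, origins = findall_flat(strings, r'_+')
print(matches)  # ['_', '___', '____', '__', '___', '____', '___']
print(origins)  # [0, 0, 1, 3, 3, 3, 3]
```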
- Strings.sub(pattern, repl, count=0)
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur.
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Returns:
Strings with pattern matches replaced
- Return type:
Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
- Strings.subn(pattern, repl, count=0)
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions).
- Parameters:
pattern (str_scalars) – The regex to substitute
repl (str_scalars) – The substring to replace pattern matches with
count (int) – The max number of pattern match occurrences in each element to replace. The default count=0 replaces all occurrences of pattern with repl
- Return type:
Tuple
- Returns:
Strings – Strings with pattern matches replaced
pdarray, int64 – The number of substitutions made for each element of Strings
- Raises:
TypeError – Raised if pattern or repl are not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
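The per-element effect of count can be reproduced locally with re.subn, which returns the new string and the number of substitutions made. This is only a client-side illustration using Python's re, not the server's re2-based implementation.

```python
import re

strings = ['1_2___', '____', '3', '__4___5____6___7', '']

# re.subn returns (new_string, number_of_substitutions) for each element.
results = [re.subn(r'_+', '-', s, count=2) for s in strings]
new_strings = [r[0] for r in results]
counts = [r[1] for r in results]

print(new_strings)  # ['1-2-', '-', '3', '-4-5____6___7', '']
print(counts)       # [2, 1, 0, 2, 0]
```

As in the documented example, the fourth element keeps its later runs of underscores because only the first two matches are replaced.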
- Strings.find_locations(pattern)
Finds pattern matches and returns pdarrays containing the number, start positions, and lengths of matches.
- Parameters:
pattern (str_scalars) – The regex pattern used to find matches
- Return type:
Tuple[pdarray, pdarray, pdarray]
- Returns:
pdarray, int64 – For each original string, the number of pattern matches
pdarray, int64 – The start positions of pattern matches
pdarray, int64 – The lengths of pattern matches
- Raises:
TypeError – Raised if the pattern parameter is not bytes or str_scalars
ValueError – Raised if pattern is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations(r'\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
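The three returned arrays have different lengths: one count per original string, but one start and one length per match. A list-based sketch (the helper find_locations_model is hypothetical and uses Python's re) shows how they line up.

```python
import re


def find_locations_model(strings, pattern):
    """Return (matches per string, start of every match, length of every match)."""
    regex = re.compile(pattern)
    num_matches, starts, lens = [], [], []
    for s in strings:
        found = list(regex.finditer(s))
        num_matches.append(len(found))
        starts.extend(m.start() for m in found)
        lens.extend(m.end() - m.start() for m in found)
    return num_matches, starts, lens


strings = [f'{i} string {i}' for i in range(1, 6)]
num_matches, starts, lens = find_locations_model(strings, r'\d')
print(num_matches)  # [2, 2, 2, 2, 2]
print(starts)       # [0, 9, 0, 9, 0, 9, 0, 9, 0, 9]
print(lens)         # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```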
Match Object
search, match, and fullmatch return a Match object, which supports the following methods:
- Match.matched()
Returns a boolean array indicating whether each element matched.
- Returns:
True for elements that match, False otherwise
- Return type:
pdarray, bool
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').matched()
array([True True False True False])
- Match.start()
Returns the starts of matches
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').start()
array([1 0 0])
- Match.end()
Returns the ends of matches
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').end()
array([2 4 2])
- Match.match_type()
Returns the type of the Match object
- Returns:
MatchType of the Match object
- Return type:
str
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').match_type()
'SEARCH'
- Match.find_matches(return_match_origins=False)
Return all matches as a new Strings object
- Parameters:
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').find_matches(return_match_origins=True)
(array(['_', '____', '__']), array([0 1 3]))
- Match.group(group_num=0, return_group_origins=False)
Returns a new Strings containing the capture group corresponding to group_num. For the default, group_num=0, return the full match.
- Parameters:
group_num (int) – The index of the capture group to be returned
return_group_origins (bool) – If True, return a pdarray containing the index of the original string each capture group is from
- Returns:
Strings – Strings object containing only the capture groups corresponding to group_num
pdarray, int64 (optional) – The index of the original string each group is from
Examples
>>> strings = ak.array(["Isaac Newton, physics", '<-calculus->', 'Gottfried Leibniz, math'])
>>> m = strings.search(r"(\w+) (\w+)")
>>> m.group()
array(['Isaac Newton', 'Gottfried Leibniz'])
>>> m.group(1)
array(['Isaac', 'Gottfried'])
>>> m.group(2, return_group_origins=True)
(array(['Newton', 'Leibniz']), array([0 2]))
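Note that group returns values only for elements that matched, which is why the origins array is useful for mapping results back to the input. The following list-based sketch with Python's re (an approximation of the re2-backed behavior, with hypothetical variable names) reproduces the example above.

```python
import re

strings = ["Isaac Newton, physics", '<-calculus->', 'Gottfried Leibniz, math']
pattern = re.compile(r"(\w+) (\w+)")

# Keep only the elements that matched, remembering where each came from.
matched = []
for i, s in enumerate(strings):
    m = pattern.search(s)
    if m is not None:
        matched.append((i, m))

full = [m.group(0) for _, m in matched]
first = [m.group(1) for _, m in matched]
second = [m.group(2) for _, m in matched]
origins = [i for i, _ in matched]

print(full)     # ['Isaac Newton', 'Gottfried Leibniz']
print(first)    # ['Isaac', 'Gottfried']
print(second)   # ['Newton', 'Leibniz']
print(origins)  # [0, 2]
```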