Strings in Arkouda¶
Like NumPy, Arkouda supports arrays of strings, but whereas in NumPy arrays of strings are still ndarray objects, in Arkouda the array of strings is its own class: Strings.
In order to efficiently store strings with a wide range of lengths, Arkouda uses a “segmented array” data structure, comprising:
- bytes: A uint8 array containing the concatenated bytes of all the strings, separated by null (0) bytes.
- offsets: An int64 array with the start index of each string.
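The layout can be illustrated client-side with plain NumPy. This is only a sketch of the idea, not Arkouda's server-side implementation: the helper names `to_segmented` and `get_string` are hypothetical, and the code assumes every string (including the last) is followed by a single null byte.

```python
import numpy as np


def to_segmented(strings):
    """Pack a list of str into (bytes, offsets) arrays (hypothetical helper)."""
    # Assumption: each string is UTF-8 encoded and followed by one null byte.
    encoded = [s.encode("utf-8") + b"\x00" for s in strings]
    lengths = np.array([len(e) for e in encoded], dtype=np.int64)
    offsets = np.zeros(len(encoded), dtype=np.int64)
    if len(encoded) > 1:
        # Each string starts where the previous one (plus its null byte) ended.
        offsets[1:] = np.cumsum(lengths[:-1])
    data = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return data, offsets


def get_string(data, offsets, i):
    """Recover string i from the segmented representation (hypothetical helper)."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    # Drop the trailing null byte before decoding.
    return data[start:end - 1].tobytes().decode("utf-8")


data, offsets = to_segmented(["ab", "", "cde"])
recovered = [get_string(data, offsets, i) for i in range(len(offsets))]
```

Because only the start index of each string is stored, the length of string i is implied by the next offset (or by the end of the byte array for the last string).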
Performance¶
Because strings are a variable-width data type, and because of the way Arkouda represents strings, operations on strings are considerably slower than operations on numeric data. Use numeric data whenever possible. For example, if your raw data contains string data that could be represented numerically, consider setting up a processing pipeline that performs the conversion (and stores the result in HDF5 format) on ingest.
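As a rough illustration of such a conversion step, the sketch below uses plain NumPy before any data reaches Arkouda. The sample values are made up, and whether either conversion suits a given dataset is an assumption to check against the real data.

```python
import numpy as np

# Case 1: strings that are really numbers can be parsed directly.
raw_numbers = np.array(["10", "20", "30"])
as_ints = raw_numbers.astype(np.int64)

# Case 2: a small set of repeated labels can be replaced by integer codes.
raw_labels = np.array(["red", "green", "red", "blue"])
categories, codes = np.unique(raw_labels, return_inverse=True)

# The original labels remain recoverable from the lookup table.
restored = categories[codes]
```

Only the integer arrays (`as_ints`, `codes`) would then need to be stored and loaded for bulk computation; the small `categories` table can be kept alongside them for decoding.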
I/O¶
Arrays of strings can be transferred between the Arkouda client and server using the arkouda.array and Strings.to_ndarray functions (see Data I/O). The former converts a Python list or NumPy ndarray of strings to an Arkouda Strings object, whereas the latter converts an Arkouda Strings object to a NumPy ndarray. As with numeric arrays, if the size of the data exceeds the threshold set by ak.client.maxTransferBytes, the client will raise an exception.
Arkouda currently only supports the HDF5 file format for disk-based I/O. In order to read an array of strings from an HDF5 file, the strings must be stored in an HDF5 group containing two datasets: segments (an integer array corresponding to offsets above) and values (a uint8 array corresponding to bytes above). See Supported File Formats for more information and guidelines.
Iteration¶
Iterating directly over a Strings object with for x in string is not supported. This is deliberate: it discourages transferring all of the Strings object’s data from the Arkouda server to the Python client, since there is almost always a more array-oriented way to express an iterator-based computation. To force this transfer, use the to_ndarray function, which returns the Strings as a numpy.ndarray. See I/O for more details about using to_ndarray with Strings.
#.. autofunction:: arkouda.numpy.Strings.to_ndarray
Operations¶
Arkouda Strings objects support the following operations:
- Indexing with integer, slice, integer pdarray, and boolean pdarray (see Indexing and Assignment)
- Comparison (== and !=) with string literal or other Strings object of same size
- Array Set Operations, e.g. unique and in1d
- Sorting, via argsort and coargsort
- GroupBy, both alone and in conjunction with numeric arrays
- Type Casting to and from numeric arrays
- Concatenation with other Strings
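For readers coming from NumPy, the sketch below shows the NumPy counterparts of several of these operations on a small string array. It runs entirely in NumPy and makes no claim about the exact output formatting or signatures of the corresponding Arkouda calls; `np.isin` is used here as NumPy's current spelling of the in1d-style membership test.

```python
import numpy as np

words = np.array(["pear", "apple", "fig", "apple"])

# Comparison with a string literal yields a boolean array.
is_apple = words == "apple"

# Boolean and integer-array indexing select elements.
apples = words[is_apple]
first_and_last = words[np.array([0, 3])]

# Set-style operations.
distinct = np.unique(words)
is_member = np.isin(words, ["fig", "kiwi"])

# Sorting via an index permutation (stable, so ties keep their order).
order = np.argsort(words, kind="stable")

# Concatenation with another string array.
combined = np.concatenate([words, np.array(["kiwi"])])
```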
String-Specific Methods¶
Substring search¶
# .. automethod:: arkouda.numpy.Strings.contains
# .. automethod:: arkouda.numpy.Strings.startswith
# .. automethod:: arkouda.numpy.Strings.endswith
Splitting and joining¶
# .. automethod:: arkouda.numpy.Strings.peel
# .. automethod:: arkouda.numpy.Strings.rpeel
# .. automethod:: arkouda.numpy.Strings.stick
# .. automethod:: arkouda.numpy.Strings.lstick
Flattening¶
Given an array of strings where each string encodes a variable-length sequence delimited by a common substring, flattening offers a method for unpacking the sequences into a flat array of individual elements. A mapping between original strings and new array elements can be preserved, if desired. This method can be used in pipe
# .. automethod:: arkouda.numpy.Strings.flatten
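The idea behind flattening can be sketched in pure Python. The function `flatten_sketch` below is a hypothetical stand-in, not the Arkouda method, and representing the mapping as the start index of each original string's elements is an assumption made for illustration.

```python
def flatten_sketch(strings, delimiter):
    """Split every string on delimiter and concatenate the pieces.

    Returns the flat list of elements and, for each original string,
    the index in the flat list where its elements begin.
    """
    flat = []
    starts = []
    for s in strings:
        starts.append(len(flat))       # where this string's pieces begin
        flat.extend(s.split(delimiter))
    return flat, starts


flat, starts = flatten_sketch(["a,b", "c", "d,e,f"], ",")
```

With the start indices in hand, the elements that came from original string i are `flat[starts[i]:starts[i + 1]]` (or up to the end of `flat` for the last string).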
Regular Expressions¶
Strings implements behavior similar to the re Python library applied to every element. This functionality is based on Chapel’s regex module, which is built on Google’s re2. re2 sacrifices some functionality (notably lookahead/lookbehind) in exchange for guarantees that searches complete in linear time and in a fixed amount of stack space.
# .. automethod:: arkouda.numpy.Strings.search
# .. automethod:: arkouda.numpy.Strings.match
# .. automethod:: arkouda.numpy.Strings.fullmatch
# .. automethod:: arkouda.numpy.Strings.split
# .. automethod:: arkouda.numpy.Strings.findall
# .. automethod:: arkouda.numpy.Strings.sub
# .. automethod:: arkouda.numpy.Strings.subn
# .. automethod:: arkouda.numpy.Strings.find_locations
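To make "applied to every element" concrete, the sketch below runs Python's own re module over a list, one element at a time. It is a client-side analogy only: the return types and options of the Arkouda methods listed above may differ, and patterns that rely on lookahead or lookbehind would work here but not under re2.

```python
import re

strings = ["1_2___", "____", "3", "__4___5____6___7", ""]
pattern = re.compile(r"_+")

# Element-wise search: does the pattern occur anywhere in each string?
has_match = [pattern.search(s) is not None for s in strings]

# Element-wise findall: every non-overlapping match in each string.
all_matches = [pattern.findall(s) for s in strings]

# Element-wise substitution: replace each run of underscores with "-".
replaced = [pattern.sub("-", s) for s in strings]
```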
Match Object¶
search, match, and fullmatch return a Match object, which supports the following methods:
- Match.matched()
Returns a boolean array indicating whether each element matched
- Returns:
True for elements that match, False otherwise
- Return type:
pdarray, bool
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').matched()
array([True True False True False])
- Match.start()
Returns the starts of matches
- Returns:
The start positions of matches
- Return type:
pdarray, int64
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').start()
array([1 0 0])
- Match.end()
Returns the ends of matches
- Returns:
The end positions of matches
- Return type:
pdarray, int64
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').end()
array([2 4 2])
- Match.match_type()
Returns the type of the Match object
- Returns:
MatchType of the Match object
- Return type:
str
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').match_type()
'SEARCH'
- Match.find_matches(return_match_origins=False)
Return all matches as a new Strings object
- Parameters:
return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from
- Returns:
Strings – Strings object containing only matches
pdarray, int64 (optional) – The index of the original string each pattern match is from
- Raises:
RuntimeError – Raised if there is a server-side error thrown
Examples
>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+').find_matches(return_match_origins=True)
(array(['_', '____', '__']), array([0 1 3]))
- Match.group(group_num=0, return_group_origins=False)
Returns a new Strings containing the capture group corresponding to group_num. For the default, group_num=0, return the full match
- Parameters:
group_num (int) – The index of the capture group to be returned
return_group_origins (bool) – If True, return a pdarray containing the index of the original string each capture group is from
- Returns:
Strings – Strings object containing only the capture groups corresponding to group_num
pdarray, int64 (optional) – The index of the original string each group is from
Examples
>>> strings = ak.array(["Isaac Newton, physics", '<-calculus->', 'Gottfried Leibniz, math'])
>>> m = strings.search(r"(\w+) (\w+)")
>>> m.group()
array(['Isaac Newton', 'Gottfried Leibniz'])
>>> m.group(1)
array(['Isaac', 'Gottfried'])
>>> m.group(2, return_group_origins=True)
(array(['Newton', 'Leibniz']), array([0 2]))
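The relationship between these methods can be mimicked with Python's re module, which may help when reasoning about why start, end, find_matches and group return fewer elements than the input. The helper functions below are hypothetical and run on plain Python lists; they are a sketch of the semantics shown in the examples above, not Arkouda's implementation.

```python
import re


def search_each(strings, pattern):
    """Apply re.search to every element; None marks an element with no match."""
    regex = re.compile(pattern)
    return [regex.search(s) for s in strings]


def matched(matches):
    """One boolean per input element."""
    return [m is not None for m in matches]


def starts(matches):
    """Start positions, only for elements that matched."""
    return [m.start() for m in matches if m is not None]


def ends(matches):
    """End positions, only for elements that matched."""
    return [m.end() for m in matches if m is not None]


def group(matches, group_num=0):
    """Capture group text plus the index of the string each one came from."""
    values = [m.group(group_num) for m in matches if m is not None]
    origins = [i for i, m in enumerate(matches) if m is not None]
    return values, origins


underscores = search_each(["1_2___", "____", "3", "__4___5____6___7", ""], r"_+")
names = search_each(
    ["Isaac Newton, physics", "<-calculus->", "Gottfried Leibniz, math"],
    r"(\w+) (\w+)",
)
surnames, surname_origins = group(names, 2)
```

In this sketch the origins list plays the role of the optional index array: it records which input positions survived, so shortened results can be aligned back to the original strings.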