Arkouda is a Python API for exploratory data analysis on massive datasets. It
leverages a Chapel-based backend to enable high-performance computing on
distributed systems, while exposing a familiar NumPy- and Pandas-like interface
to Python users.
Represent integers as bit vectors, e.g. a set of flags.
Parameters:
values (pdarray, int64) – The integers to represent as bit vectors
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most
significant (right). By default, the most significant bit
is the left-most bit.
This class is a thin wrapper around pdarray that mostly affects
how values are displayed to the user. Operators and methods will
typically treat this class like a uint64 pdarray.
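A minimal usage sketch (array values and width chosen for illustration):
>>> import arkouda as ak
>>> values = ak.arange(4)
>>> bv = ak.BitVector(values, width=3)  # each integer displays as a 3-bit field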
Register this BitVector object and underlying components with the Arkouda server.
Parameters:
user_defined_name (str) – user defined name the BitVector is to be registered under,
this will be the root name for underlying components
Returns:
The same BitVector which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different BitVectors with the same name.
Make a callback (i.e. function) that can be called on an array to create a BitVector.
Parameters:
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most
significant (right). By default, the most significant bit
is the left-most bit.
Returns:
bitvectorizer – A function that takes an array and returns a BitVector instance
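For illustration, a sketch assuming the factory is exposed as ak.BitVectorizer:
>>> make_bv = ak.BitVectorizer(width=3)
>>> bv = make_bv(ak.arange(4))  # wraps the array in a 3-bit-wide BitVector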
initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and
should be a homogeneous type. Different columns may have different
types. If using a dictionary, keys should be strings.
index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.
columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must
be strings. Defaults to a stringified integer range.
Examples
>>> import arkouda as ak
Create an empty DataFrame and add a column of data:
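A minimal continuation of that example (the column name is hypothetical):
>>> df = ak.DataFrame()
>>> df['col_A'] = ak.arange(3)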
Group the dataframe by a column or a list of columns.
Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=False) – If True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index;
otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise returns an arkouda.groupbyclass.GroupBy object.
If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
The default is True
Type:
bool (default=True)
Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but
float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray
and define or overload the grouping API:
1. a ._get_grouping_keys() method that returns a list of pdarrays
that can be (co)argsorted.
2. (Optional) a .group() method that returns the permutation that
groups the array.
If the input is a single array with a .group() method defined, method 2
will be used; otherwise, method 1 will be used.
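A minimal grouping sketch (column names hypothetical; size() is assumed to return per-group counts):
>>> df = ak.DataFrame({'key': ak.array([1, 1, 2]), 'val': ak.array([10, 20, 30])})
>>> g = df.GroupBy('key')
>>> keys, counts = g.size()  # unique keys and the number of rows in each group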
Concatenate data from ‘other’ onto the end of this DataFrame, in place.
Explicitly, use the arkouda concatenate function to append the data
from each column in other to the end of self. This operation is done
in place, in the sense that the underlying pdarrays are updated from
the result of the arkouda concatenate function, rather than returning
a new DataFrame object containing the result.
Parameters:
other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.
ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but
data within a row remains together). By default, append all rows
to the end, in input order.
Returns:
Appending occurs in place, but the result is returned for compatibility.
The operation is done in place and the original DataFrame will be modified.
This may be useful if you want to unsort a DataFrame, or even to
apply an arbitrary permutation such as the inverse of a sorting
permutation.
Parameters:
perm (pdarray) – A permutation array. Should be the same size as the data
arrays, and should consist of the integers [0,size-1] in
some order. Very minimal testing is done to ensure this
is a permutation.
Return a new object with all original columns in addition to new ones.
Existing columns that are re-assigned will be overwritten.
Parameters:
**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are
callable, they are computed on the DataFrame and
assigned to the new columns. The callable must not
change the input DataFrame (though pandas doesn’t check it).
If the values are not callable, (e.g. a Series, scalar, or array),
they are simply assigned.
Returns:
A new DataFrame with the new columns in addition to
all the existing columns.
Assigning multiple columns within the same assign is possible.
Later items in ‘**kwargs’ may refer to newly created or modified
columns in ‘df’; items are computed and assigned into ‘df’ in order.
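A hedged sketch in the pandas style described above (column names hypothetical):
>>> df = ak.DataFrame({'temp_c': ak.array([17, 25])})
>>> df2 = df.assign(temp_f=lambda d: d['temp_c'] * 9 / 5 + 32)  # computed on the frame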
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’, counts are generated for each column.
If 1 or ‘columns’, counts are generated for each row.
numeric_only (bool = False) – Include only float, int or boolean data.
Returns:
For each column/row the number of non-NA/null entries.
Filter rows by the size of groups defined on one or more columns.
Group the DataFrame by the specified keys, compute the count of each group,
and return a boolean mask indicating which rows belong to groups whose sizes
fall within the inclusive range [low, high].
Parameters:
keys (str or list of str) – Column name or list of column names to group by.
low (int, default=1) – Minimum group size (inclusive). Must be >= 0.
high (int or None, default=None) – Maximum group size (inclusive). If None, no upper bound is applied.
Returns:
A boolean mask array of length equal to the number of rows in the DataFrame,
where True indicates the row’s group size is between low and high.
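A minimal sketch (column name hypothetical; assumes boolean-mask indexing of the frame):
>>> df = ak.DataFrame({'key': ak.array([0, 0, 1])})
>>> mask = df.filter_by_range('key', low=2)  # True for rows whose group has at least 2 rows
>>> kept = df[mask]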
Group the dataframe by a column or a list of columns.
Alias for GroupBy.
Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=True) – If True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index;
otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise returns an arkouda.groupbyclass.GroupBy object.
When values is a pdarray, check every value in the DataFrame to determine if
it exists in values.
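For context, a setup consistent with the outputs below (chosen to match the documented example):
>>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B': ak.array([1, 9])})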
>>> df.isin(ak.array([0,1]))
   col_A  col_B
0      0      1
1      0      0
When values is a dict, the values in the dict are passed to check the column
indicated by the key.
>>> df.isin({'col_A':ak.array([0,3])})
   col_A  col_B
0      0      0
1      1      0
When values is a Series, each column is checked if values is present positionally.
This means that for True to be returned, the indexes must be the same.
Return a boolean same-sized object indicating if the values are NA.
numpy.NaN values get mapped to True values.
Everything else gets mapped to False values.
Returns:
Mask of bool values for each element in DataFrame
that indicates whether an element is an NA value.
The memory usage can optionally include the contribution of
the index.
Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s
index in returned Series. If index=True, the memory usage of
the index is the first item in the output.
unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.
Returns:
A Series whose index is the original column names and whose values
are the memory usage of each column in bytes.
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on.
If on is None, this defaults to the intersection of the columns in both DataFrames.
how ({"inner", "left", "right}, default = "inner") – The merge condition.
Must be “inner”, “left”, or “right”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping
column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping
column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64.
This is to match pandas.
If False, do not convert the column dtypes.
This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”.
Otherwise, the DataFrame is not sorted.
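A minimal join sketch (frames and column names hypothetical):
>>> left = ak.DataFrame({'id': ak.array([0, 1, 2]), 'x': ak.array([9, 8, 7])})
>>> right = ak.DataFrame({'id': ak.array([1, 2, 3]), 'y': ak.array([5, 6, 7])})
>>> inner = left.merge(right, on='id', how='inner')  # keeps ids present in both frames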
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to ‘utf-8’.
errors defaults to ‘strict’.
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Register this DataFrame object and underlying components with the Arkouda server.
Parameters:
user_defined_name (str) – User defined name the DataFrame is to be registered under.
This will be the root name for underlying components.
Returns:
The same DataFrame which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different DataFrames with the same name.
mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values.
Nonexistent names will not raise an error.
Uses the value of axis to determine whether to rename columns or the index.
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to
new column names. Nonexistent names will not raise an
error.
When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to
new index names. Nonexistent names will not raise an
error.
When this is set, axis is ignored.
axis (int or str, default=0) – Indicates which axis to perform the rename.
0/”index” - Indexes
1/”column” - Columns
inplace (bool, default=False) – When True, perform the operation on the calling object.
When False, return a new object.
Returns:
DataFrame when inplace=False;
None when inplace=True.
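A minimal sketch using the column-mapping form (names hypothetical):
>>> df = ak.DataFrame({'a': ak.arange(3)})
>>> df2 = df.rename(column={'a': 'b'})  # returns a renamed copy; inplace=True mutates instead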
Useful if this dataframe is the result of a slice operation from
another dataframe, or if you have permuted the rows and no longer need
to keep that ordering on the rows.
Parameters:
size (int, optional) – If size is passed, do not attempt to determine size based on
existing column sizes. Assume caller handles consistency correctly.
inplace (bool, default=False) – When True, perform the operation on the calling object.
When False, return a new object.
Returns:
DataFrame when inplace=False;
None when inplace=True.
File will contain a column for each column in the DataFrame.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing
bytes as uint(8).
Parameters:
path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
index (bool, default=False) – If True, the index of the DataFrame will be written to the file
as a column.
columns (list of str (Optional)) – Column names to assign when writing data.
col_delim (str, default=",") – Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Raises:
ValueError – Raised if all datasets are not present in all CSV files or if one or
more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection,
e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec,
e.g., starting “s3://”, “gcs://”.
An error will be raised if providing this argument with a non-fsspec URL.
See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
**kwargs – These parameters will be passed to tabulate.
datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum size, in megabytes, to transfer. The requested
DataFrame will be converted to a pandas DataFrame only if the
estimated size of the DataFrame does not exceed this value.
retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want
to keep the index column, set this to True.
Returns:
The result of converting this DataFrame to a pandas DataFrame.
Save DataFrame to disk as parquet, preserving column names.
Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (list) – List of columns to include in the file. If None, writes out all columns.
compression (str (Optional), default=None) – Provide the compression type to use when writing the file.
Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals
don’t satisfy that requirement.
If set, write the equivalent Strings in place of any Categorical columns.
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All
files are prefixed by the path argument and suffixed by their
locale number.
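A hedged usage sketch (the path is hypothetical and must be writable by the arkouda server):
>>> df.to_parquet('/tmp/df_output', compression='snappy')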
hostname (str) – The hostname where the Arkouda server intended to
receive the DataFrame is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Returns:
A message indicating a complete transfer.
Return type:
str
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Overwrite the dataset with the name provided with this dataframe.
If the dataset does not exist it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.
repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Returns:
Success message if successful.
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
A DataFrame that has been grouped by a subset of columns.
Parameters:
gb_key_names (str or list(str), default=None) – The column name(s) associated with the aggregated columns.
as_index (bool, default=True) – If True, interpret aggregated column as index
(only implemented for single dimensional aggregates).
Otherwise, treat aggregated column as a dataframe column.
GroupBy object, where the aggregation keys are values of column(s) of a dataframe,
usually in preparation for aggregating with respect to the other columns.
x (Series or pdarray) – The values to put in each group’s segment.
permute (bool, default=True) – If True (default), permute broadcast values back to the
ordering of the original array on which GroupBy was called.
If False, the broadcast values are grouped by value.
Returns:
A Series with the Index of the original frame and the values of the broadcast.
n (int, optional, default = 5) – Maximum number of rows to return for each group.
If the number of rows in a group is less than n,
all the values from that group will be returned.
sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.
You can either specify the number of elements
or the fraction of elements to be sampled. random_state can be used for reproducibility.
Parameters:
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the same row more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If passed a pdarray, then values must have the same length as the underlying DataFrame
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator.
If ak.random.Generator, use as given.
Returns:
A new DataFrame containing items randomly sampled from each group
sorted according to the grouped columns.
n (int, optional, default = 5) – Maximum number of rows to return for each group.
If the number of rows in a group is less than n,
all the rows from that group will be returned.
sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.
Datetime is the Arkouda analog to pandas DatetimeIndex and
other timeseries data types.
Parameters:
pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array)
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas
and numpy arrays, which carry their own unit. Not case-sensitive;
prefixes of full names (like ‘sec’) are accepted.
Possible values:
‘weeks’ or ‘w’
‘days’ or ‘d’
‘hours’ or ‘h’
‘minutes’, ‘m’, or ‘t’
‘seconds’ or ‘s’
‘milliseconds’, ‘ms’, or ‘l’
‘microseconds’, ‘us’, or ‘u’
‘nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers.
Notes
The .values attribute is always in nanoseconds with int64 dtype.
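A minimal sketch (unit chosen for illustration):
>>> dt = ak.Datetime(ak.arange(3), unit='d')  # interpret the int64 values as days
>>> ns = dt.values  # underlying int64 array, always in nanoseconds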
Register this Datetime object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the Datetime is to be registered under,
this will be the root name for underlying components
Returns:
The same Datetime which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Datetimes with the same name.
Return sum of array elements along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
numeric_scalars if axis is omitted, in which case the operation is done over the entire array.
pdarray if axis is supplied, in which case the operation is done along that axis.
The bool_ type is not a subclass of the int_ type
(the bool_ is not even a number type). This is different
than Python’s default implementation of bool as a
sub-class of int.
An integer-backed representation of a set of named binary fields, e.g. flags.
Parameters:
values (pdarray or Strings) – The array of field values. If (u)int64, the values are used as-is for the
binary representation of fields. If Strings, the values are converted
to binary according to the mapping defined by the names and MSB_left
arguments.
names (str or sequence of str) – The names of the fields, in order. A string will be treated as a list
of single-character field names. Multi-character field names are allowed,
but must be passed as a list or tuple, and the user must specify a separator.
MSB_left (bool) – Controls how field names are mapped to binary values. If True (default),
the left-most field name corresponds to the most significant bit in the
binary representation. If False, the left-most field name corresponds to
the least significant bit.
pad (str) – Character to display when field is not present. Use empty string if no
padding is desired.
separator (str) – Substring that separates fields. Used to parse input values (if ak.Strings)
and to display output.
show_int (bool) – If True (default), display the integer value of the binary fields in output.
This class is a thin wrapper around pdarray that mostly affects
how values are displayed to the user. Operators and methods will
typically treat this class like an int64 pdarray.
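A minimal sketch, assuming Fields is exposed as ak.Fields (field names hypothetical):
>>> f = ak.Fields(ak.arange(4), names='ABC')  # three single-character flag fields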
If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
The default is True
Type:
bool (default=True)
Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but
float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray
and define or overload the grouping API:
1. a ._get_grouping_keys() method that returns a list of pdarrays
that can be (co)argsorted.
2. (Optional) a .group() method that returns the permutation that
groups the array.
If the input is a single array with a .group() method defined, method 2
will be used; otherwise, method 1 will be used.
values (pdarray, Strings) – The values to put in each group’s segment
permute (bool) – If True (default), permute broadcast values back to the ordering
of the original array on which GroupBy was called. If False, the
broadcast values are grouped by value.
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one
value per segment
Notes
This function is a sparse analog of np.broadcast. If a
GroupBy object represents a sparse matrix (tensor), then
this function takes a (dense) column vector and replicates
each value to the non-zero elements in the corresponding row.
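A minimal replication sketch (arrays hypothetical):
>>> keys = ak.array([0, 1, 0, 1, 0])
>>> g = ak.GroupBy(keys)
>>> dense = g.broadcast(ak.array([3, 5]))  # one value per group, replicated to all 5 rows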
Build a new GroupBy object from component keys and permutation.
Parameters:
user_defined_name (str, optional) – Passing a name will initialize the new GroupBy and assign it the given name.
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy.
Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”.
Returns:
The GroupBy object created by using the given components
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order.
result (pdarray-like) – The first n items of each group.
If return_indices is True, the result contains indices; otherwise, it contains values.
Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
Group (3,3,3) has values [3,4,1] -> 3 unique values
Group (4) has values [4] -> 1 unique value
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to ‘utf-8’.
errors defaults to ‘strict’.
Register this GroupBy object and underlying components with the Arkouda server.
Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under,
this will be the root name for underlying components
Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different GroupBys with the same name.
You can either specify the number of elements
or the fraction of elements to be sampled. random_state can be used for reproducibility.
Parameters:
values ((list of) pdarray-like) – The values from which to sample, according to their group membership.
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the same value more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If passed a pdarray, then values must have the same length as the groupby keys
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator.
If ak.random.Generator, use as given.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the sampled values.
permute_samples (bool, default False) – If True, permute the samples according to the grouped order.
Otherwise, keep samples in the original order.
Returns:
If return_indices is True, return the indices of the sampled values.
Otherwise, return the sampled values.
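A minimal sketch (arrays hypothetical):
>>> g = ak.GroupBy(ak.array([0, 0, 1, 1]))
>>> sampled = g.sample(ak.arange(4), n=1, random_state=1)  # one value drawn per group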
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
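A minimal sketch (arrays hypothetical):
>>> g = ak.GroupBy(ak.array([0, 0, 1]))
>>> keys, stds = g.std(ak.array([1.0, 3.0, 5.0]))  # per-group standard deviation (see ddof note above)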
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order.
result (pdarray-like) – The last n items of each group.
If return_indices is True, the result contains indices; otherwise, it contains values.
The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, the dataset is written to a single file.
When distribute, the dataset is written to one file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
This class is a thin wrapper around pdarray that mostly affects
how values are displayed to the user. Operators and methods will
typically treat this class like an int64 pdarray.
Register this IPv4 object and underlying components with the Arkouda server.
Parameters:
user_defined_name (str) – user defined name the IPv4 is to be registered under,
this will be the root name for underlying components
Returns:
The same IPv4 which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different IPv4s with the same name.
D.update([E, ]**F) -> None. Update D from mapping/iterable E and F.
If E is present and has a .keys() method, then does: for k in E.keys(): D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
Append other to self, either vertically (axis=0, length of resulting SegArray
increases), or horizontally (axis=1, each sub-array of other appends to the
corresponding sub-array of self).
Select the j-th element of each sub-array, where possible.
Parameters:
j (int) – The index of the value to get from each sub-array. If j is negative,
it counts backwards from the end of each sub-array.
return_origins (bool) – If True, return a logical index indicating where j is in bounds
compressed (bool) – If False, return array is same size as self, with default value
where j is out of bounds. If True, the return array only contains
values where j is in bounds.
default (scalar) – When compressed=False, the value to return when j is out of bounds
for the sub-array
Returns:
val (pdarray) – If compressed=False, the j-th value of each sub-array where j is in
bounds and the default value where j is out of bounds.
If compressed=True, the j-th values of only the sub-arrays where j is in bounds.
origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.
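A minimal sketch (segment boundaries hypothetical; assumes return_origins defaults to True):
>>> sa = ak.SegArray(ak.array([0, 3]), ak.arange(5))  # sub-arrays [0, 1, 2] and [3, 4]
>>> val, origins = sa.get_jth(1)  # second element of each sub-array, where it exists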
Return all sub-arrays of length n, as a list of columns.
Parameters:
n (int) – Length of sub-arrays to select
return_origins (bool) – Return a logical index indicating which sub-arrays are length n
Returns:
columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long
sub-arrays from the SegArray. The number of rows is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where the sub-array
has length n.
Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)
Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which sub-arrays
were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays
that are at least n+1 long. If False, allow the entire
sub-array to be returned as a prefix.
Returns:
prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix.
The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return
an n-prefix, False otherwise.
Return the n-long suffix of each sub-array, where possible
Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which sub-arrays
were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays
that are at least n+1 long. If False, allow the entire
sub-array to be returned as a suffix.
Returns:
suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix.
The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return
an n-suffix, False otherwise.
Register this SegArray object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name which this SegArray object will be registered under
Returns:
The same SegArray which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different SegArrays with the same name.
Save the SegArray to HDF5. The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, the dataset is written to a single file.
When distribute, the dataset is written to one file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Save the SegArray object to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the object to its
corresponding file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – Deprecated. Parameter kept to maintain functionality of other calls.
Only ‘truncate’ is supported.
By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – If write mode is not Truncate.
Notes
Append mode for Parquet has been deprecated. It was not implemented for SegArray.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Send a Segmented Array to a different Arkouda server.
Parameters:
hostname (str) – The hostname where the Arkouda server intended to
receive the Segmented Array is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Overwrite the dataset with the name provided with this SegArray object. If
the dataset does not exist it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Raises:
RuntimeError – Raised if a server-side error is thrown saving the SegArray
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
Because HDF5 deletes do not release memory, this will create a copy of the
file with the new data
index (pdarray, Strings, optional) – An array of indices associated with the data array.
If empty, it will default to a range of ints whose size matches the size of the data.
data (Tuple, List, groupable_element_type, Series, SegArray) – a 1D array. Must not be None.
Raises:
TypeError – Raised if index is not a pdarray or Strings object
Raised if data is not a pdarray, Strings, or Categorical object
ValueError – Raised if the index size does not match data size
Notes
The Series class accepts either positional arguments or keyword arguments.
If entering positional arguments,
2 arguments entered:
argument 1 - data
argument 2 - index
1 argument entered:
argument 1 - data
If entering 1 positional argument, it is assumed that this is the data argument.
If only ‘data’ argument is passed in, Index will automatically be generated.
If entering keywords,
‘data’ (see Parameters)
‘index’ (optional) must match size of ‘data’
Concatenate a list of Arkouda Series or grouped arrays horizontally or vertically.
If a list of grouped Arkouda arrays is passed, they are converted to Series. Each grouping
is a 2-tuple where the first item is the key(s) and the second is the value. If concatenating
horizontally (axis=1), all series/groupings must have the same length and the same index.
The index is converted to a column in the resulting DataFrame; if it’s a MultiIndex,
each level is converted to a separate column.
Parameters:
arrays (List) – A list of Series or groupings (tuples of index and values) to concatenate.
axis (int, default=0) – The axis to concatenate along:
- 0 = vertical (stack series into one)
- 1 = horizontal (align by index and produce a DataFrame)
index_labels (List of str or None, optional) – Column name(s) to label the index when axis=1.
value_labels (List of str or None, optional) – Column names to label the values of each Series.
ordered (bool, default=False) – Unused parameter. Reserved for future support of deterministic
vs. performance-optimized concatenation.
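A minimal vertical-concatenation sketch, assuming the classmethod form Series.concat:
>>> s1 = ak.Series(ak.array([1, 2]))
>>> s2 = ak.Series(ak.array([3, 4]))
>>> combined = ak.Series.concat([s1, s2], axis=0)  # one Series of length 4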
value (supported_scalars, Series, or pdarray) – Value to use to fill holes (e.g. 0), alternately a
Series of values specifying which value to use for
each index. Values not in the Series will not be filled.
This value cannot be a list.
Return a boolean same-sized object indicating if the values are NA. NA values,
such as numpy.NaN, get mapped to True values.
Everything else gets mapped to False values.
Characters such as empty strings ‘’ are not considered NA values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is an NA value.
Return a boolean same-sized object indicating if the values are NA. NA values,
such as numpy.NaN, get mapped to True values.
Everything else gets mapped to False values.
Characters such as empty strings ‘’ are not considered NA values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is an NA value.
The key or keys to look up. This can be:
- A scalar
- A list of scalars
- A list of lists (for MultiIndex)
- A Series (in which case labels are preserved, and its values are used as keys)
Keys will be converted to Arkouda arrays as needed.
Returns:
A Series containing the values corresponding to the key.
Map values of Series according to an input mapping.
Parameters:
arg (dict or Series) – The mapping correspondence.
Returns:
A new series with the same index as the caller.
When the input Series has Categorical values,
the return Series will have Strings values.
Otherwise, the return type will match the input type.
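A minimal sketch (mapping hypothetical):
>>> s = ak.Series(ak.array([1, 2, 1]))
>>> s2 = s.map({1: 10, 2: 20})  # each value looked up through the dict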
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True.
Characters such as empty strings ‘’ are not considered NA values.
NA values, such as numpy.NaN, get mapped to False values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is not an NA value.
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True.
Characters such as empty strings ‘’ are not considered NA values.
NA values, such as numpy.NaN, get mapped to False values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is not an NA value.
Concatenate a list of Arkouda Series or grouped arrays, returning a local pandas object.
If a list of grouped Arkouda arrays is passed, they are converted to Series. Each grouping
is a 2-tuple with the first item being the key(s) and the second the value.
If axis=1 (horizontal), each Series or grouping must have the same length and the same index.
The index is converted to a column in the resulting DataFrame. If it is a MultiIndex,
each level is converted to a separate column.
Parameters:
arrays (List) – A list of Series or groupings (tuples of index and values) to concatenate.
axis (int, default=0) – The axis along which to concatenate:
- 0 = vertical (stack into a Series)
- 1 = horizontal (align by index into a DataFrame)
labels (Strings or None, optional) – Names to assign to the resulting columns in the DataFrame.
Register this Series object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the Series is to be registered under,
this will be the root name for underlying components
Returns:
The same Series which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Series with the same name.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection,
e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec,
e.g., starting “s3://”, “gcs://”.
An error will be raised if providing this argument with a non-fsspec URL.
See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
**kwargs – These parameters will be passed to tabulate.
D.update([E, ]**F) -> None. Update D from mapping/iterable E and F.
If E is present and has a .keys() method, then does: for k in E.keys(): D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
Represents an array of strings whose data resides on the
arkouda server. The user should not call this class directly;
rather its instances are created by other arkouda functions.
Strings is composed of two pdarrays: (1) offsets, which contains the
starting indices for each string and (2) bytes, which contains the
raw bytes of all strings, delimited by nulls.
>>> import arkouda as ak
>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3', 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings Are Here 0', 'Strings Are Here 1', 'Strings Are Here 2', 'Strings Are Here 3', 'Strings Are Here 4'])
Check whether each element contains the given substring.
Parameters:
substr (bytes or str_scalars) – The substring in the form of string or byte array to search for
regex (bool, default=False) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that contain substr, False otherwise
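A minimal sketch (values hypothetical):
>>> strings = ak.array(['one', 'two', 'three'])
>>> strings.contains('o')  # True, True, False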
Check whether each element ends with the given substring.
Parameters:
substr (bytes or str_scalars) – The suffix to search for
regex (bool, default=False) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that end with substr, False otherwise
Assemble a Strings object from separate offset and bytes arrays.
This factory method constructs a segmented Strings array by sending two
separate components—offsets and values—to the Arkouda server and instructing
it to assemble them into a single Strings object. Use this when offsets
and byte data are created or transported independently.
Parameters:
offset_attrib (pdarray or str) – The array of starting positions for each string, or a string
expression that can be passed to create_pdarray to build it.
bytes_attrib (pdarray or str) – The array of raw byte values (e.g., uint8 character codes), or a string
expression that can be passed to create_pdarray to build it.
Returns:
A Strings object representing the assembled segmented strings array
on the Arkouda server.
Create a Strings object from an Arkouda server response message.
Parse the server’s response descriptor and construct a Strings array
with its underlying pdarray and total byte size.
Parameters:
rep_msg (str) – Server response message of the form:
`created<name><type><size><ndim><shape><itemsize>+...bytes.size<total_bytes>`
For example:
`"createdfooStrings31(3,)8+createdbytes.size24"`
Returns:
A Strings object representing the segmented strings array on the server,
initialized with the returned pdarray and byte-size metadata.
return_origins (bool, default=True) – If True, return a logical index indicating which strings
were long enough to return an n-prefix
proper (bool, default=True) – If True, only return proper prefixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a prefix.
Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character prefix, False otherwise.
return_origins (bool, default=True) – If True, return a logical index indicating which strings
were long enough to return an n-suffix
proper (bool, default=True) – If True, only return proper suffixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a suffix.
Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character suffix, False otherwise.
Return the permutation that groups the array, placing equivalent
strings together. All instances of the same string are guaranteed to lie
in one contiguous block of the permuted array, but the blocks are not
necessarily ordered.
If the arkouda server is compiled with “-sSegmentedString.useHash=true”,
then arkouda uses 128-bit hash values to group strings, rather than sorting
the strings directly. This method is fast, but the resulting permutation
merely groups equivalent strings and does not sort them. If the “useHash”
parameter is false, then a full sort is performed.
Raises:
RuntimeError – Raised if there is a server-side error in executing group request or
creating the pdarray encapsulating the return message
The implementation uses SipHash128, a fast and balanced hash function (used
by Python for dictionaries and sets). For realistic numbers of strings (up
to about 10**15), the probability of a collision between two 128-bit hash
values is negligible.
Return a boolean pdarray where index i indicates whether string i of the
Strings is alphabetic. This means there is at least one character,
and all the characters are alphabetic.
Returns:
True for elements that are alphabetic, False otherwise
Join the strings from another array onto the left of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (bytes or str_scalars, default="") – String inserted between self and other
Peel off one or more delimited fields from each string (similar
to string.partition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (bytes or str_scalars) – The separator where the split will occur
times (int_scalars, default=1) – The number of times the delimiter is sought, i.e. skip over
the first (times-1) delimiters
includeDelimiter (bool, default=False) – If true, append the delimiter to the end of the first return
array. By default, it is prepended to the beginning of the
second return array.
keepPartial (bool, default=False) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the first array. By default,
such strings are returned in the second array.
fromRight (bool, default=False) – If true, peel from the right instead of the left (see also rpeel)
regex (bool, default=False) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The field(s) peeled from the end of each string (unless
fromRight is true)
right: Strings
The remainder of each string after peeling (unless fromRight
is true)
TypeError – Raised if the delimiter parameter is not byte or str_scalars, if
times is not int64, or if includeDelimiter, keepPartial, or
fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
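A minimal sketch (values hypothetical):
>>> s = ak.array(['a.b', 'c.d.e'])
>>> left, right = s.peel('.')  # left: ['a', 'c']; right: ['b', 'd.e']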
Register this Strings object with a user defined name in the arkouda server
so it can be attached to later using Strings.attach()
This is an in-place operation; registering a Strings object more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one object at a time.
Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
Returns:
The same Strings object which is now registered with the arkouda server and
has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different objects with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name
If the user is attempting to register more than one object with the same name,
the former should be unregistered first to free up the registration name.
Peel off one or more delimited fields from the end of each string
(similar to string.rpartition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (bytes or str_scalars) – The separator where the split will occur
times (int_scalars, default=1) – The number of times the delimiter is sought, i.e. skip over
the last (times-1) delimiters
includeDelimiter (bool, default=False) – If true, prepend the delimiter to the start of the first return
array. By default, it is appended to the end of the
second return array.
keepPartial (bool, default=False) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the second array. By default,
such strings are returned in the first array.
regex (bool, default=False) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The remainder of the string after peeling
right: Strings
The field(s) that were peeled from the right of each string
Return a match object with the first location in each element where pattern produces a match.
Elements match if any part of the string matches the regular expression pattern
Parameters:
pattern (bytes or str_scalars) – Regex used to find matches
Returns:
Match object where elements match if any part of the string matches the
regular expression pattern
Unpack delimiter-joined substrings into a flat array.
Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool, default=False) – If True, also return mapping of original strings to first substring
in return array.
regex (bool, default=False) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
Strings
Flattened substrings with delimiters removed
pdarray, int64 (optional)
For each original string, the index of the first corresponding substring
in the return array
Check whether each element starts with the given substring.
Parameters:
substr (bytes or str_scalars) – The prefix to search for
regex (bool, default=False) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that start with substr, False otherwise
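For example, a minimal sketch (assuming a connected Arkouda server; expected values in comments):
>>> import arkouda as ak
>>> s = ak.array(['apple', 'banana', 'apricot'])
>>> s.startswith('ap')   # expected: [True, False, True]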
Join the strings from another array onto one end of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (bytes or str_scalars, default="") – String inserted between self and other
toLeft (bool, default=False) – If true, join other strings to the left of self. By default,
other is joined to the right of self.
Return a new Strings object with all leading and trailing occurrences of characters contained
in chars removed. The chars argument is a string specifying the set of characters to be removed.
If omitted, the chars argument defaults to removing whitespace. The chars argument is not a
prefix or suffix; rather, all combinations of its values are stripped.
Parameters:
chars (bytes or str_scalars, optional) – the set of characters to be removed
Returns:
Strings object with the leading and trailing characters matching the set of characters in
the chars argument removed
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the
replacement repl.
If count is nonzero, at most count substitutions occur
Parameters:
pattern (bytes or str_scalars) – The regex to substitute
repl (bytes or str_scalars) – The substring to replace pattern matches with
count (int, default=0) – The max number of pattern match occurrences in each element to replace.
The default count=0 replaces all occurrences of pattern with repl
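For illustration, a minimal sketch of sub (assuming a connected Arkouda server):
>>> import arkouda as ak
>>> s = ak.array(['1 fish 2 fish', 'red fish blue fish'])
>>> s.sub(pattern='fish', repl='cat')            # replaces every match
>>> s.sub(pattern='fish', repl='cat', count=1)   # replaces only the first match per element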
Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitutions)
Parameters:
pattern (bytes or str_scalars) – The regex to substitute
repl (bytes or str_scalars) – The substring to replace pattern matches with
count (int, default=0) – The max number of pattern match occurrences in each element to replace.
The default count=0 replaces all occurrences of pattern with repl
Returns:
Strings
Strings with pattern matches replaced
pdarray, int64
The number of substitutions made for each element of Strings
Write Strings to CSV file(s). File will contain a single column with the Strings data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings as UTF-8 text rather than as
uint(8) bytes.
Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset (str, default="strings_array") – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str, default=",") – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, default=False) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Returns:
response message
Return type:
str
Raises:
ValueError – Raised if all datasets are not present in all CSV files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Save the Strings object to HDF5.
The object can be saved to a collection of files or single file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str, default="strings_array") – The name of the Strings dataset to be written, defaults to strings_array
mode ({"truncate", "append"}, default = "truncate") – By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool, default=True) – Defaults to True, which instructs the server to save the offsets array to HDF5.
If False, the offsets array will not be saved and will be derived from the string values
upon load/read.
file_type ({"single", "distribute"}, default = "distribute") – Default: Distribute
Distribute the dataset over a file per locale.
Single file will save the dataset to one file
Returns:
String message indicating result of save operation
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Strings state is saved as two datasets within an hdf5 group:
one for the string characters and one for the
segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Convert the SegString to a list, transferring data from the
arkouda server to Python. If the SegString exceeds a built-in size limit,
a RuntimeError is raised.
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
Convert the array to a np.ndarray, transferring array data from the
arkouda server to Python. If the array exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A numpy ndarray with the same strings as this array
Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
Save the Strings object to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str, default="strings_array") – Name of the dataset to create in files (must not already exist)
mode ({"truncate", "append"}, default="truncate") – By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
compression ({"snappy", "gzip", "brotli", "zstd", "lz4"}, optional) – Sets the compression type used with Parquet files
Returns:
string message indicating result of save operation
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Send a Strings object to a different Arkouda server.
Parameters:
hostname (str) – The hostname where the Arkouda server intended to
receive the Strings object is running.
port (int_scalars) – The port over which to send the array. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). Arkouda will open numLocales ports in
succession starting at this port, i.e. the range
{port..(port+numLocales)} (e.g., with an Arkouda
server of 4 nodes and port=1234, Arkouda will use
ports 1234, 1235, 1236, and 1237 to send the array
data).
This port must match the port passed to the call to
ak.receive_array().
Returns:
A message indicating a complete transfer
Return type:
str
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Overwrite the dataset with the name provided with this Strings object. If
the dataset does not exist it is added
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str, default="strings_array") – Name of the dataset to create in files
save_offsets (bool, default=True) – Defaults to True, which instructs the server to save the offsets array to HDF5.
If False, the offsets array will not be saved and will be derived from the string values
upon load/read.
repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to False will yield better performance, but will cause
file sizes to expand.
Returns:
success message if successful
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If file does not contain File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Represents a duration, the difference between two dates or times.
Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.
Parameters:
pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array)
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas
and numpy arrays, which carry their own unit. Not case-sensitive;
prefixes of full names (like ‘sec’) are accepted.
Possible values:
'weeks' or 'w'
'days' or 'd'
'hours' or 'h'
'minutes', 'm', or 't'
'seconds' or 's'
'milliseconds', 'ms', or 'l'
'microseconds', 'us', or 'u'
'nanoseconds', 'ns', or 'n'
Unlike in pandas, units cannot be combined or mixed with integers
Notes
The .values attribute is always in nanoseconds with int64 dtype.
Register this Timedelta object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the timedelta is to be registered under,
this will be the root name for underlying components
Returns:
The same Timedelta which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Timedeltas with the same name.
Return sum of array elements along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
numeric_scalars if axis is omitted, in which case operation is done over entire array
pdarray if axis is supplied, in which case the operation is done along that axis
The bool type is not a subclass of the int_ type
(the bool is not even a number type). This is different
than Python’s default implementation of bool as a
sub-class of int.
The function should take one argument
and return a new value. The function will then be called on each element in
the pdarray.
Warning: This function is experimental and may not work as expected.
Known limitations:
- Any python modules used inside of the function must be installed on the server.
Parameters:
arr (pdarray) – The pdarray to which the function is applied
func (Union[Callable, str]) – The function to apply to the array. This can be a callable function or
a string, but either way it should take a single argument and return a
single value. If a string, it should be a lambda function that takes a
single argument, e.g. “lambda x,: x+1”. Note the dangling comma after
the argument, this is required for string functions.
result_dtype (Optional[Union[np.dtype, str]]) – The dtype of the resulting pdarray. If None, the dtype of the resulting
pdarray will be the same as the input pdarray. If a string, it should be
a valid numpy dtype string, e.g. “float64”. If a numpy dtype, it should
be a valid numpy dtype object, e.g. np.float64. This is not supported
for functions passed as strings.
Returns:
The pdarray resulting from applying the function to the input array
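For illustration, a minimal sketch (the function is experimental; this assumes it is available as ak.apply in your build):
>>> import arkouda as ak
>>> a = ak.arange(5)
>>> ak.apply(a, lambda x: x * 2)                          # callable form
>>> ak.apply(a, "lambda x,: x + 1")                       # string form; note the dangling comma
>>> ak.apply(a, lambda x: x / 2, result_dtype="float64")  # override the result dtype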
Create a pdarray of consecutive integers within the interval [start, stop).
If only one arg is given then arg is the stop parameter. If two args are
given, then the first arg is start and second is stop. If three args are
given, then the first arg is start, second is stop, third is step.
step (int_scalars, optional) – if one of these three is supplied, it’s used as stop, and start = 0, step = 1
if two of them are supplied, start = start, stop = stop, step = 1
if all three are supplied, start = start, stop = stop, step = step
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
Returns:
Integers from start (inclusive) to stop (exclusive) by step
Negative steps result in decreasing values. Currently, only int64
pdarrays can be created with this method. For float64 arrays, use
the linspace method.
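For example, the calling forms described above (expected values in comments):
>>> import arkouda as ak
>>> ak.arange(5)          # stop only: 0 1 2 3 4
>>> ak.arange(2, 5)       # start, stop: 2 3 4
>>> ak.arange(0, 10, 3)   # start, stop, step: 0 3 6 9
>>> ak.arange(5, 0, -1)   # negative step: 5 4 3 2 1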
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the inverse cosine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse cosine for each element
of the original pdarray
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the inverse hyperbolic cosine will be applied to the corresponding value. Elsewhere, it will
retain its original value. Default set to True.
Returns:
A pdarray containing inverse hyperbolic cosine for each element
of the original pdarray
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the inverse sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse sine for each element
of the original pdarray
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the inverse hyperbolic sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse hyperbolic sine for each element
of the original pdarray
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the inverse tangent will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse tangent for each element
of the original pdarray
Return the element-wise inverse tangent of the array pair. The result chosen is the
signed angle in radians between the ray ending at the origin and passing through the
point (1,0), and the ray ending at the origin and passing through the point (denom, num).
The result is between -pi and pi.
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the inverse tangent will be applied to the corresponding values. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse tangent for each corresponding element pair
of the original pdarray, using the signed values of the numerator and
denominator to get proper placement on the unit circle.
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the inverse hyperbolic tangent will be applied to the corresponding value. Elsewhere,
it will retain its original value. Default set to True.
Returns:
A pdarray containing inverse hyperbolic tangent for each element
of the original pdarray
Raises:
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty, or pda.ndim > 1, or k < 1
Notes
This call is equivalent in value to ak.argsort(a)[k:]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
about a k of 5 million is where performance degradation has been observed.
Raises:
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty, or pda.ndim > 1, or k < 1
Notes
This call is equivalent in value to ak.argsort(a)[:k]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
about a k of 5 million is where performance degradation has been observed.
TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a
list, array, tuple, or deque
RuntimeError – Raised if nbytes > maxTransferBytes, if a.dtype is not supported (not in DTypes),
or if the product of a.size and a.itemsize > maxTransferBytes
ValueError – Raised if a's rank is not in get_array_ranks(), or if the returned message is malformed or does
not contain the fields required to generate the array.
The number of bytes in the input array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overwhelming the connection between the Python client and the arkouda
server, under the assumption that it is a low-bandwidth connection. The user
may override this limit by setting ak.client.maxTransferBytes to a larger value,
but should proceed with caution.
If the pdarray or ndarray is of dtype 'U' (Unicode strings), this method is called twice recursively
to create the Strings object and the two corresponding pdarrays for string
bytes and offsets, respectively.
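For illustration, a few minimal conversions (assuming a connected Arkouda server):
>>> import arkouda as ak
>>> import numpy as np
>>> ak.array([1, 2, 3])                       # Python list -> int64 pdarray
>>> ak.array(np.arange(3, dtype=np.float64))  # NumPy array -> float64 pdarray
>>> ak.array(['a', 'bb', 'ccc'])              # list of str -> Strings object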
Compares two pdarrays for equality.
If neither array has any nan elements, then if all elements are pairwise equal,
it returns True.
If equal_nan is False, then any nan element in either array gives a False return.
If equal_nan is True, then pairwise-corresponding nans are considered equal.
check_dtype (bool, default True) – Check that integer dtype of the codes are the same.
check_category_order (bool, default True) – Whether the order of the categories should be compared, which
implies identical integer codes. If False, only the resulting
values are compared. The ordered attribute is
checked regardless.
obj (str, default 'Categorical') – Specify object name being compared, internally used to show appropriate
assertion message.
Assert that two dictionaries are equal.
Values must be arkouda objects.
Parameters:
left (dict) – The first dictionary to be compared.
right (dict) – The second dictionary to be compared.
compare_keys (bool) – Whether to compare the keys.
This function is intended to compare two DataFrames and output any
differences. It is mostly intended for use in unit tests.
Additional parameters allow varying the strictness of the
equality checks performed.
check_dtype (bool, default True) – Whether to check the DataFrame dtype is identical.
check_index_type (bool, default = True) – Whether to check the Index class, dtype and inferred_type
are identical.
check_column_type (bool or {'equiv'}, default 'equiv') – Whether to check the columns class, dtype and inferred_type
are identical. Is passed as the exact argument of
assert_index_equal().
check_frame_type (bool, default True) – Whether to check the DataFrame class is identical.
check_names (bool, default True) – Whether to check that the names attribute for both the index
and column attributes of the DataFrame is identical.
check_exact (bool, default False) – Whether to compare numbers exactly.
check_like (bool, default False) – If True, ignore the order of index & columns.
Note: index labels must match their respective rows
(same as in columns) - same labels must be with the same data.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'DataFrame') – Specify object name being compared, internally used to show appropriate
assertion message.
This function is intended to compare two DataFrames and output any
differences. It is mostly intended for use in unit tests.
Additional parameters allow varying the strictness of the
equality checks performed.
pd.DataFrames will be converted to the arkouda equivalent.
Then assert_frame_equal will be applied to the result.
Parameters:
left (DataFrame or pd.DataFrame) – First DataFrame to compare.
right (DataFrame or pd.DataFrame) – Second DataFrame to compare.
check_dtype (bool, default True) – Whether to check the DataFrame dtype is identical.
check_index_type (bool, default = True) – Whether to check the Index class, dtype and inferred_type
are identical.
check_column_type (bool or {'equiv'}, default 'equiv') – Whether to check the columns class, dtype and inferred_type
are identical. Is passed as the exact argument of
assert_index_equal().
check_frame_type (bool, default True) – Whether to check the DataFrame class is identical.
check_names (bool, default True) – Whether to check that the names attribute for both the index
and column attributes of the DataFrame is identical.
check_exact (bool, default False) – Whether to compare numbers exactly.
check_like (bool, default False) – If True, ignore the order of index & columns.
Note: index labels must match their respective rows
(same as in columns) - same labels must be with the same data.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'DataFrame') – Specify object name being compared, internally used to show appropriate
assertion message.
check_order (bool, default True) – Whether to compare the order of index entries as well as their values.
If True, both indexes must contain the same elements, in the same order.
If False, both indexes must contain the same elements, but in any order.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Index') – Specify object name being compared, internally used to show appropriate
assertion message.
check_order (bool, default True) – Whether to compare the order of index entries as well as their values.
If True, both indexes must contain the same elements, in the same order.
If False, both indexes must contain the same elements, but in any order.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Index') – Specify object name being compared, internally used to show appropriate
assertion message.
check_category_order (bool, default True) – Whether to compare category order of internal Categoricals.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Series') – Specify object name being compared, internally used to show appropriate
assertion message.
check_index (bool, default True) – Whether to check index equivalence. If False, then compare only values.
check_like (bool, default False) – If True, ignore the order of the index. Must be False if check_index is False.
Note: same labels must be with the same data.
check_category_order (bool, default True) – Whether to compare category order of internal Categoricals.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Series') – Specify object name being compared, internally used to show appropriate
assertion message.
check_index (bool, default True) – Whether to check index equivalence. If False, then compare only values.
check_like (bool, default False) – If True, ignore the order of the index. Must be False if check_index is False.
Note: same labels must be with the same data.
Attach a previously created Arkouda object by its registered name.
This function retrieves an Arkouda object (e.g., pdarray, DataFrame,
Series, etc.) associated with a given name. It returns the corresponding
object based on the type of object stored under that name.
Parameters:
name (str) – The name of the object to attach.
Returns:
The Arkouda object associated with the given name. The returned object
could be of any supported type, such as pdarray, DataFrame, Series,
etc.
Return type:
object
Raises:
ValueError – If the object type in the response message does not match any known types.
Convert a number or string to an integer, or return 0 if no arguments
are given. If x is a number, return x.__int__(). For floating-point
numbers, this truncates towards zero.
If x is not a number or if base is given, then x must be a string,
bytes, or bytearray instance representing an integer literal in the
given base. The literal can be preceded by ‘+’ or ‘-’ and be surrounded
by whitespace. The base defaults to 10. Valid bases are 0 and 2-36.
Base 0 means to interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to ‘utf-8’.
errors defaults to ‘strict’.
Create a bigint pdarray from an iterable of uint pdarrays.
The first item in arrays will be the highest 64 bits and
the last item will be the lowest 64 bits.
Parameters:
arrays (Sequence[pdarray]) – An iterable of uint pdarrays used to construct the bigint pdarray.
The first item in arrays will be the highest 64 bits and
the last item will be the lowest 64 bits.
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
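For illustration, a minimal sketch (assuming a connected Arkouda server; ak.cast is used here to obtain uint64 inputs):
>>> import arkouda as ak
>>> hi = ak.cast(ak.arange(3), ak.uint64)     # highest 64 bits
>>> lo = ak.cast(ak.arange(3), ak.uint64)     # lowest 64 bits
>>> a = ak.bigint_from_uint_arrays([hi, lo])  # value[i] = (hi[i] << 64) | lo[i]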
Return the binary representation of the input number as a string.
For negative numbers, if width is not given, a minus sign is added to the
front. If width is given, the two’s complement of the number is
returned, with respect to that width.
In a two’s-complement system negative numbers are represented by the two’s
complement of the absolute value. This is the most common method of
representing signed integers on computers [1]_. A N-bit two’s-complement
system can represent every integer in the range
\(-2^{N-1}\) to \(+2^{N-1}-1\).
Parameters:
num (int) – Only an integer decimal number can be used.
width (int, optional) – The length of the returned string if num is positive, or the length
of the two’s complement if num is negative, provided that width is
at least a sufficient number of bits for num to be represented in
the designated form. If the width value is insufficient, an error is
raised.
Returns:
bin – Binary representation of num or two’s complement of num.
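For example, assuming the NumPy-compatible behavior described above (expected strings in comments):
>>> import arkouda as ak
>>> ak.binary_repr(3)            # '11'
>>> ak.binary_repr(-3)           # '-11' (minus sign when width is omitted)
>>> ak.binary_repr(3, width=4)   # '0011'
>>> ak.binary_repr(-3, width=4)  # '1101' (two's complement at the given width)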
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
Parameters:
segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array.
Must be sorted in ascending order.
values (pdarray, Strings) – The values to broadcast, one per row (or group)
size (int) – The total number of nonzeros in the matrix. If permutation is given, this
argument is ignored and the size is inferred from the permutation array.
permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering
grouped by row. To broadcast values back to the original ordering, this
permutation will be inverted. If no permutation is supplied, it is assumed
that the original nonzeros were already grouped by row. In this case, the
size argument must be given.
Raises:
ValueError – Raised if the number of nonzeros (either user-specified or inferred
from permutation) is less than one
Examples
>>> import arkouda as ak
>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
If the original nonzeros were in reverse order…
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
Create a “broadcasted” array (of rank ‘nd’) by copying an array into an
array of the given shape.
E.g., given the following broadcast:
pda (3d array):    1 x 4 x 1
shape:             7 x 4 x 2
Result (3d array): 7 x 4 x 2
When copying from a singleton dimension, the value is repeated along
that dimension (e.g., pda’s 1st and 3rd above).
For non singleton dimensions, the size of the two arrays must match,
and the values are copied into the result array.
When prepending a new dimension to increase an array’s rank, the
values from the other dimensions are repeated along the new dimension.
The cast is performed according to Chapel’s casting rules and is NOT safe
from overflows or underflows. The user must ensure that the target dtype
has the precision and capacity to hold the desired result.
hi (numeric_scalars or pdarray) – the higher value of the clipping range
If lo or hi (or both) are pdarrays, the comparison is performed element-wise.
See examples.
Returns:
A pdarray matching pda, except that element x remains x if lo <= x <= hi,
becomes lo if x < lo, and becomes hi if x > hi.
Either lo or hi may be None, but not both.
If lo > hi, all x = hi.
If all inputs are int64, output is int64, but if any input is float64, output is float64.
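For example (expected values in comments):
>>> import arkouda as ak
>>> a = ak.array([1, 5, 9])
>>> ak.clip(a, 3, 7)                    # [3 5 7]
>>> ak.clip(a, None, 7)                 # upper bound only: [1 5 7]
>>> ak.clip(a, ak.array([0, 6, 0]), 7)  # element-wise lower bounds: [1 6 7]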
Return the permutation that groups the rows (left-to-right), if the
input arrays are treated as columns. The permutation sorts numeric
columns, but not Strings or Categoricals — those are grouped, not ordered.
Parameters:
arrays (Sequence of Strings, pdarray, or Categorical) – The columns (int64, uint64, float64, Strings, or Categorical) to sort by row.
algorithm (SortingAlgorithm, default=SortingAlgorithm.RadixSortLSD) – The algorithm to be used for sorting the arrays.
ascending (bool, default=True) – Whether to sort in ascending order. Ignored when arrays have ndim > 1.
Returns:
The indices that permute the rows into grouped order.
Uses a least-significant-digit radix sort, which is stable and resilient
to non-uniformity in data but communication intensive. Starts with the
last array and moves forward.
For Strings, sorting is based on a hash. This ensures grouping of identical strings,
but not lexicographic order. For Categoricals, sorting is based on the internal codes.
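For illustration, a minimal grouping sketch (assuming a connected Arkouda server):
>>> import arkouda as ak
>>> a = ak.array([0, 1, 0, 1])
>>> b = ak.array([1, 1, 0, 0])
>>> perm = ak.coargsort([a, b])
>>> a[perm], b[perm]   # rows grouped as (0,0), (0,1), (1,0), (1,1)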
axis (int, default=0) – The axis along which the arrays will be joined.
If axis is None, arrays are flattened before use. Only for use with pdarray, and when
ordered is True.
ordered (bool) – If True (default), the arrays will be appended in the
order given. If False, array data may be interleaved
in blocks, which can greatly improve performance but
results in non-deterministic ordering of elements.
Returns:
Single pdarray or Strings object containing all values, returned in
the original order
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the cosine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing cosine for each element
of the original pdarray
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the hyperbolic cosine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing hyperbolic cosine for each element
of the original pdarray
Create a fixed frequency Datetime range. Alias for
ak.Datetime(pd.date_range(args)). Subject to size limit
imposed by client.maxTransferBytes.
Parameters:
start (str or datetime-like, optional) – Left bound for generating dates.
end (str or datetime-like, optional) – Right bound for generating dates.
periods (int, optional) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See
timeseries.offset_aliases for a list of
frequency aliases.
tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example
‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is
timezone-naive.
normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.
name (str, default None) – Name of the resulting DatetimeIndex.
inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries. Whether to set each bound as closed or open.
**kwargs – For compatibility. Has no effect on the result.
Returns:
rng
Return type:
DatetimeIndex
Notes
Of the four parameters start, end, periods, and freq,
exactly three must be specified. If freq is omitted, the resulting
DatetimeIndex will have periods linearly spaced elements between
start and end (closed on both sides).
To learn more about frequency strings, see the pandas documentation on offset aliases.
If created from a 64-bit integer, it represents an offset from
1970-01-01T00:00:00.
If created from string, the string can be in ISO 8601 date
or datetime format.
When parsing a string to create a datetime object, if the string contains
a trailing timezone (a 'Z' or a timezone offset), the timezone will be
dropped and a UserWarning is given.
Datetime64 objects should be considered to be UTC and therefore have an
offset of +0000.
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be converted from degrees to radians. Elsewhere, it will retain its
original value. Default set to True.
Returns:
A pdarray containing an angle converted to radians, from degrees, for each element
of the original pdarray
obj (slice, int, Sequence of int, Sequence of bool, or pdarray) – The indices to remove from ‘arr’. If obj is a pdarray, it must
have an integer or bool dtype.
axis (Optional[int], optional) – The axis along which to remove elements. If None, the array will
be flattened before removing elements. Defaults to None.
Calculate the n-th discrete difference along the given axis.
The first difference is given by out[i] = a[i+1] - a[i] along the given axis;
higher differences are calculated by applying diff iteratively.
Parameters:
a (pdarray) – The array to calculate the difference
n (int, optional) – The order of the finite difference. Default is 1.
axis (int, optional) – The axis along which to calculate the difference. Default is the last axis.
prepend (pdarray, optional) – The pdarray to prepend to a along axis before calculating the difference.
append (pdarray, optional) – The pdarray to append to a along axis before calculating the difference.
Returns:
The n-th differences. The shape of the output is the same as a
except along axis where the dimension is smaller by n. The
type of the output is the same as the type of the difference
between any two elements of a. This is the same as the type of
a in most cases. A notable exception is datetime64, which
results in a timedelta64 output array.
Type is preserved for boolean arrays, so the result will contain
False when consecutive elements are the same and True when they
differ.
For unsigned integer arrays, the results will also be unsigned. This
should not be surprising, as the result is consistent with
calculating the difference directly.
If this is not desirable, then the array should be cast to a larger
integer type first:
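A sketch of that workaround, mirroring the NumPy example (assuming uint64 support and ak.cast; expected results in comments):
>>> import arkouda as ak
>>> u = ak.array([1, 0], dtype=ak.uint64)
>>> ak.diff(u)                     # wraps around to a very large unsigned value
>>> ak.diff(ak.cast(u, ak.int64))  # cast first to get the signed result: [-1]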
x (numeric_scalars(float_scalars, int_scalars) or pdarray) – The dividend array, the values that will be the numerator of the floor division and will be
acted on by the bases for modular division.
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be divided using floor and modular division. Elsewhere, it will retain
its original value. Default set to True.
Returns:
Returns a tuple that contains quotient and remainder of the division
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
Return indices of query items in a search list of items.
Parameters:
query ((sequence of) array-like) – The items to search for. If multiple arrays, each “row” is an item.
space ((sequence of) array-like) – The set of items in which to search. Must have same shape/dtype as query.
all_occurrences (bool) – When duplicate terms are present in search space, if all_occurrences is True,
return all occurrences found as a SegArray, otherwise return only the first
occurrences as a pdarray. Defaults to only finding the first occurrence.
Finding all occurrences is not yet supported on sequences of arrays
remove_missing (bool) – If all_occurrences is True, remove_missing is automatically enabled.
If False, return -1 for any items in query not found in space. If True,
remove these and only return indices of items that are found.
Returns:
indices – For each item in query, its index in space. If all_occurrences is False,
the return will be a pdarray of the first index where each value in the
query appears in the space. If all_occurrences is True, the return will be
a SegArray containing every index where each value in the query appears in
the space, and remove_missing is automatically enabled.
If remove_missing is True, missing values are excluded; otherwise -1 is returned for them.
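For illustration, a minimal sketch (assuming a connected Arkouda server; expected behavior in comments):
>>> import arkouda as ak
>>> space = ak.array([10, 20, 30, 20])
>>> query = ak.array([20, 40])
>>> ak.find(query, space)                        # first occurrences: [1 -1]
>>> ak.find(query, space, all_occurrences=True)  # SegArray of all occurrences; 40 is dropped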
Returns the dtype for which finfo returns information. For complex
input, the returned dtype is the associated float* dtype for its
real and complex components.
The difference between 1.0 and the next smallest representable float
larger than 1.0. For example, for 64-bit binary floats in the IEEE-754
standard, eps=2**-52, approximately 2.22e-16.
The difference between 1.0 and the next smallest representable float
less than 1.0. For example, for 64-bit binary floats in the IEEE-754
standard, epsneg=2**-53, approximately 1.11e-16.
For developers of NumPy: do not instantiate this at the module level.
The initial calculation of these parameters is expensive and negatively
impacts import times. These objects are cached, so calling finfo()
repeatedly inside your functions is not a problem.
Note that smallest_normal is not actually the smallest positive
representable value in a NumPy floating point type. As in the IEEE-754
standard [1]_, NumPy floating point types make use of subnormal numbers to
fill the gap between 0 and smallest_normal. However, subnormal numbers
may have significantly reduced precision [2].
This function can also be used for complex data types. In that case,
the output is the same as for the corresponding real float type
(e.g. numpy.finfo(numpy.csingle) is the same as numpy.finfo(numpy.single)),
and applies to both the real and imaginary components.
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
Raises:
TypeError – Raised if neither dividend nor divisor is a pdarray (at least one must be)
or if any scalar or pdarray element is not one of int, uint, float, bigint
Format a floating-point scalar as a decimal string in positional notation.
Provides control over rounding, trimming and padding. Uses and assumes
IEEE unbiased rounding. Uses the “Dragon4” algorithm.
Parameters:
x (python float or numpy floating scalar) – Value to format.
precision (non-negative integer or None, optional) – Maximum number of digits to print. May be None if unique is
True, but must be an integer if unique is False.
unique (boolean, optional) – If True, use a digit-generation strategy which gives the shortest
representation which uniquely identifies the floating-point number from
other values of the same type, by judicious rounding. If precision
is given fewer digits than necessary can be printed, or if min_digits
is given more can be printed, in which cases the last digit is rounded
with unbiased rounding.
If False, digits are generated as if printing an infinite-precision
value and stopping after precision digits, rounding the remaining
value with unbiased rounding
fractional (boolean, optional) – If True, the cutoffs of precision and min_digits refer to the
total number of digits after the decimal point, including leading
zeros.
If False, precision and min_digits refer to the total number of
significant digits, before or after the decimal point, ignoring leading
zeros.
trim (one of 'k', '.', '0', '-', optional) –
Controls post-processing trimming of trailing digits, as follows:
’k’ : keep trailing zeros, keep decimal point (no trimming)
’.’ : trim all trailing zeros, leave decimal point
’0’ : trim all but the zero before the decimal point. Insert the
zero if it is missing.
’-’ : trim trailing zeros and any trailing decimal point
sign (boolean, optional) – Whether to show the sign for positive values.
pad_left (non-negative integer, optional) – Pad the left side of the string with whitespace until at least that
many characters are to the left of the decimal point.
pad_right (non-negative integer, optional) – Pad the right side of the string with whitespace until at least that
many characters are to the right of the decimal point.
min_digits (non-negative integer or None, optional) –
Minimum number of digits to print. Only has an effect if unique=True
in which case additional digits past those necessary to uniquely
identify the value may be printed, rounding the last additional digit.
Added in version 1.21.0.
Returns:
rep – The string representation of the floating point value
Format a floating-point scalar as a decimal string in scientific notation.
Provides control over rounding, trimming and padding. Uses and assumes
IEEE unbiased rounding. Uses the “Dragon4” algorithm.
Parameters:
x (python float or numpy floating scalar) – Value to format.
precision (non-negative integer or None, optional) – Maximum number of digits to print. May be None if unique is
True, but must be an integer if unique is False.
unique (boolean, optional) – If True, use a digit-generation strategy which gives the shortest
representation which uniquely identifies the floating-point number from
other values of the same type, by judicious rounding. If precision
is given fewer digits than necessary can be printed. If min_digits
is given more can be printed, in which cases the last digit is rounded
with unbiased rounding.
If False, digits are generated as if printing an infinite-precision
value and stopping after precision digits, rounding the remaining
value with unbiased rounding
trim (one of 'k', '.', '0', '-', optional) –
Controls post-processing trimming of trailing digits, as follows:
’k’ : keep trailing zeros, keep decimal point (no trimming)
’.’ : trim all trailing zeros, leave decimal point
’0’ : trim all but the zero before the decimal point. Insert the
zero if it is missing.
’-’ : trim trailing zeros and any trailing decimal point
sign (boolean, optional) – Whether to show the sign for positive values.
pad_left (non-negative integer, optional) – Pad the left side of the string with whitespace until at least that
many characters are to the left of the decimal point.
exp_digits (non-negative integer, optional) – Pad the exponent with zeros until it contains at least this
many digits. If omitted, the exponent will be at least 2 digits.
min_digits (non-negative integer or None, optional) –
Minimum number of digits to print. This only has an effect for
unique=True. In that case more digits than necessary to uniquely
identify the value may be printed and rounded unbiased.
Added in version 1.21.0.
Returns:
rep – The string representation of the floating point value
Converts a Pandas Series to an Arkouda pdarray or Strings object. If
dtype is None, the dtype is inferred from the Pandas Series. Otherwise,
the dtype parameter is set if the dtype of the Pandas Series is to be
overridden or is unknown (for example, in situations where the Series
dtype is object).
Parameters:
series (Pandas Series) – The Pandas Series with a dtype of bool, float64, int64, or string
dtype (Optional[type]) – The valid dtype types are np.bool, np.float64, np.int64, and np.str
The supported datatypes are bool, float64, int64, string, and datetime64[ns]. The
data type is either inferred from the Series or is set via the dtype parameter.
Series of datetime or timedelta are converted to Arkouda arrays of dtype int64 (nanoseconds)
A Pandas Series containing strings has a dtype of object. Arkouda assumes the Series
contains strings and sets the dtype to str
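For example, a minimal sketch (assuming a connected Arkouda server):
>>> import arkouda as ak
>>> import pandas as pd
>>> ak.from_series(pd.Series([1, 2, 3]))              # -> int64 pdarray
>>> ak.from_series(pd.Series(['a', 'b']), dtype=str)  # object Series -> Strings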
Get the names of the datasets in the provided files.
Parameters:
filenames (str or List[str]) – Name of the file/s from which to return datasets
allow_errors (bool) – Default: False
Whether or not to allow errors while accessing datasets
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Only used for Parquet Files.
Get null indices of a string column in a Parquet file.
Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read. Each dataset must be a string
column. There is no default value for this function; the datasets to be
read must be specified.
Returns:
Dictionary of {datasetName: pdarray}
Return type:
returns a dictionary of Arkouda pdarrays
Raises:
RuntimeError – Raised if one or more of the specified files cannot be opened.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
full (bool, default=True) – This is only used when a single pdarray is passed into hash
By default, a 128-bit hash is computed and returned as
two int64 arrays. If full=False, then a 64-bit hash
is computed and returned as a single int64 array.
Returns:
If full=True or a list of pdarrays is passed,
a 2-tuple of pdarrays containing the high
and low 64 bits of each hash, respectively.
If full=False and a single pdarray is passed,
a single pdarray containing a 64-bit hash
Return type:
hashes
Raises:
TypeError – Raised if the parameter is not a pdarray
In the case of a single pdarray being passed, this function
uses the SIPhash algorithm, which can output either a 64-bit
or 128-bit hash. However, the 64-bit hash runs a significant
risk of collisions when applied to more than a few million
unique values. Unless the number of unique values is known to
be small, the 128-bit hash is strongly recommended.
Note that this hash should not be used for security, or for
any cryptographic application. Not only is SIPhash not
intended for such uses, but this implementation employs a
fixed key for the hash, which makes it possible for an
adversary with control over input to engineer collisions.
In the case of a list of pdarrays, Strings, Categoricals, or SegArrays
being passed, a non-linear function must be applied to each
array since hashes of subsequent arrays cannot be simply XORed
because equivalent values will cancel each other out, hence we
do a rotation by the ordinal of the array.
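For illustration, a minimal sketch of the return shapes described above (assuming a connected Arkouda server):
>>> import arkouda as ak
>>> a = ak.arange(5)
>>> hi, lo = ak.hash(a)          # default full=True: high and low 64 bits
>>> h = ak.hash(a, full=False)   # single 64-bit hash per element
>>> hi2, lo2 = ak.hash([a, a])   # list input always yields a 2-tuple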
To plot, export the left edges and the histogram to NumPy
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> b_np = b.to_ndarray()
>>> b_widths = np.diff(b_np)
>>> plt.bar(b_np[:-1], h.to_ndarray(), width=b_widths, align='edge', edgecolor='black')
<BarContainer object of 3 artists>
>>> plt.show() # doctest: +SKIP
Compute the bi-dimensional histogram of two data samples with evenly spaced bins
Parameters:
x (pdarray) – A pdarray containing the x coordinates of the points to be histogrammed.
y (pdarray) – A pdarray containing the y coordinates of the points to be histogrammed.
bins (int_scalars or [int, int], default=10) – The number of equal-size bins to use.
If int, the number of bins for the two dimensions (nx=ny=bins).
If [int, int], the number of bins in each dimension (nx, ny = bins).
Defaults to 10
range (((xMin, xMax), (yMin, yMax)), optional) – The ranges of the values in x and y to count.
Values outside of these ranges are dropped.
By default, all values are counted.
Returns:
hist: pdarray, shape(nx, ny)
The bi-dimensional histogram of samples x and y.
Values in x are histogrammed along the first dimension and
values in y are histogrammed along the second dimension.
The x bins are evenly spaced in the interval [x.min(), x.max()]
and y bins are evenly spaced in the interval [y.min(), y.max()].
If range parameter is provided, the intervals are given
by range[0] for x and range[1] for y.
Compute the multidimensional histogram of data in sample with evenly spaced bins.
Parameters:
sample (Sequence of pdarray) – A sequence of pdarrays containing the coordinates of the points to be histogrammed.
bins (int_scalars or Sequence of int_scalars, default=10) – The number of equal-size bins to use.
If int, the number of bins for all dimensions (nx=ny=…=bins).
If [int, int, …], the number of bins in each dimension (nx, ny, … = bins).
Defaults to 10
range (Sequence[optional (minVal, maxVal)], optional) – The ranges of the values to count for each array in sample.
Values outside of these ranges are dropped.
By default, all values are counted.
Returns:
hist: pdarray, shape(nx, ny, ..., nd)
The multidimensional histogram of pdarrays in sample.
Values in first pdarray are histogrammed along the first dimension.
Values in second pdarray are histogrammed along the second dimension and so on.
edges: List[pdarray]
A list of pdarrays containing the bin edges for each dimension.
Stack arrays in sequence horizontally (column wise).
This is equivalent to concatenation along the second axis, except for 1-D arrays
where it concatenates along the first axis. Rebuilds arrays divided by hsplit.
This function makes most sense for arrays with up to 3 dimensions. For instance, for pixel-data
with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions
concatenate, stack and block provide more general stacking and concatenation operations.
Parameters:
tup (sequence of pdarray) – The arrays must have the same shape along all but the second axis, except 1-D arrays which
can be any length. In the case of a single array_like input, it will be treated as a sequence of
arrays; i.e., each element along the zeroth axis is treated as a separate array.
dtype (str or type, optional) – If provided, the destination array will have this type.
casting ({‘no’, ‘equiv’, ‘safe’, ‘same_kind’, ‘unsafe’}, optional) – Controls what kind of data casting may occur. Defaults to ‘same_kind’. Currently unused.
B (list of pdarrays, pdarray, Strings, or Categorical) – The set of elements in which to test membership
assume_unique (bool, optional, defaults to False) – If true, assume rows of a and b are each unique and sorted.
By default, sort and unique them explicitly.
symmetric (bool, optional, defaults to False) – Return in1d(A, B), in1d(B, A) when A and B are single items.
invert (bool, optional, defaults to False) – If True, the values in the returned array are inverted (that is,
False where an element of A is in B and True otherwise).
Default is False. ak.in1d(a,b,invert=True) is equivalent
to (but is faster than) ~ak.in1d(a,b).
Raises:
TypeError – Raised if either A or B is not a pdarray, Strings, or Categorical
object, or if both are pdarrays and either has rank > 1,
or if invert is not a bool
RuntimeError – Raised if the dtype of either array is not supported
in1d can be considered as an element-wise function version of the
python keyword in, for 1-D sequences. in1d(a,b) is logically
equivalent to ak.array([item in b for item in a]), but is much
faster and scales to arbitrarily large a.
ak.in1d is not supported for bool or float64 pdarrays
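For example (expected values in comments):
>>> import arkouda as ak
>>> A = ak.array([1, 2, 3, 4])
>>> B = ak.array([2, 4, 6])
>>> ak.in1d(A, B)                # [False True False True]
>>> ak.in1d(A, B, invert=True)   # [True False True False]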
Return indices of query items in a search list of items. Items not found will be excluded.
When duplicate terms are present in search space return indices of all occurrences.
Parameters:
query ((sequence of) pdarray or Strings or Categorical) – The items to search for. If multiple arrays, each “row” is an item.
space ((sequence of) pdarray or Strings or Categorical) – The set of items in which to search. Must have same shape/dtype as query.
Returns:
For each item in query that is found in space, its index in space.
Return a JSON formatted string containing information about the objects in names.
Parameters:
names (Union[List[str], str]) – names is either the name of an object or list of names of objects to retrieve info
if names is ak.AllSymbols, retrieves info for all symbols in the symbol table
if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry
Returns:
JSON formatted string containing a list of information for each object in names
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown in the process of
retrieving information about the objects in names
positions (bool, default=True) – Return tuple of boolean pdarrays that indicate positions in a and b
of the intersection values.
unique (bool, default=False) – If the number of distinct values in a (and b) is equal to the size of
a (and b), there is a more efficient method to compute the intersection.
Returns:
(arkouda.numpy.pdarrayclass.pdarray, arkouda.numpy.pdarrayclass.pdarray) or
arkouda.numpy.pdarrayclass.pdarray – The indices of a and b where any element occurs at least once in both
arrays.
Apply a function defined over intervals to an array of arguments.
Parameters:
keys (2-tuple of (sequences of) pdarrays) – Tuple of closed intervals expressed as (lower_bounds_inclusive, upper_bounds_inclusive).
Must have same dtype(s) as vals.
values (pdarray) – Function value to return for each entry in keys.
arguments ((sequences of) pdarray) – Values to search for in intervals. If multiple arrays, each “row” is an item.
fillvalue (scalar) – Default value to return when argument is not in any interval.
tiebreak ((optional) pdarray, numeric) – When an argument is present in more than one key interval, the interval with the
lowest tiebreak value will be chosen. If no tiebreak is given, the
first valid key interval will be chosen.
Returns:
Value of function corresponding to the keys interval
containing each argument, or fillvalue if argument not
in any interval.
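A minimal sketch, assuming the function described here is ak.interval_lookup (the entry-point name is an assumption; outputs are illustrative):

>>> import arkouda as ak
>>> lower = ak.array([0, 10])
>>> upper = ak.array([4, 14])
>>> values = ak.array([100, 200])
>>> args = ak.array([2, 12, 7])
>>> ak.interval_lookup((lower, upper), values, args, fillvalue=-1)  # doctest: +SKIP
array([100 200 -1])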
This helper is intended to help future proof changes made to
accommodate IPv6 and to prevent errors if a user inadvertently
casts an IPv4 value instead of an int64 pdarray. It can also be used
for importing Python lists of IP addresses into Arkouda.
Determine if the provided name is associated with a registered Arkouda object.
This function checks if the name is found in the registry of objects,
and optionally checks if it is registered as a component of a registered object.
Parameters:
name (str) – The name to check for in the registry.
as_component (bool, default=False) – When True, the function checks if the name is registered as a component
of a registered object (rather than as a standalone object).
Returns:
True if the name is found in the registry, False otherwise.
Return type:
bool
Raises:
KeyError – If the registry query encounters an issue (e.g., invalid registry data or access issues).
Examples
>>> import arkouda as ak
Check if a name is registered as an object
>>> obj = ak.array([1, 2, 3])
>>> registered_obj = obj.register("my_array")
>>> result = ak.is_registered("my_array")
>>> print(result)
True
>>> registered_obj.unregister()
Check if a name is registered as a component
>>> result = ak.is_registered("my_component", as_component=True)
>>> print(result)
False
Returns True if the type of element is a scalar type.
Parameters:
element (any) – Input argument, can be of any type and shape.
Returns:
val – True if element is a scalar type, False if it is not.
Return type:
bool
See also
ndim
Get the number of dimensions of an array
Notes
If you need a stricter way to identify a numerical scalar, use
isinstance(x, numbers.Number), as that returns False for most
non-numerical elements such as strings.
In most cases np.ndim(x) == 0 should be used instead of this function,
as that will also return True for 0d arrays. This is how numpy overloads
functions in the style of the dx arguments to gradient and
the bins argument to histogram. Some key differences:
x | isscalar(x) | np.ndim(x) == 0
PEP 3141 numeric objects (including builtins) | True | True
builtin string and buffer objects | True | True
other builtin objects, like pathlib.Path, Exception, the result of re.compile | False | True
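For example, assuming ak.isscalar mirrors np.isscalar semantics (illustrative):

>>> import arkouda as ak
>>> ak.isscalar(3.1)  # doctest: +SKIP
True
>>> ak.isscalar(ak.array([3.1]))  # doctest: +SKIP
False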
Load a pdarray previously saved with pdarray.save().
Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – ‘INFER’, ‘HDF5’ or ‘Parquet’. Defaults to ‘INFER’. Used to indicate the file type being loaded.
If INFER, this will be detected during processing
dataset (str) – Dataset name where the pdarray was saved, defaults to ‘array’
calc_string_offsets (bool) – If True the server will ignore Segmented Strings ‘offsets’ array and derive
it from the null-byte terminators. Defaults to False currently
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]}
with the previously saved pdarrays, Strings, SegArrays, or Categoricals
TypeError – Raised if either path_prefix or dataset is not a str
ValueError – Raised if invalid file_format or if the dataset is not present in all hdf5 files or if the
path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening
one or more of them
If you have a previously saved Parquet file that is raising a FileNotFound error, try loading it
with a .parquet appended to the prefix_path.
Parquet files were previously ALWAYS stored with a .parquet extension.
ak.load does not support loading a single file.
For loading single HDF5 files without the _LOCALE#### suffix please use ak.read().
CSV files without the Arkouda Header are not supported.
Examples
>>> import arkouda as ak

Loading from file without extension

>>> obj = ak.load('path/prefix')  # doctest: +SKIP

Loads the array from numLocales files with the name cwd/path/name_prefix_LOCALE####.
The file type is inferred during processing.
Loading with an extension (HDF5)
>>> obj = ak.load('path/prefix.test')  # doctest: +SKIP
Loads the object from numLocales files with the name cwd/path/name_prefix_LOCALE####.test where
#### is replaced by each locale number. Because the filetype is inferred during processing,
the extension is not required to be a specific format.
Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().
Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – ‘INFER’, ‘HDF5’, ‘Parquet’, or ‘CSV’. Defaults to ‘INFER’. Indicates the format being loaded.
When ‘INFER’ the processing will detect the format
Defaults to ‘INFER’
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Parquet files only
Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]}
with the previously saved pdarrays, Strings, SegArrays, or Categoricals
ValueError – Raised if file_format/extension is encountered that is not hdf5 or parquet or
if all datasets are not present in all hdf5/parquet files or if the
path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening
one or more of them
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
Apply the function defined by the mapping keys –> values to arguments.
Parameters:
keys ((sequence of) array-like) – The domain of the function. Entries must be unique (if a sequence of
arrays is given, each row is treated as a tuple-valued entry).
values (pdarray) – The range of the function. Must be same length as keys.
arguments ((sequence of) array-like) – The arguments on which to evaluate the function. Must have same dtype
(or tuple of dtypes, for a sequence) as keys.
fillvalue (scalar) – The default value to return for arguments not in keys.
Returns:
evaluated – The result of evaluating the function over arguments.
While the values cannot be Strings (or other complex objects), the same
result can be achieved by passing an arange as the values, then using
the return as indices into the desired object.
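A minimal sketch, assuming the function described here is ak.lookup (the entry-point name is an assumption; outputs are illustrative):

>>> import arkouda as ak
>>> keys = ak.array([0, 1, 2])
>>> values = ak.array([10, 11, 12])
>>> args = ak.array([2, 2, 0, 5])
>>> ak.lookup(keys, values, args, fillvalue=-1)  # doctest: +SKIP
array([12 12 10 -1])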
List the contents of an HDF5 or Parquet file on the Arkouda server.
This function invokes the HDF5 h5ls utility on a file visible to the
Arkouda server, or simulates a similar listing for Parquet files. For CSV
files without headers, see ls_csv.
Parameters:
filename (str) – Path to the file on the Arkouda server. Must be a non-empty string.
col_delim (str, default=",") – Delimiter to use when interpreting CSV files.
read_nested (bool, default=True) – If True, include nested Parquet columns (e.g., SegArray). If False,
nested columns are ignored. Only applies to Parquet files.
Returns:
A list of lines describing each dataset or column in the file.
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty, or pda.ndim > 1, or k < 1
Notes
This call is equivalent in value to a[ak.argsort(a)[k:]]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
a k of about 5 million is where performance degradation has been observed.
Compute the median of a given array. 1d case only, for now.
Parameters:
pda (pdarray) – The input data, in pdarray form, numeric type or boolean
Returns:
The median of the entire pdarray
The array is sorted, and then if the number of elements is odd,
the return value is the middle element. If even, then the
mean of the two middle elements.
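For example (illustrative; assumes this describes ak.median):

>>> import arkouda as ak
>>> ak.median(ak.array([4, 1, 3, 2]))  # doctest: +SKIP
2.5
>>> ak.median(ak.array([4, 1, 3, 2, 5]))  # doctest: +SKIP
3.0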
left (DataFrame) – The Left DataFrame to be joined.
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on.
If on is None, this defaults to the intersection of the columns in both DataFrames.
left_on (str or List of str, optional) – Column name or names to join on in the left DataFrame. If this is not None, then right_on
must also not be None, and this will override on.
right_on (str or List of str, optional) – Column name or names to join on in the right DataFrame. If this is not None, then left_on
must also not be None, and this will override on.
how (str, default = "inner") – The merge condition.
Must be one of “inner”, “left”, “right”, or “outer”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping
column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping
column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64.
This is to match pandas.
If False, do not convert the column dtypes.
This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”.
Otherwise, the DataFrame is not sorted.
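A minimal sketch, assuming this describes ak.merge on ak.DataFrame objects (illustrative):

>>> import arkouda as ak
>>> left = ak.DataFrame({'key': ak.array([1, 2, 3]), 'a': ak.array([10, 20, 30])})
>>> right = ak.DataFrame({'key': ak.array([2, 3, 4]), 'b': ak.array([200, 300, 400])})
>>> merged = ak.merge(left, right, on='key', how='inner')  # doctest: +SKIP

With how='inner', only keys 2 and 3 appear in the result.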
ValueError – Raised if the pda is empty, or pda.ndim > 1, or k < 1
Notes
This call is equivalent in value to a[ak.argsort(a)[:k]]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
a k of about 5 million is where performance degradation has been observed.
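For example, assuming this describes ak.mink (illustrative):

>>> import arkouda as ak
>>> a = ak.array([5, 1, 4, 2, 3])
>>> ak.mink(a, 2)  # doctest: +SKIP
array([1 2])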
x2 (pdarray, numeric_scalars, or bigint) – The direction where to look for the next representable value of x1.
If x1.shape != x2.shape, they must be broadcastable to a common shape
(which becomes the shape of the output).
Returns:
The next representable values of x1 in the direction of x2.
This is a scalar if both x1 and x2 are scalars.
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
Included for consistency, as the binary representation of one is all zeros
ending in a single one, regardless of max_bits
The basic arkouda array class. This class contains only the
attributes of the array; the data resides on the arkouda
server. When a server operation results in a new array, arkouda
will create a pdarray instance that points to the array data on
the server. As such, the user should not initialize pdarray
instances directly.
Return True iff all elements of the array along the given axis evaluate to True.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
boolean if axis is omitted, pdarray if axis is supplied
Return True iff any element of the array along the given axis evaluates to True.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
boolean if axis is omitted, else pdarray if axis is supplied
Return index of the first occurrence of the maximum along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
int64 or uint64 if axis is omitted, in which case operation is done over entire array
pdarray if axis is supplied, in which case the operation is done along that axis
Return index of the first occurrence of the minimum along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
int64 or uint64 if axis is omitted, in which case operation is done over entire array
pdarray if axis is supplied, in which case the operation is done along that axis
Create a list of uint pdarrays from a bigint pdarray.
The first item in return will be the highest 64 bits of the
bigint pdarray and the last item will be the lowest 64 bits.
Returns:
A list of uint pdarrays where:
The first item in return will be the highest 64 bits of the
bigint pdarray and the last item will be the lowest 64 bits.
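A minimal sketch (illustrative; assumes arange values beyond the uint64 range produce a bigint pdarray):

>>> import arkouda as ak
>>> a = ak.arange(2**64, 2**64 + 3)  # doctest: +SKIP
>>> a.bigint_to_uint_arrays()  # doctest: +SKIP
[array([1 1 1]), array([0 1 2])]

Here each value is 2**64 plus a small offset, so the high-bits array is all ones and the low-bits array holds the offsets.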
Attempt to cast scalar other to the element dtype of this pdarray,
and print the resulting value to a string (e.g. for sending to a
server command). The user should not call this function directly.
Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
Return type:
string representation of np.dtype corresponding to the other parameter
Raises:
TypeError – Raised if the other parameter cannot be converted to
Numpy dtype
Return True iff the array (or given axis of the array) is monotonically non-decreasing.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
boolean if axis is omitted, else pdarray if axis is supplied
Return max of array elements along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
numeric_scalar if axis is omitted, in which case operation is done over entire array
pdarray if axis is supplied, in which case the operation is done along that axis
pda (pdarray) – Values for which to calculate the mean
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
The mean calculated from the pda sum and size, along the axis/axes if
those are given.
Return min of array elements along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
numeric_scalar if axis is omitted, in which case operation is done over entire array
pdarray if axis is supplied, in which case the operation is done along that axis
Return prod of array elements along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
numeric_scalars if axis is omitted, in which case operation is done over entire array
pdarray if axis is supplied, in which case the operation is done along that axis
Register this pdarray with a user defined name in the arkouda server
so it can be attached to later using pdarray.attach()
This is an in-place operation, registering a pdarray more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one pdarray at a time.
Parameters:
user_defined_name (str) – user defined name array is to be registered under
Returns:
The same pdarray which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different pdarrays with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name
If the user is attempting to register more than one pdarray with the same name,
the former should be unregistered first to free up the registration name.
Return the standard deviation of values in the array. The standard
deviation is implemented as the square root of the variance.
Parameters:
pda (pdarray) – values for which to calculate the standard deviation
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
The scalar standard deviation of the array, or the standard deviation
along the axis/axes if those are given.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
Return sum of array elements along the given axis.
Parameters:
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
numeric_scalars if axis is omitted, in which case operation is done over entire array
pdarray if axis is supplied, in which case the operation is done along that axis
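For example (illustrative):

>>> import arkouda as ak
>>> a = ak.arange(5)
>>> a.sum()  # doctest: +SKIP
10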
Write pdarray to CSV file(s). File will contain a single column
with the pdarray data. All CSV files written by Arkouda include
a header denoting data types of the columns.
Parameters:
prefix_path (str) – filename prefix to be used for saving files. Files will have
_LOCALE#### appended when they are written to disk.
dataset (str, defaults to "array") – column name to save the pdarray under.
col_delim (str, defaults to ",") – value to be used to separate columns within the file. Please
be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, defaults to False) – If True, existing files matching the provided path will be overwritten.
if False and existing files are found, an error will be returned.
Returns:
response message
Return type:
str
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one
or more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
if allow_errors is true, this may be raised if no values are returned
from the server.
TypeError – Raised if the server returns an unknown arkouda_type
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for all column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
Convert the array to a Numba DeviceNDArray, transferring array data from the
arkouda server to Python via ndarray. If the array exceeds a builtin size limit,
a RuntimeError is raised.
Returns:
A Numba DeviceNDArray with the same attributes and data as the pdarray, resident on the GPU
Return type:
numba.DeviceNDArray
Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving
the pdarray.
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Save the pdarray to HDF5.
The object can be saved to a collection of files or single file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> import arkouda as ak
>>> a = ak.arange(25)
Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')  # doctest: +SKIP
Saves the array to numLocales HDF5 files with the name cwd/path/name_prefix_LOCALE####
Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')  # doctest: +SKIP
Saves the array to numLocales HDF5 files with the name
cwd/path/name_prefix_LOCALE####.h5 where #### is replaced by each locale number
Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')  # doctest: +SKIP
Saves the array to a single HDF5 file on the root node.
cwd/path/name_prefix.hdf5
Convert the array to a list, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received do not match the expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Convert the array to a np.ndarray, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
Returns:
A numpy ndarray with the same attributes and data as the pdarray
Return type:
np.ndarray
Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received do not match the expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
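For example (illustrative):

>>> import arkouda as ak
>>> ak.arange(3).to_ndarray()  # doctest: +SKIP
array([0, 1, 2])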
Save the pdarray to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files (must not already exist)
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
compression (str (Optional)) – (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4")
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> import arkouda as ak
>>> a = ak.arange(25)
Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')  # doctest: +SKIP
Saves the array to numLocales Parquet files with the name cwd/path/name_prefix_LOCALE####
Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')  # doctest: +SKIP
Saves the array to numLocales Parquet files with the name
cwd/path/name_prefix_LOCALE####.parquet where #### is replaced by each locale number
hostname (str) – The hostname where the Arkouda server intended to
receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Overwrite the dataset with the name provided with this pdarray. If
the dataset does not exist, it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
pda (pdarray) – Values for which to calculate the variance
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
axis (int, Tuple[int, ...], optional, default = None) – The axis or axes along which to do the operation
If None, the computation is done across the entire array.
keepdims (bool, optional, default = False) – Whether to keep the singleton dimension(s) along axis in the result.
Returns:
The scalar variance of the array, or the variance along the axis/axes
if supplied
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
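For example (illustrative; with ddof=1 the divisor becomes N-1):

>>> import arkouda as ak
>>> a = ak.array([1.0, 2.0, 3.0, 4.0])
>>> ak.var(a)  # doctest: +SKIP
1.25
>>> ak.var(a, ddof=1)  # doctest: +SKIP
1.6666666666666667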
Compute the q-th percentile of the data along the specified axis.
Parameters:
a (pdarray) – data whose percentile will be computed
q (pdarray, Tuple, or np.ndarray) – a scalar, tuple, or np.ndarray of q values for the computation. All values
must be in the range 0 <= q <= 100
axis (None, int scalar, or tuple of int scalars) – the axis or axes along which the percentiles are computed. The default is None,
which computes the percentile along a flattened version of the array.
method (string) – one of "inverted_cdf", "averaged_inverted_cdf", "closest_observation",
"interpolated_inverted_cdf", "hazen", "weibull", "linear", "median_unbiased",
"normal_unbiased", "lower", "higher", "midpoint"
keepdims (bool) – True if the degenerate axes are to be retained after slicing, False if not
Returns:
If q is a scalar and axis is None, the result is a scalar.
If q is a scalar and axis is supplied, the result is a pdarray of rank len(axis)
less than the rank of a.
If q is an array and axis is None, the result is a pdarray of shape q.shape
If q is an array and axis is supplied, the result is a pdarray of rank q.ndim +
pda.ndim - len(axis). However, there is an intermediate result which is of rank
q.ndim + pda.ndim. If this is not in the compiled ranks, an error will be thrown
even if the final result would be in the compiled ranks.
np.percentile also supports the method "nearest"; however, its behavior does not match
the numpy documentation, so it's not supported here.
np.percentile also allows for weighted inputs, but only for the method "inverted_cdf".
That also is not supported here.
ValueError – Raised if scalar q or any value of array q is outside the range [0,100]
Raised if the method is not one of the 12 supported methods.
Raised if the result would have a rank not in the compiled ranks.
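For example (illustrative; the default "linear" method gives the familiar interpolated median at q=50):

>>> import arkouda as ak
>>> a = ak.array([1.0, 2.0, 3.0, 4.0])
>>> ak.percentile(a, 50)  # doctest: +SKIP
2.5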
Plot the distribution and cumulative distribution of histogram Data.
Parameters:
b (np.ndarray) – Bin edges
h (np.ndarray) – Histogram data
log (bool) – use log to scale y
xlabel (str) – Label for the x axis of the graph
newfig (bool) – Generate a new figure or not
Notes
This function does not return or display the plot. A user must have matplotlib imported in
addition to arkouda to display plots. This could be updated to return the object or have a
flag to show the resulting plots.
See Examples Below.
Raises an array to a power. If where is given, the operation will only take place in the positions
where the where condition is True.
Note:
Our implementation of the where argument deviates from numpy. The difference in behavior occurs
at positions where the where argument contains a False. In numpy, these positions will have
uninitialized memory (which can contain anything and will vary between runs). We have chosen to
instead return the value of the original array in these positions.
Parameters:
pda (pdarray) – A pdarray of values that will be raised to a power (pwr)
pwr (integer, float, or pdarray) – The power(s) that pda is raised to
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be raised to the respective power. Elsewhere, it will retain its
original value. Default set to True.
Returns:
a pdarray of values raised to a power, under the boolean where condition.
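For example (illustrative; note how positions where the where condition is False keep their original values):

>>> import arkouda as ak
>>> a = ak.arange(4)
>>> ak.power(a, 2)  # doctest: +SKIP
array([0 1 4 9])
>>> ak.power(a, 2, where=(a % 2 == 0))  # doctest: +SKIP
array([0 1 4 3])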
f_exp (pdarray, default = None) – The expected frequency.
ddof (int) – The delta degrees of freedom.
lambda (string, default = "pearson") –
The power in the Cressie-Read power divergence statistic.
Allowed values: “pearson”, “log-likelihood”, “freeman-tukey”, “mod-log-likelihood”,
“neyman”, “cressie-read”
Print verbose information for each object in names in a human readable format.
Parameters:
names (Union[List[str], str]) – names is either the name of an object or list of names of objects to retrieve info
if names is ak.AllSymbols, retrieves info for all symbols in the symbol table
if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry
Raises:
RuntimeError – Raised if a server-side error is thrown in the process of
retrieving information about the objects in names
Overwrite elements of A with elements from Values based upon a mask array.
Similar to numpy.putmask: where mask = False, A retains its original value,
but where mask = True, A is overwritten with the corresponding entry from Values.
This is similar to ak.where, except that (1) no new pdarray is created, and
(2) Values does not have to be the same size as A and mask.
Parameters:
A (pdarray) – Value(s) retained when mask is False (see Notes for allowed dtypes)
mask (pdarray) – Used to choose values from A or Values, must be same size as A, and of type ak.bool_
Values (pdarray) – Value(s) used when mask is True (see Notes for allowed dtypes)
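For example (illustrative):

>>> import arkouda as ak
>>> a = ak.array([1, 2, 3, 4])
>>> mask = ak.array([False, True, False, True])
>>> ak.putmask(a, mask, ak.array([10, 20, 30, 40]))  # doctest: +SKIP
>>> a  # doctest: +SKIP
array([1 20 3 40])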
Compute the q-th quantile of the data along the specified axis.
Parameters:
a (pdarray) – data whose quantile will be computed
q (pdarray, Tuple, or np.ndarray) – a scalar, tuple, or np.ndarray of q values for the computation. All values
must be in the range 0 <= q <= 1
axis (None, int scalar, or tuple of int scalars) – the axis or axes along which the quantiles are computed. The default is None,
which computes the quantile along a flattened version of the array.
method (string) – one of "inverted_cdf", "averaged_inverted_cdf", "closest_observation",
"interpolated_inverted_cdf", "hazen", "weibull", "linear", "median_unbiased",
"normal_unbiased", "lower", "higher", "midpoint"
keepdims (bool) – True if the degenerate axes are to be retained after slicing, False if not
Returns:
If q is a scalar and axis is None, the result is a scalar.
If q is a scalar and axis is supplied, the result is a pdarray of rank len(axis)
less than the rank of a.
If q is an array and axis is None, the result is a pdarray of shape q.shape
If q is an array and axis is supplied, the result is a pdarray of rank q.ndim +
pda.ndim - len(axis). However, there is an intermediate result which is of rank
q.ndim + pda.ndim. If this is not in the compiled ranks, an error will be thrown
even if the final result would be in the compiled ranks.
np.quantile also supports the method "nearest"; however, its behavior does not match
the numpy documentation, so it's not supported here.
np.quantile also allows for weighted inputs, but only for the method "inverted_cdf".
That also is not supported here.
ValueError – Raised if scalar q or any value of array q is outside the range [0,1]
Raised if the method is not one of the 12 supported methods.
Raised if the result would have a rank not in the compiled ranks.
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be converted from radians to degrees. Elsewhere, it will retain its
original value. Default set to True.
Returns:
A pdarray containing an angle converted to degrees, from radians, for each element
of the original pdarray
TypeError – Raised if logmean is neither a float nor a int, logstd is not a float,
seed is not an int, size is not an int, or if characters is not a str
The lengths of the generated strings are distributed \(Lognormal(\mu, \sigma^2)\),
with \(\mu = logmean\) and \(\sigma = logstd\). Thus, the strings will
have an average length of \(\exp(\mu + 0.5\sigma^2)\), a minimum length of
zero, and a heavy tail towards longer strings.
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the
offsets/segments array on the server versus loading them from HDF5 files.
In the future this option may be set to True as the default.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Ignored if datasets is not None
Parquet Files only.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns
that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the
length of each string is known at runtime. This can allow for skipping byte
calculation, which can have an impact on performance.
Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
If filenames is a string, it is interpreted as a shell expression
(a single filename is a valid expression, so it will work) and is
expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to
the server as independent sequential strings while if iterative == False
all dataset names and file names are passed to the server in a single
string.
If datasets is None, infer the names of datasets from the first file
and read all of them. Use get_datasets to show the names of datasets
to HDF5/Parquet files.
CSV files without the Arkouda Header are not supported.
Examples
>>> import arkouda as ak

Read with file extension

>>> x = ak.read('path/name_prefix.h5')  # load HDF5 - processing determines file type, not extension

Read without file extension

>>> x = ak.read('path/name_prefix.parquet')  # load Parquet

Read glob expression

>>> x = ak.read('path/name_prefix*')  # reads HDF5
If more than one dataset is found, the objects
will be returned in a dictionary mapping the dataset name to the Arkouda object
containing the data. If the file contains the appropriately formatted header, typed
data will be returned. Otherwise, all data will be returned as a Strings object.
Parameters:
filenames (str or List[str]) – The filenames to read data from
datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.
column_delim (str) – The delimiter for column names and data. Defaults to “,”.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the
offsets/segments array on the server versus loading them from HDF5 files.
In the future this option may be set to True as the default.
tagData (bool) – Default False, if True tag the data with the code associated with the filename
that the data was pulled from.
Returns:
Dictionary of {datasetName: pdarray, String, SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
Raises:
ValueError – Raised if all datasets are not present in all hdf5 files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression
(a single filename is a valid expression, so it will work) and is
expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to
the server as independent sequential strings while if iterative == False
all dataset names and file names are passed to the server in a single
string.
If datasets is None, infer the names of datasets from the first file
and read all of them. Use get_datasets to show the names of datasets
to HDF5 files.
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
tagData (bool) – Default False, if True tag the data with the code associated with the filename
that the data was pulled from.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
If datasets is not None, this will be ignored.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns
that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the
length of each string is known at runtime. This can allow for skipping byte
calculation, which can have an impact on performance.
Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression
(a single filename is a valid expression, so it will work) and is
expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to
the server as independent sequential strings while if iterative == False
all dataset names and file names are passed to the server in a single
string.
If datasets is None, infer the names of datasets from the first file
and read all of them. Use get_datasets to show the names of datasets
to Parquet files.
Parquet always recomputes offsets at this time.
This will need to be updated once the Parquet workflow is updated.
Read datasets from files and tag each record to the file it was read from.
File Type is determined automatically.
Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the
offsets/segments array on the server versus loading them from HDF5 files.
In the future this option may be set to True as the default.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Ignored if datasets is not None
Parquet Files only.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns
that contain null values.
Notes
Not currently supported for Categorical or GroupBy datasets
Examples
>>> import arkouda as ak

Read files and return data with tagging corresponding to the Categorical returned.
cat.codes will link the codes in data to the filename. Data will contain the code Filename_Codes.

>>> data, cat = ak.read_tagged_data('path/name')  # doctest: +SKIP
>>> data  # doctest: +SKIP
{'Filename_Codes': array([0 3 6 9 12]), 'col_name': array([0 0 0 1])}
Supports multi-dimensional pdarrays of numeric types.
To use this function, ensure you have installed the blosc dependency (make install-blosc)
and have included ZarrMsg.chpl in the ServerModules.cfg file.
Parameters:
store_path (str) – The path to the Zarr store. The path must be to a directory that contains a .zarray
file containing the Zarr store metadata.
ndim (int) – The number of dimensions in the array
hostname (str) – The hostname of the pdarray that sent the array
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
pdarray.transfer().
Returns:
The pdarray sent from the sending server to the current
receiving server.
hostname (str) – The hostname of the dataframe that sent the array
port (int_scalars) – The port to send the dataframe over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
pdarray.send_array().
Returns:
The dataframe sent from the sending server to the
current receiving server.
This function iterates through the dictionary data, registering each object
with its corresponding name. It is useful for batch registering multiple
objects in Arkouda.
Parameters:
data (dict) – A dictionary that maps the name to register the object to the object itself.
For example, {"MyArray": ak.array([0, 1, 2])}.
After calling this function, "array1" and "array2" are registered
in Arkouda, and can be accessed by their names.
>>> ak.unregister_all(["array1", "array2"])
Unlike other save/load methods using snapshot restore will save DataFrames alongside other
objects in HDF5. Thus, they are returned within the dictionary as a dataframe.
Determine the promoted result dtype of inputs, including support for Arkouda’s bigint.
Determine the result dtype that would be returned by a NumPy-like operation
on the provided input arguments, accounting for Arkouda’s extended types
such as ak.bigint.
This function mimics numpy.result_type, with support for Arkouda types.
Parameters:
*args (Union[pdarray, np.dtype, type]) – One or more input objects. These can be NumPy arrays, dtypes, Python
scalar types, or Arkouda pdarrays.
Returns:
The dtype (or equivalent Arkouda type) that results from applying
type promotion rules to the inputs.
Records some metadata about the server, and saves
all pdarrays into parquet files.
Parameters:
name (str) – Name of the checkpoint. The default will be the server session ID, which
is typically in format id_<hash>_. A directory will be created in
path with this name.
path (str) – The directory to save the checkpoint. If the directory doesn’t exist, it
will be created. If it exists, a new directory for the checkpoint
instance will be created inside this directory.
mode ({'overwrite' | 'preserve_previous' | 'error'}) – By default, overwrite the checkpoint files if they exist.
If ‘preserve_previous’, an existing checkpoint with ‘name’ will be
renamed to ‘name.prev’, overwriting ‘name.prev’ if it existed,
before creating a new checkpoint with ‘name’.
If ‘error’, an error will be raised if a checkpoint with the same name
exists.
Notes
Only ``pdarray``s are saved. Other data structures will not be recorded. We
expect to expand the coverage in the future.
Returns:
The checkpoint name, which will be the same as the name argument if
it was passed.
D.update([E, ]**F) -> None. Update D from mapping/iterable E and F.
If E is present and has a .keys() method, then does: for k in E.keys(): D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
Return the index of the best interval containing each query value.
Given an array of query vals and non-overlapping intervals, return
the index of the best (see tiebreak) interval containing each query value,
or -1 if not present in any interval.
Parameters:
vals ((sequence of) pdarray(int, uint, float)) – Values to search for in intervals. If multiple arrays, each “row” is an item.
intervals (2-tuple of (sequences of) pdarrays) – Non-overlapping, half-open intervals, as a tuple of
(lower_bounds_inclusive, upper_bounds_exclusive)
Must have same dtype(s) as vals.
tiebreak ((optional) pdarray, numeric) – When a value is present in more than one interval, the interval with the
lowest tiebreak value will be chosen. If no tiebreak is given, the
first containing interval will be chosen.
hierarchical (boolean) – When True, sequences of pdarrays will be treated as components specifying
a single dimension (i.e. hierarchical)
When False, sequences of pdarrays will be specifying multi-dimensional intervals
Returns:
idx – Index of interval containing each query value, or -1 if not found
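A minimal sketch, assuming the function described here is ak.search_intervals (the entry-point name is an assumption; outputs are illustrative):

>>> import arkouda as ak
>>> vals = ak.array([2, 12, 7])
>>> lower = ak.array([0, 10])
>>> upper = ak.array([4, 14])
>>> ak.search_intervals(vals, (lower, upper))  # doctest: +SKIP
array([0 1 -1])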
Find indices where elements should be inserted to maintain order.
Find the indices into a sorted array a such that, if the corresponding
elements in v were inserted before the indices, the order of a would be preserved.
Parameters:
a (pdarray) – 1-D input array. Must be sorted in ascending order. sorter is not currently supported.
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing sin for each element
of the original pdarray
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the hyperbolic sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing hyperbolic sine for each element
of the original pdarray
Takes the square root of array. If where is given, the operation will only take place in
the positions where the where condition is True.
Parameters:
pda (pdarray) – A pdarray of values the square roots of which will be computed
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be square rooted. Elsewhere, it will retain its original value.
Default set to True.
Returns:
a pdarray of square roots of the original values, or the original values themselves,
subject to the boolean where condition.
When axis is not None, this function does the same thing as “fancy” indexing (indexing arrays
using arrays); however, it can be easier to use if you need elements along a given axis.
A call such as np.take(arr, indices, axis=3) is equivalent to arr[:,:,:,indices,...].
Parameters:
a (pdarray) – The array from which to take elements
indices (numeric_scalars or pdarray) – The indices of the values to extract. Also allow scalars for indices.
axis (int, optional) – The axis over which to select values. By default, the flattened input array is used.
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the tangent will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing tangent for each element
of the original pdarray
where (bool or pdarray, default=True) – This condition is broadcast over the input. At locations where the condition is True,
the hyperbolic tangent will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing hyperbolic tangent for each element
of the original pdarray
Construct an array by repeating A the number of times given by reps.
If reps has length d, the result will have dimension max(d, A.ndim).
If A.ndim < d, A is promoted to be d-dimensional by prepending new axes. So a shape (3,) array is promoted to (1, 3) for 2-D replication, or shape (1, 1, 3) for 3-D replication. If this is not the desired behavior, promote A to d dimensions manually before calling this function.
If A.ndim > d, reps is promoted to A.ndim by prepending 1s to it. Thus for an A of shape (2, 3, 4, 5), a reps of (2, 2) is treated as (1, 1, 2, 2).
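A minimal sketch, assuming tile mirrors numpy.tile:
>>> import arkouda as ak
>>> a = ak.array([0, 1, 2])
>>> ak.tile(a, 2)       # [0, 1, 2, 0, 1, 2]
>>> ak.tile(a, (2, 2))  # 2-D result of shape (2, 6)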
Return a fixed frequency TimedeltaIndex, with day as the default
frequency. Alias for ak.Timedelta(pd.timedelta_range(args)).
Subject to size limit imposed by client.maxTransferBytes.
Parameters:
start (str or timedelta-like, default None) – Left bound for generating timedeltas.
end (str or timedelta-like, default None) – Right bound for generating timedeltas.
periods (int, default None) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.
name (str, default None) – Name of the resulting TimedeltaIndex.
closed (str, default None) – Make the interval closed with respect to the given frequency to
the ‘left’, ‘right’, or both sides (None).
Returns:
rng
Return type:
TimedeltaIndex
Notes
Of the four parameters start, end, periods, and freq,
exactly three must be specified. If freq is omitted, the resulting
TimedeltaIndex will have periods linearly spaced elements between
start and end (closed on both sides).
To learn more about frequency strings, see the pandas documentation on offset aliases.
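A short sketch, following the pandas semantics this function wraps:
>>> import arkouda as ak
>>> ak.timedelta_range(start="1 day", periods=4)  # 1, 2, 3, 4 days at the default daily frequency
>>> ak.timedelta_range(start="1 day", end="2 days", freq="6H")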
All CSV Files written by Arkouda
include a header denoting data types of the columns.
Parameters:
columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None
the keys of the mapping will be used as the dataset names.
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
names (List[str] (Optional)) – Names of datasets to be written. Order should correspond to the order of data
provided in columns.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Raises:
ValueError – Raised if any datasets are present in all csv files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
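A minimal sketch of a CSV write using the parameters above (the output paths are hypothetical):
>>> import arkouda as ak
>>> a = ak.arange(5)
>>> ak.to_csv({"a": a}, "/tmp/ak_csv/data")  # mapping keys become dataset names
>>> ak.to_csv([a], "/tmp/ak_csv/data2", names=["a"], overwrite=True)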
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist.
If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Defaults to "distribute".
"single" writes the dataset to a single file;
"distribute" writes the dataset to a file per locale.
Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode
is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Creates one file per locale containing that locale’s chunk of each pdarray.
If columns is a dictionary, the keys are used as the HDF5 dataset names.
Otherwise, if no names are supplied, 0-up integers are used. By default,
any existing files at path_prefix will be overwritten, unless the user
specifies the ‘append’ mode, in which case arkouda will attempt to add
<columns> as new datasets to existing files. If the wrong number of files
is present or dataset names already exist, a RuntimeError is raised.
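A minimal sketch of an HDF5 write using the parameters above (the output path is hypothetical):
>>> import arkouda as ak
>>> ak.to_hdf({"a": ak.arange(10)}, "/tmp/ak_hdf/data")                 # truncate by default
>>> ak.to_hdf({"b": ak.arange(10)}, "/tmp/ak_hdf/data", mode="append")  # add a dataset to existing files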
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist.
If ‘append’, attempt to create new dataset in existing files.
‘append’ is deprecated; please use the multi-column write instead.
compression (str, optional) – Default None. Provide the compression type to use when writing the file.
Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool) – Defaults to False. Parquet requires all columns to be the same size and Categoricals
don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.
Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode
is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Creates one file per locale containing that locale’s chunk of each pdarray.
If columns is a dictionary, the keys are used as the Parquet column names.
Otherwise, if no names are supplied, 0-up integers are used. By default,
any existing files at path_prefix will be deleted
(regardless of whether they would be overwritten), unless the user
specifies the ‘append’ mode, in which case arkouda will attempt to add
<columns> as new datasets to existing files. If the wrong number of files
is present or dataset names already exist, a RuntimeError is raised.
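A minimal sketch of a Parquet write using the parameters above (the output path is hypothetical):
>>> import arkouda as ak
>>> ak.to_parquet({"a": ak.arange(10)}, "/tmp/ak_pq/data", compression="snappy")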
Supports multi-dimensional pdarrays of numeric types.
To use this function, ensure you have installed the blosc dependency (make install-blosc)
and have included ZarrMsg.chpl in the ServerModules.cfg file.
Parameters:
store_path (str) – The path at which Zarr store should be written
chunk_shape (tuple) – The shape of the chunks to be used in the Zarr store
Raises:
ValueError – Raised if the number of dimensions in the chunk shape does not match
the number of dimensions in the array or if the array is not a 32 or 64 bit numeric type
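A heavily hedged sketch; the position of the array argument is an assumption, since only store_path and chunk_shape appear in the parameter list above:
>>> import arkouda as ak
>>> a = ak.arange(100)
>>> ak.to_zarr("/tmp/ak_zarr/store", a, chunk_shape=(10,))  # assumed call shape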
axes (Tuple[int, ...], optional, defaults to None) – If specified, must be a tuple which contains a permutation of the axes of pda.
Returns:
The transpose of the input array
For a 1-D array, this is the original array.
For a 2-D array, this is the standard matrix transpose.
For an n-D array, if axes are given, their order indicates how the axes are permuted.
If axes is None, the axes are reversed.
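A minimal sketch, assuming a server build with multi-dimensional array support:
>>> import arkouda as ak
>>> a = ak.array([[1, 2], [3, 4]])
>>> ak.transpose(a)               # [[1, 3], [2, 4]]
>>> ak.transpose(a, axes=(1, 0))  # explicit permutation; same result for 2-D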
Returns the unique elements of an array, sorted if the values are integers.
There is an optional output in addition to the unique elements: the number
of times each unique value comes up in the input array.
Raises:
TypeError – Raised if pda is not a pdarray or Strings object
RuntimeError – Raised if the pdarray or Strings dtype is unsupported
Notes
For integer arrays, this function checks to see whether pda is sorted
and, if so, whether it is already unique. This step can save considerable
computation. Otherwise, this function will sort pda.
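A short sketch of unique on an integer pdarray:
>>> import arkouda as ak
>>> a = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(a)  # sorted unique values: [1, 2, 3]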
This function sends a request to unregister the Arkouda object associated
with the specified name. It returns a response message indicating the
success or failure of the operation.
Parameters:
name (str) – The name of the object to unregister.
Returns:
A message indicating the result of the unregister operation.
Return type:
str
Raises:
RuntimeError – If the object associated with the given name does not exist or cannot
be unregistered.
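A minimal sketch, assuming an object was previously registered under the name "array1":
>>> import arkouda as ak
>>> ak.unregister("array1")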
In the example below, "array1" and "array2" were previously registered with the Arkouda
server and can be accessed by their names; unregister_all removes both registrations.
>>> ak.unregister_all(["array1", "array2"])
Overwrite the datasets whose names appear in names, or in the keys of columns if columns is a dictionary.
Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting repack to False will yield better performance, but will cause
file sizes to expand.
Raises:
RuntimeError – Raised if a server-side error is thrown saving the datasets
Notes
If the file does not contain a File_Format attribute indicating how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the datasets provided do not exist, they will be added.
Because HDF5 deletes do not release memory, this will create a copy of the
file with the new data.
This workflow is slightly different from to_hdf in order to avoid reading and
creating a copy of the file for each dataset.
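A minimal sketch of an in-place dataset overwrite (the path is hypothetical):
>>> import arkouda as ak
>>> ak.update_hdf({"a": ak.arange(10)}, "/tmp/ak_hdf/data", repack=True)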
This function differs from histogram() in that it only returns
counts for values that are present, leaving out empty “bins”. This
function delegates all logic to the unique() method where the
return_counts parameter is set to True.
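A short sketch:
>>> import arkouda as ak
>>> A = ak.array([2, 0, 2, 4, 0, 0])
>>> ak.value_counts(A)  # unique values [0, 2, 4] with counts [3, 2, 1]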
Create a new structured or unstructured void scalar.
Parameters:
length_or_data (int, array-like, bytes-like, object) – One of multiple meanings (see notes). The length or
bytes data of an unstructured void. Or alternatively,
the data to be stored in the new scalar when dtype
is provided.
This can be an array-like, in which case an array may
be returned.
dtype (dtype, optional) – If provided, the dtype of the new scalar. This dtype must
be a "void" dtype (i.e. a structured or unstructured void;
see also defining-structured-types).
Added in version 1.24.
For historical reasons and because void scalars can represent both
arbitrary byte data and structured dtypes, the void constructor
has three calling conventions:
np.void(5) creates a dtype="V5" scalar filled with five
\0 bytes. The 5 can be a Python or NumPy integer.
np.void(b"bytes-like") creates a void scalar from the byte string.
The dtype itemsize will match the byte string length, here "V10".
When a dtype= is passed the call is roughly the same as an
array creation. However, a void scalar rather than array is returned.
Please see the examples which show all three different conventions.
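For reference, the three calling conventions look like this in plain NumPy:
>>> import numpy as np
>>> np.void(5)                       # five \0 bytes, dtype "V5"
>>> np.void(b"bytes-like")           # from a byte string; itemsize matches, dtype "V10"
>>> np.void((1.5, 2), dtype="d,i4")  # structured scalar, analogous to array creation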
This is equivalent to concatenation along the first axis after
1-D arrays of shape (N,) have been reshaped to (1,N). Rebuilds arrays divided by vsplit.
This function makes most sense for arrays with up to 3 dimensions.
For instance, for pixel-data with a height (first axis), width (second axis),
and r/g/b channels (third axis). The functions concatenate, stack and block
provide more general stacking and concatenation operations.
Parameters:
tup (sequence of pdarray) – The arrays must have the same shape along all but the first axis. 1-D arrays
must have the same length. In the case of a single array_like input, it will be
treated as a sequence of arrays; i.e., each element along the zeroth axis is treated
as a separate array.
dtype (str or type, optional) – If provided, the destination array will have this dtype.
casting ({"no", "equiv", "safe", "same_kind", "unsafe"}, optional) – Controls what kind of data casting may occur. Defaults to "same_kind". Currently unused.
Returns:
The array formed by stacking the given arrays, will be at least 2-D.
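A minimal sketch of stacking two 1-D pdarrays row-wise:
>>> import arkouda as ak
>>> a = ak.array([1, 2, 3])
>>> b = ak.array([4, 5, 6])
>>> ak.vstack((a, b))  # 2-D result of shape (2, 3)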
Return an array with elements chosen from A and B based upon a
conditioning array. As is the case with numpy.where, the return array
consists of values from the first array (A) where the conditioning array
elements are True and from the second array (B) where the conditioning
array elements are False.
Parameters:
condition (pdarray) – Used to choose values from A or B
Raises:
TypeError – Raised if the condition object is not a pdarray; if A or B is not
an int, np.int64, float, np.float64, bool, pdarray, str, Strings, or Categorical;
if pdarray dtypes are not supported or do not match; or if multiple
condition clauses (see Notes section) are applied
ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal
Notes
A and B must have the same dtype, and only one conditional clause is
supported (e.g. n < 5). A compound clause such as n < 5, n > 1, which
is supported in numpy, is not currently supported in Arkouda.
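A short sketch using a single conditional clause:
>>> import arkouda as ak
>>> n = ak.arange(5)
>>> ak.where(n < 3, n, -1)  # [0, 1, 2, -1, -1]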
log_msg (str) – The message to be added to the server log
tag (str) – The tag to use in the log. This takes the place of the server function name.
Allows for easy identification of custom logs.
Defaults to “ClientGeneratedLog”
log_lvl (LogLevel) – The type of log to be written
Defaults to LogLevel.INFO
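A heavily hedged sketch; the function name write_log is an assumption inferred from the parameters above:
>>> import arkouda as ak
>>> ak.write_log("custom checkpoint reached", tag="MyPipeline")  # assumed name; log_lvl defaults to LogLevel.INFO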
dtype (all_scalars) – Type of resulting array, default ak.float64
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
Included for consistency, as zeros are represented as all zeros, regardless
of the value of max_bits
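A short sketch of zeros with an explicit dtype:
>>> import arkouda as ak
>>> ak.zeros(5, dtype=ak.int64)  # array([0, 0, 0, 0, 0])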